The over-representation of binary DNA tracts in seven sequenced chromosomes

Yagil, Gad

doi:10.1186/1471-2164-5-19

Research article
Open access
Published: 03 March 2004

The over-representation of binary DNA tracts in seven sequenced chromosomes

Gad Yagil¹

BMC Genomics volume 5, Article number: 19 (2004) Cite this article

4557 Accesses
11 Citations
Metrics details

Abstract

Background

DNA tracts composed of only two bases are possible in six combinations: A+G (purines, R), C+T (pyrimidines, Y), G+T (Keto, K), A+C (Imino, M), A+T (Weak, W) and G+C (Strong, S). It is long known that all-pyrimidine tracts, complemented by all-purines tracts ("R.Y tracts"), are excessively present in analyzed DNA. We have previously shown that R.Y tracts are in vast excess in yeast promoters, and brought evidence for their role in gene regulation. Here we report the systematic mapping of all six binary combinations on the level of complete sequenced chromosomes, as well as in their different subregions.

Results

DNA tracts composed of the above binary base combinations have been mapped in seven sequenced chromosomes: Human chromosomes 21 and 22 (the major contigs); Drosophila melanogaster chr. 2R; Caenorhabditis elegans chr. I; Arabidopsis thaliana chr. II; Saccharomyces cerevisiae chr. IV and M. jannaschii. A huge over-representation, reaching million-folds, has been found for very long tracts of all binary motifs except S, in each of the seven organisms. Long R.Y tracts are the most excessive, except in D. melanogaster, where the K.M motif predominates. S (G, C rich) tracts are in excess mainly in CpG islands; the W motif predominates in bacteria. Many excessively long W tracts are nevertheless found also in the archeon and in the eukaryotes. The survey of complete chromosomes enables us, for the first time, to map systematically the intergenic regions. In human and other chromosomes we find the highest over-representation of the binary DNA tracts in the intergenic regions. These over-representations are only partly explainable by the presence of interspersed elements.

Conclusions

The over-representation of long DNA tracts composed of five of the above motifs is the largest deviation from randomness so far established for DNA, and this in a wide range of eukaryotic and archeal chromosomes. A propensity for ready DNA unwinding is proposed as the functional role, explaining the evolutionary conservation of the huge excesses observed.

Background

In 1952, Erwin Chargaff published a paper in which he brought evidence that runs of pyrimidines are highly over-represented in eukaryotic DNA [1]. DNA was "depurinated" in formic acid and the remaining pyrimidines were subsequently size separated by the then novel technique of paper chromatography [2], see also [3]. An unexpectedly high number of pyrimidine and purine tracts ("isostichs"), 9 bases and higher, was found in human, calf, salmon and rye DNA [4, 5]. These findings were subsequently corroborated by a number of techniques, incl. molecular hybridization rates [6–8]. The over-representation of long purine and pyrimidine runs could be exactly analyzed when sequences of many genes became available [9, 10]. The phenomenon discovered by Chargaff turned out to be a very significant one – over-representation of the longer tracts reaches values of many ten-folds, as will be demonstrated on a genome-wide basis in this paper. Homopurine (R) and homopyrimidine (Y) tracts will be referred to jointly as "R.Y tracts", because whenever a run of pyrimidines is present on one strand, it is complemented by a run of purines on the opposite strand (the dot separates complementary strands, in accordance with IUBMB rules). It should be stressed that alternating A and G (poly A-G) are only one component of R tracts, and any combination of A's and G's an make an R tract – see Additional file: 7.

Examining increasing number of genes revealed that R.Y tracts are not the only over-represented binary DNA motif. Three additional combinations of two bases are possible [11], namely: A, T only ("W tracts"); G, C only ("S tracts"), and tracts which are G, T on one strand complemented by A, C on the opposite strand (jointly: "K.M tracts"). The S tracts, found in high concentrations in certain regions, are well studied as CpG islands [12]. The abundance of these combinations was previously established in an assortment of mammalian genes [13] and in a yeast chromosome [14]. In bacteria, the W motif, rather than the R.Y motif, was found to be the predominating binary motif [15, 16].

In this paper, we shall map the occurrence of binary tracts in seven recently sequenced chromosomes, representing the major currently studied eukaryotic and archeal phyla (previous studies encompassed mainly incidentally selected gene regions). These chromosomes, especially the human and plant ones, also represent a large selection of intergenic regions not previously mapped. It will be shown that the huge over-representation is prevalent in all the selected chromosomes, in particular in their intronic and intergenic subregions. A functional significance of this remarkable departure of real DNA from random DNA has yet to be established. We have previously suggested, based on our experimental findings [17], that a DNA unwinding role, necessary for initiation of transcription, replication and other DNA directed processes, could be involved, as will be detailed in the Disscusion.

Results

R.Y tracts in chromosome 22

The chromosomes selected and their basic data are given in Table 1. Program TRACTS was applied to map the occurrence of binary DNA tracts in these chromosomes (See methods). The occurrence of R.Y tracts of different lengths in "contig 23", the main contig of human chromosome 22 (66.6% of the chromosome) is shown in Table 2. In columns 2 and 3 of the table, the number of R and Y tracts of each length found in the GenBank-listed strand is listed. Opposite each Y tract there is of course an R tract, and vice versa. The number of R tracts of each length can be seen to be roughly equal to the number Y tracts. This justifies the joint consideration of the R and Y tracts as a pair (R.Y) at this stage.

Table 1

Full size table

Table 2 R.Y Tracts in Contig "23" of Chromosome 22 (22,998,450 nt)

Full size table

Every tract length up to 78 nt is represented, and many longer tracts are present. The longest tract found is a 367 nt long, an R tract (second column). In column 5, the number of R.Y tracts that are expected in random DNA of the same length and base composition as the analyzed contig is shown (see methods). It is seen that the number of tracts expected decreases much more rapidly than the number of tracts observed (column 4). In fact, for all tracts longer than 23 nt not even a single tract is expected in randomized DNA (see column 5), while 644 such tracts are found at that length alone! (column 4). This enormous over-representation certainly calls for a biological explanation. The extent of over-representation is listed in column 9, which gives the ratio between the number of tracts (or bases) observed, to the number of tracts (or bases) expected in random DNA. This ratio is below unity for the first three rows, namely for single pyrimidines (purines) flanked by two purines (pyrimidines), their doublets and triplets. The low ratio for the short tracts compensates for the over-representation of the longer tracts, which increases steadily up to enormous figures for the higher lengths (column 9). The increase in ratios is relatively smooth, as can also be seen in Fig. 1a, which indicates that a property special to a particular length or length group is not responsible for the high excesses found. We shall use the found/expected ratio values ("f/e ratios") as the main measure for the extent of binary tract over-representation in the coming Tables. In the last column, the f/e ratio is listed for all tracts longer or equal (also "Greater or Equal", or "GE") than the length given in the first column (calculated as GE bases found divided by GE bases expected, eq. (4)). The GE value is more meaningful for the longer tract lengths, when only few tracts are encountered, so that individual f/e ratios lose their significance.

R.Y tracts in seven chromosomes

Tables similar to Table 2 have been constructed for a series of other genomes as well as for the other binary DNA motifs. The Data for human chr. 21 and the Drosophila chromosome are shown in Figs. 1b and 1c as well as in the Additional files: 1 and 2, also at the authors web site http://www.weizmann.ac.il/~lcyagil. The found/expected ratio values (f/e ratios) will be shown in most following tables, as the criterion for over-representation.

Table 3 gives the f/e ratios for R.Y tracts in seven chromosomes selected from sequenced genomes across the eukaryotic and archeal kingdoms. The major characteristics of the selected chromosomes have been listed in Table 1. In the last column of Table 3 a control run is shown – five random 1 Mb DNA sequences were generated and run by TRACTS, as an additional verification of the analytical formula used to calculate the expected values and f/e ratios (see methods). It can be seen that all the f/e ratios, except the longest, are close to unity. No R.Y tract longer than 21 nt was found, as it should be. (a 24 nt K.M tract was found, see below). The standard deviations are less than 5% up to 13 nt (Additional file: 4), when found tracts begin to be few. A larger SD is indeed expected for the longest tracts, because for example, for a 19 nt tract, only 19 or 38 bases, or occasionally 57 bases, are possible for that length. The detailed data can be found in Additional file: 4. The control runs thus confirm that tracts much longer than 21 nt cannot be expected in randomly composed DNA.

Table 3 R.Y tracts in selected chromosomes. The numbers give for each length the number of found tracts divided by the number expected tracts, eq (1).

Full size table

The major conclusion from Table 3 is that the longer R.Y tracts are highly over-represented, up to extreme values, in all the seven genomes examined. In contrast, tracts of lengths up to three nt are under-represented in all the phyla studied, as already described for chromosome 22. The longest tracts found in each species are roughly related to the length of the input DNA: From 50 nt for the 1.65 Mb M. jannaschii, to 55 nt for yeast chromosome IV (the longest yeast chromosome), to 161 for the elegans chromosome (14.7 Mb); 194 nt for the 20 Mb of Arabidopsis, and up to 367; 568 nt for the two human contigs. The one exception is the Drosophila half chromosome (20 Mb) where the longest tract is just 70 nt. It will be seen that the Drosophila chromosome is exceptional in other respects as well. A correlation between the size of the longest tract up to which every length is present, with the length of the input DNA sequence is also observed (Table 3): 31 nt for jannaschii, 33 for yeast, 46 nt for elegans, 50 for Arabidopsis, 78 and 98 nt for the human contigs. Again, 39 nt for Drosophila is an outlier. The lesser over-representation in Drosophila is also evident when individual numbers are compared to the other organisms; the highest excesses of long tracts are clearly in the two human contigs. The overall result is that the two human chromosomes exhibit the highest over-representation, with most other chromosomes not far behind; the short M. jannaschii leads occasionally for 7–11 NT tracts. These conclusions can also be seen in Fig. 2a. Between the two human chromosomes, the gene poor chromosome 21 takes the lead.

K.M tracts in seven chromosomes

It was noted earlier that not only R.Y tracts are over-represented, but, at first a bit counter-intuitively, the other three binary DNA combinations as well. Thus, K.M tracts were found in large excess in the human β globin complex and in organelle DNA [13], as well as in yeast chromosome 3 [14]. The data in Tables 4 and 5 show that these findings can be extended to the wider range of phyla studied here. In Table 4, the f/e ratios for K.M tracts are shown. As for the R.Y pair, the detailed outputs for each chromosome (see Additional files: 1, 2 and 3) show that roughly equal numbers of K tracts and M tracts are present in the analyzed strand, and justifies their joint consideration. Overall, it is clear that K.M tracts are also highly over-represented, in all seven chromosomes, even if to a lesser extent than the R.Y tracts. In humans, contig 23 of chromosome 22 shows the highest over-representations but beyond 67 nt many lengths are missing, the longest tract being just 91 nt long. In contig 28 many K.M lengths beyond 62 nt are missing; there are only two K.M tracts longer than 100 nt (101 nt, 268 nt). The f/e ratios for K.M tracts are sometimes even higher than for R.Y tracts (in chr. 21 there are 9 cases between 32 and 51 nt and 5 cases in chr. 22). Beyond 52 nt, f/e ratios are always higher for R.Y than for K.M tracts.

Table 4 K.M Tracts in selected chromosomes. Numbers are f/e ratios, i.e. number of tracts found at each length, divided by the number expected, eq.(1)

Full size table

Table 5 W Tracts in Selected Chromosomes. The numbers give for each length the number of found tracts divided by the number expected tracts, eq (1).

Full size table

The interesting genome is again Drosophila: Here the over-representation of K.M tracts is eventually 2–3 times higher than for the R.Y tracts (Fig. 1c) and is sometimes higher than in the human chromosomes (between 10–20 nt it is as high as in chr. 21 and not much lower at other lengths, Table 4). Whatever the function of the binary tracts may be, in Drosophila that function seems to be taken over, at least partly, by the K.M tracts. All K.M tract lengths are represented up to 45 nt, the longest K.M tract being just 74 nt. In Arabidopsis the K.M tracts are again in high excess, but to a lesser extent than the R.Y motif – there are only two tracts longer than 48 nt (50 and 58 nt). The excess of K.M tracts in elegans and in yeast is less by an order of magnitude compared to humans (Fig. 2b; except for the yeast telomere), and is marginal but still significant in the archeon. Control runs with the same 5 × 1 Mb random sequences, but for the K.M motif, remain close to unity as expected (Table 4); the longest tract in this case is 24 nt long, present in a single random 1 Mb sequence.

W and S tracts in the seven chromosomes

W and S tracts are autocomplementary each, rather than complementing one another. W and S tracts are therefore separately compared. The f/e ratios for W tracts are shown in Table 5 and Fig. 2c. It is seen immediately that W tracts are also over-represented, but to a more variable degree than R.Y or K.M tracts. A difference of more than 100 fold is evident between the two human chromosomes for W tracts longer than 32 nt: At that length, f/e = 18,990 in contig 23, vs. f/e = 227 in contig 28. This large difference is partly due to the sensitivity of the calculated value to the percentage of AT, which is 60.9% in contig 28 vs. 52.6% in contig 23 (%AC and %AG are always close to 50%, "the second Chargaff parity rule", see end of discussion). A far higher number of W tracts are thus expected in chr. 21 by eq. (1), simply due to different p and q values. In addition, the 60.9% AT of contig 28 is an average between a very gene poor half with a high %AT (~64% between coordinates 0–7 Mb, see Additional file: 8) and a gene richer half with 56% AT (towards the telomere of the chromosome). The actual f/e ratio in the gene rich domain is much closer to that of contig 23. In yeast, Arabidopsis and jannaschii (68.5 %AT!), W tracts are under-represented up to 15 nt, but then are increasingly over-represented, reaching an excess of hundred-folds for 30–40 nt tracts. The C. elegans chromosome contains few very long W tracts, up to 96 nt. Again – the relatively low excess of W is partly due to the high percent AT. The actual number of tracts, not f/e ratios, is closer to that of the R.Y or K.M motifs (Additional files: 1, 2 and 3). It should be added that the high % AT can be explained only very partly by the mere presence of many long W tracts, because more than 89% of the A's and T's reside in the majority of short, under-represented tracts, up to 10 nt; a certain compensation may be in place for strict quantitative comparison. Still, it can be concluded that the W motif in eukaryotes is also an extensively over-represented binary motif, in similarity to the situation in bacteria [15].

Finally, S tracts. There are many fewer long S tracts in all the chromosomes studied (data in Additional file: 5). S tracts are often concentrated near transcription start site, as part of the well studied CpG Islands [12, 25]. Thus, in contig 23 (47.4% G,C) only five S tracts longer than 37 nt are found (56 the longest). Nevertheless, in the 12 – 37 nt range, over-representations increase from 1.12 up to 480,000 fold. In Arabidopsis (only 35.9% G,C) the longest S tract is 20 nt, but over-representation still increases steadily up 200 fold, at length 20. S tracts can thus be considered as another member of the over-represented class. Program TRACTS can be a convenient tool for detecting the CpG islands, espscially in its web version [26].

Distribution in genic subregions

In which genic subregions do the excessive tracts reside? Subprogram ANEX distributes the tracts between exon, intron and intercoding or intergenic classes. The term intergenic is appropriate when mRNA entries are parsed; in that case, UTR regions are evaluated as exons. The distribution between exons, introns and intergenic of all tracts 15 nt and longer (GE 15) is shown in Table 6. W and S tracts are presented here as a pair, but since S tracts are minority for tracts GE 15, the f/e ratios represent practically the W tracts alone. Very high over-representations are again evident: Over-representations is highest for R.Y tracts in all genomes surveyed, except for Drosophila, where K.M tracts are in the largest excess.

Table 6 Binary tracts longer or equal to 15 nt, in 7 genomes. Ratio of found to expected tracts.

Full size table

The f/e ratios are lowest in coding regions (exons), with the exception of R.Y in M. jannaschii, (which is 87% coding, see column 7 of Table 1), and of W tracts in elegans. The lower excess in exons can be expected, since, for instance, an oligopurine on the coding strand imposes on the coded protein mostly polar amino acids (all-purine codons code for lys, arg, gln, also for gly).

Introns are the subregion in which K.M tracts are the most excessive, except for elegans, and jannaschii. In the fly introns have more excessive K.M tracts than R.Y tracts. The introns are the subregion richest in R.Y tracts in the fly, elegans and chr. 21 by the criterion used (≤ 15 nt). The well-known oligopyrimidine close to the 3' splice site contributes to the excess of Y tracts in introns. We also observe, in the full sequence outputs, many long binary tracts in the UTR regions, particularly in the 3' UTR. An example can be seen in reference [26]: The three R.Y tracts above position 19,000 of p53 listed there are in the 3' UTR of the gene. A suggested RNA stability signal of 9 W bases [27] may explain some of the W tracts, but many other long tracts, of all motifs, are found in the 3' UTR region, appearing often in blocks, and call for an explanation.

In the intergenic regions, R.Y tracts are the highest over-represented subregion in human chr. 22, in Arabidopsis and in yeast, while in chr. 21, in fly and in worm, introns are the even somewhat richer in R.Y tracts. In the smaller, but gene rich 3.45 Mb contig of chr. 21 (data not shown) R.Y tracts are highest in the intergenic regions, as in chr. 22. The excess of K.M over R.Y tracts in intergenic regions of Drosophila is to be particularly noted, while in the Arabidopsis chromosome their contribution is not very high.

A reviewer inquired how over-representation varies along a chromosome. The data in Additional file: 8 shows that for contig 28, f/e ratios for R.Y tracts decrease somewhat from the A,T rich, gene poor "desert" in the first half, to the gene rich second half. The f/e ratio of the W tracts increase even stronger in the same direction, but that may be due to the fact, that expected values increase strongly with % A,T while actually found tracts increase much less if at all.

Interspersed elements

A major finding of the human sequencing project was that a very high portion of the human chromosomes consists of various interspersed elements introduced into the genome. To what extent can these elements be responsible for the over-represented binary tracts? For instance, most alu elements contain, at their end, 20–30 consecutive A's partly incorporated into the genome. To answer this question, several genes and chromosomal contigs were run by TRACTS after having been "masked" (interspersed elements taken out). This was done with program RepeatMasker, with parameter – nolow; this means that "simple" repeats and certain other low complexity tracts are not taken out; only LTR, MER, LINE and SINE elements were masked out (mainly alu runs, Additional file: 6). The longest sequence we could run was contig "3.45" of Chr. 21, which is the q most contig of the chromosome, a relatively gene rich contig with 51.5% GC. After masking, 2,125,818 bases out of the original 3,450,347 bases remained (61%). The masked sequence was subjected to TRACTS. The results (Table 7) show that over-representation of all three binary pairs is reduced, but only to a limited extent – over-representation remains high for all three binary compositions. The most reduced motif is the W motif – possibly because of the last bases of the alu element. This means that a certain share of the long tracts does indeed reside in the inserted elements, but that many long tracts do reside in the non-masked fraction. This was true even when masking out also the "simple" and the "low complexity" elements. It is clear that over-representation cannot be explained as stemming mainly from so far identified inserted elements.

Table 7 Masked and non-masked frequencies of long R.Y tracts. In contig 3.45 of human chromosome 21

Full size table

Discussion

The main finding reported here is that DNA tracts consisting of only two of the bases are in vast excess all over the animal and plant kingdoms, reaching mega-fold values. The highest excesses are found for R.Y tracts in humans and in other mammals, as observed originally in the pioneering work of Erwin Chargaff and coworkers [29]. In certain organisms – like in Drosophila – K.M tracts prevail. In bacteria, W tracts are the most over-represented binary motif [15], a finding also anticipated by Chargaff and coworkers [29]. One caveat – only one chromosome or contig, from a single species in a particular phylum, is discussed here, except for the two human contigs. Two yeast chromosomes and one Drosophila segment were previously reported, and all show similar abundances [14, 30].

Gentles and Karlin [31] report a distinct dinucleotide signature for each of the genomes studied here. The four dinuleotides present in homopurine tracts are AA, GG, GA, and AG. These four dinucleotides are indeed over-represented in the genomes surveyed by Gentles and Karlin (except in Drosophila!). The rarity of CpG dinucleotides most probably contributes to the low number of S tracts in humans. On the other hand, only a minor percentage of all bases (and dinucleotides) resides in long tracts: For instance, only 5.5% of all bases in contig 23 of chr. 22 are in binary tracts longer than 10 nt (Table 2). Long tracts are thus not necessarily the major factor determining the dinucleotide signature. It is worth to note that the D. melanogaster chromosome, besides the high K.M ratios, manifests also the highest excess of long W tracts (Table 5), along with E. coli and H. influenzae; a closer relationship between these organisms has also been noted when dinucleotide signatures of E. coli and Drosophila were compared [31].

A comment on the equations used to calculate expected values (eqs. 1–4): It was assumed tacitly that compositional frequencies are neighbor independent (zero Markov order). Lower tract abundances would have been obtained, if higher order dependencies were introduced. For instance, we have seen that purines avoid being flanked by pyrimidines, and prefer to be flanked by purines. Specifically, a single A, or a single G prefer an A or a G base next to them. This effect is formally a first order Markov effect, but we prefer the biological viewpoint that a particular function with selective advantage, rather than an inherent neighbor effect, drives the bases together to form binary tracts. A neutral, nonfunctional driving force towards excess of purine.pyrimidine caused by different substitution mutation rates has indeed been noted [32]. The substitution rates in the direction of all-purine or all-pyrimidine tracts were however the lowest [32] and are therefore unlikely to explain the massive excesses of R.Y tracts observed.

The vast excess of long binary tracts raises two questions: Is an essential structural and/or functional role responsible for the high numbers of binary tracts in the range of species studied? And if so, has that property been conserved throughout evolution, or have convergent processes been responsible for their wide spread presence? As to the second question, the reappearance of massive W tracts in Drosophila can be quoted in favor of independent (convergent) evolution, while if conservation would be the answer, an early progenitor with only a binary code could be suspected. A previous suggestion of an early RNY or YRN progenote is not in line with an all purine or all pyrimidine progenote [33]. More comparative binary DNA mapping will be required to answer this question.

This leaves the question of what can the essential function be. We, and others, have proposed that a special propensity of the binary tracts to unwind and be strand separated may be responsible. Ready unwinding is certainly expected for W tracts, based on their established melting properties. As to R.Y tracts, Weintraub and Larsen showed, in their seminal work [34], that certain purine/pyrimidine rich sequences in the 5' promoter region of the chicken beta globin gene complex are sensitive to single-strand DNA specific nucleases. Sensitivity to single-stranded specific nucleases means that these binary DNA regions have to be strand separated, at least temporarily. Since 1982, R.Y tracts in promoters of many genes (reviewed in [35]) have been found to be attacked by single-strand specific nucleases and hence are likely to undergo a transition into a strand separated state, at least temporarily. The list of these promoters includes a number of yeast and bacterial sequences characterized as AT rich [36, 37]. The single-strand nuclease sensitive regions have been called by Umek and Kowalski DNA Unwinding Elements, or DUE's [38]. Evidence from modification by chemical reagents attacking only unpaired bases, like permanganate [39, 40], chloroacetaldehyde [41] and osmium tetroxide [42] support at least intermittent conversion of the attacked strands into an unwound state [43, 44]. We have previously found that in yeast chromosomes III and XI [14] the highest binary tract concentrations are in the 5' promoter regions. This intriguing observation deserves a separate analysis of the promoter regions, which is in progress.

In our experimental work [17] we studied two yeast promoters that contain long oligopyrimidine tracts, namely the promoter regions of gene cyc1, which has an oligopyrimidine tracts of 40 nt, and of gene ded1, with a 32 nt pyrimidine tract (interrupted by a TATA box). These oligo Y regions, and their complementary R tracts, were found to be sensitive to the single-strand specific nuclease P1 when under normal cellular superhelical stress. Topological analysis was consistent with the opening of six turns of the primary helix. These findings strongly support the idea that binary tracts in critical regions can readily unwind and thus facilitate the transcription initiation process, possibly helped by single strand specific proteins. The notion that binary DNA tracts can lead to transitional strand opening can apply also to other DNA directed processes, including recombination, replication and segregation. We found evidence that a long W tract in the centromere yeast chromosome IV (78 nt) has a strong propensity to form an unwound structure [40]. A role in these processes can provide an explanation for the massive presence of the binary tracts in intergenic regions, far from transcriptional start sites.

As said, the early melting of W tracts is a well-established fact, while for S tracts the propensity to be methylated may be involved. It is somewhat harder to understand why R.Y or K.M tracts should readily unwind and form paranemic, unwound DNA structures [35] (also known as a local supercoil-stabilized structures [43]), especially when G or C rich. It is possible that the contribution of the different dinucleotides to stability [45] changes under superhelical stress and at ambient temperatures. Experiments to clarify this possibility have yet to be carried out. It should be noted that the bulk of the binary tracts observed here do not have the internal symmetries associated with paranemic structures such as DNA triplexes, cruciforms or even B-Z junctions. The DUE's are more likely to separate into single-strands and be stabilized by cellular proteins.

Are the observed binary tracts "simple" sequences in the usual sense, i.e. are the observed tracts composed of one or few nucleotides repeated many times, like oligo (C-T)? A detailed inspection of the tracts listed by the program demonstrates that for most tracts this is not the case: To get an idea, the last 15 longer tracts of contig 23, located beyond the last gene of the contig, are shown in Additional file: 7. The list contains a few simple sequences, for instance a 17 nt tract with GA repeated 8 times, ((GA)₈G). Some longer tracts may also show simple repeats within their sequence. For example, the long 367 nt R tract has a number of GGGAGGAGAGA repeats in it (see Additional file: 7). This repeat covers however only part of the tract and the other parts are much more random. A slippage mechanism [46] would need too many "slippages" to explain this tract, or many other tracts in the lists, as generated by TRACTS. Oligo A or Oligo T tracts are partial components of quite a number of R tracts (as well as of W and M tracts) and for these an additional mechanism may be operative. Nevertheless, the bulk of the binary tracts are just as random a mixture of two nucleotide bases as can be, and cannot be regarded as simple or even cryptic elements [46]. A full quantitative analysis has yet to be undertaken.

Finally, Table 6 documents another intriguing finding connected with the name of Erwin Chargaff, namely that in single strands the percentage of purines is equal to the percentage of pyrimidines. The same equality was found for A+C bases being equal to G+T bases, again in single strands [47, 48]. This phenomenon has been termed "the second parity rule of Chargaff". The percentages of A+G and A+C shown in Tables 3 and 6 demonstrate that their closeness to 50% is quite convincing. I have not encountered serious departures from that rule down to the length of individual genes, in all phyla studied, including bacteria. Two explanations have been raised: One explanation is that random inversion of homopurines during evolution caused this equality [49–51]. An alternative possibility is that there is a lot of potential or actual secondary structure in genomic DNA [52]. A definite explanation has yet to be provided and is beyond the scope of this paper. In conclusion, it thus seems that the analytical findings of the Late E. Chargaff will keep us busy for a while to come.

Conclusions

This paper documents one of the more significant departures of DNA from randomicity, namely that genomes exhibit an enormous excess of DNA tracts composed of only two bases. This phenomenon is conserved throughout evolution, and is therefore likely to reflect a specific DNA function. A most likely function is a propensity of these binary tracts (and possibly additional base combinations) to adopt under suitable condition an alternative, paranemic conformation. This notion is supported by a range of experimental evidence, detailed in the discussion part. We are presently examining whether a particularly high excess of the binary tracts is present in human promoters, as already found in yeast (R.Y tracts) and E. coli (W tracts), supporting a role for the binary DNA tracts in the regulation of transcription and other DNA directed processes.

Methods

Program TRACTS identifies all binary tracts in a given DNA sequence. The program was run in its original FORTRAN version [9, 15] on an UNIX platform. A cgi web server version, in perl, is now available [26] at url: http://www.weizmann.ac.il/~lcyagil/binaries_refs.html. The program calculates overall binary tract frequencies (see Table 2) as well as distributions in genic sub regions – exons, introns and intergenic regions. A further output of TRACTS shows the sequence entered, with each exon and intron indicated and each binary tract beyond a given length shown in color on or below the line. For more details, see [26]. A preprogram, ANEX, parses GenBank/DDJB/EMBL annotation files (flat format) and produces a file with a one line entry for each gene which includes a short comment on the product/function of the gene. When only a single or a few genes is examined, a list of all exons and introns is produced.

In GenBank files that contain both mRNA and CDS entries, the mRNA entries were parsed. Consequently, UTR regions are mostly part of the exonic sub regions. In yeast, C. elegans and M. jannaschii, where no mRNA data are yet available, the UTR regions are necessarily counted as intergenic (intercoding, to be strict). The accession and version numbers of the genomes analyzed are shown in Table 1. In humans, two large contigs, making up most of chromosomes 21 and 22, were analyzed: The "28" contig of chr. 21 which goes from the centromere through most of the q arm (28,515,322 nt) and makes up 85% of the sequenced chromosome; and the "23" contig of chr. 22 (22,998,450 nt), which makes up 66.6% of the sequenced chromosome.

Expected binary tract frequencies

Frequencies of binary tracts expected in random DNA are calculated as following: N(l) gives the number of tracts of length l expected in random DNA of length L and of fractional base composition p by:

N(l) = L(p^lxq² + q^lxp²), (1)

where p, q are the fractions of the participating base pairs, p+q = 1 (p is e.g. the fraction of A+G). To calculate expected values for only one member of a pair, only one member of the above sum is to be used. The number of bases expected in tracts of length n(l) is simply:

n(l) = l xN(l). (2)

The expected number of tracts equal or greater (GE) than a given length l, N(≥l), can be shown to be [9]:

N(≥l) = L (p x q^l+ q x p^l). (3)

The expected number of bases in these tracts, n(≥l), is:

n(≥l) = L {(p + ql) p^l+ (q + pl) q^l}. (4)

The validity of these expressions was tested by generating random DNA sequences and running them by TRACTS. For this paper, five 1 Mb random sequences with exactly 25% of each nucleotide base were generated and run for each binary composition, so that standard deviations could be calculated and are listed in Additional file: 4. The percentage of W and S bases in the analyzed chromosomes is not 50%, but a control run with 62.5% AT was previously run for H. influenzae, giving the same picture [15].

Abbreviations

chr.:: chromosome
GE:: Greater or Equal (longer or equal), Additional File 1
Table 1 :: Binary tract frequencies in contig 28 of human chromosome 22
Click:: Additional File 2, here for file
Table 2:: Binary tract frequencies in Drosophila chromosome R2, Click here for file, Additional File 3
Table 3 :: Binary tract frequencies of yeast chromosome IV, Click here for file, Additional File 4
Table 4. Five random sequences:: 1 Mb each., Click here for file, Additional File 5
Table 5. :: Frequencies of S tracts in seven sequenced chromosomes. Click here for file, Additional File 6
Table 6. :: Masked regions in contig 3.45 of human chromosome 21., Click here for file, Additional File 7
Table 7. All R.Y tracts longer than 10 nt:: beyond position 22,944,530 of contig 23., Click here for file, Additional File 8
Table 8.:: Contig 28 in windows of 2 Mb – f/e ratio GE 13., Click here for file

References

Tamm C, Shapiro HS, Lipshitz R, Chargaff E: Distribution density of nucleotides within a deoxyribonucleic acid chain. J Biol Chem. 1953, 203: 673-688.
CAS PubMed Google Scholar
Spencer JH, Chargaff E: Studies on nucleotide arrangement in deoxyribonucleic acids, VI. Pyrimidine clusters: Frequency and distribution in several species of the AT type. Bioch Bioph Acta. 1963, 68: 9-27. 10.1016/0006-3002(63)90105-7.
Article CAS Google Scholar
Petersen GB: The distribution of nucleotides in deoxyribonucleic acid. Biochem J. 1963, 87: 495-500.
Article PubMed Central CAS PubMed Google Scholar
Spencer JH, Chargaff E: Pyrimidine nucleotide sequences in deoxyribonucleic acids. Bioch Bioph Acta. 1961, 51: 209-211. 10.1016/0006-3002(61)91045-9.
Article CAS Google Scholar
Shapiro HS, Chargaff E: Studies on nucleotide arrangement in deoxyribonucleic acids VI. Direct estimation of pyrimidine nucleotide runs. Bioch Bioph Acta. 1963, 76: 1-8. 10.1016/0006-3002(63)90002-7.
Article CAS Google Scholar
Szybalski W, Kubinski H, Sheldrick P: Pyrimidine clusters on the transcribing strands of DNA and their possible role in the initiation of RNA synthesis. Cold Spring Harbor Symp Quant Biol. 1966, 31: 123-127.
Article CAS PubMed Google Scholar
Birnboim HC, Sederoff RR, Paterson MC: Distribution of R.Y segments in DNA from various organisms. Europ J Biochem. 1979, 98: 301-307.
Article CAS PubMed Google Scholar
Case ST, Baker R: Detection of long eukaryote-specific pyrimidine runs in repetitive DNA sequences and their relation to single-stranded regions in DNA isolated from sea urchin embryos. J Mol Biol. 1975, 98: 69-92.
Article CAS PubMed Google Scholar
Bucher P, Yagil G: Occurrence of oligopurine.oligopyrimidine tracts in eukaryotic and prokaryotic genes. DNA Sequence. 1991, 1: 27-43.
Google Scholar
Behe MJ: An overabundance of long oligopurine tracts occurs in the genome of simple and complex eukaryotes. Nucleic Acids Res. 1995, 23: 689-695.
Article PubMed Central CAS PubMed Google Scholar
Karlin S, Ghandour G: The use of multiple alphabets in kappa-gene immunoglobulin DNA sequence comparisons. EMBO J. 1985, 4: 1217-1223.
PubMed Central CAS PubMed Google Scholar
Antequera F, Bird A: CpG islands as genomic footprints of promoters that are associated with replication origins. Current Biol. 1999, 9: R661-R667. 10.1016/S0960-9822(99)80418-7.
Article CAS Google Scholar
Yagil G: The frequency of two-base tracts in eukaryotic genomes. J Mol Evol. 1993, 37: 123-130.
Article CAS PubMed Google Scholar
Yagil G: The frequency of oligopurine.oligopyrimidine and of other two-base tracts in yeast chromosome III. Yeast. 1994, 10: 603-611.
Article CAS PubMed Google Scholar
Shomer B, Yagil G: Long W tracts are over-represented in the E. coli and H. Influenzae genomes. Nucl Acids Res. 1999, 27: 4491-4480. 10.1093/nar/27.22.4491.
Article PubMed Central CAS PubMed Google Scholar
Raghavan S, Haniharan R, Brahmachari SK: Polypurine.polypyrimidine sequences in complete bacterial genomes: Preference for polypurines in protein-coding regions. Gene. 2000, 242: 275-283. 10.1016/S0378-1119(99)00505-3.
Article CAS PubMed Google Scholar
Yagil G, Shimron F, Tal M: DNA unwinding in the CYC1 and DED1 yeast promoters. Gene. 1998, 225: 152-163. 10.1016/S0378-1119(98)00525-3.
Article Google Scholar
Hattori M, Fujiyama A, Taylor TD, Watanabe H, Yada T, Park HS, Toyoda A, Ishii K, Totoki Y, Choi DK, et al: The DNA sequence of human chromosome 21. Nature. 2000, 405: 311-319. 10.1038/35012518.
Article CAS PubMed Google Scholar
Dunham I, Hunt AR, Collins JE, Bruskiewich R, Beare DM, Clamp M, Smink LJ, Ainscough R, Almeida JP, Babbage A, et al: The DNA sequence of human chromosome 22. Nature. 1999, 402: 489-495. 10.1038/990031.
Article CAS PubMed Google Scholar
Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al: The genome sequence of Drosophila melanogaster. Science. 2000, 287: 2185-2195. 10.1126/science.287.5461.2185.
Article PubMed Google Scholar
C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: A platform for investigating biology. Science. 1998, 282: 2012-2018. 10.1126/science.282.5396.2012.
Lin XY, Kaul S, Rounsley S, Shea TP, Benito MI, Town CD, Fuji CY, Mason T, Bowman CL, Barnstead M, et al: Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature. 1999, 402: 761-768. 10.1038/45471.
Article CAS PubMed Google Scholar
Jacq C, and 138 colleagues: The nucleotide sequence of S. cerevisiae chromosome IV. Nature. 1997, 75-78. Suppl 387
Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, et al: Complete genome sequence of the methanogenic archeon, Methanococcus jannaschii. Science. 1996, 273: 1058-1073.
Article CAS PubMed Google Scholar
Davuluuri RV, Grosse IV, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nature Genetics. 2001, 29: 412-417. 10.1038/ng780.
Article Google Scholar
Gal M, Katz T, Ovadia A, Yagil G: TRACTS: A program to map oligopurine.oligopyrimidine and other binary DNA tracts. Nucl Acids Res. 2003, 31: 3679-3681. 10.1093/nar/gkg625.
Article Google Scholar
Zubiaga AM, Belasco JG, Greenberger ME: The nonamer UUAUUUAUU is the key AU-rich sequence motif that mediates mRNA degradation. Mol Cell Biol. 1995, 15: 2219-2230.
Article PubMed Central CAS PubMed Google Scholar
Chargaff E: Essays in Nucleic Acids Amsterdam Elsevier. 1963, 126ff-146ff.
Google Scholar
Shapiro HS, Rudner R, Miura K-I, Chargaff E: Inferences from the distribution of pyrimidine isostichs in deoxynucleic acids. Nature. 1965, 205: 1068-70.
Article CAS PubMed Google Scholar
Yagil G: Binary DNA tracts can serve as DNA unwinding centers. J Biomolecular Struct Dyn. 2001, 18: 911-911.
Google Scholar
Gentles AJ, Karlin S: Genome-scale compositional comparisons in eukaryotes. Genome Res. 2001, 11: 540-546. 10.1101/gr.163101.
Article PubMed Central CAS PubMed Google Scholar
Hess ST, Blake JD, Blake RD: Wide variations in neighbor-dependent substitution rates. J Mol Biol. 1994, 236: 1022-1033. 10.1016/0022-2836(94)90009-4.
Article CAS PubMed Google Scholar
Eigen M, Osswatich-Winkler R: Transfer-RNA, an early gene?. Naturwissenschaften. 1981, 68: 282-292.
Article CAS PubMed Google Scholar
Larsen A, Weintraub H: An altered DNA conformation detected by S1 nuclease occurs at specific regions in active chick globin chromatin. Cell. 1982, 29: 609-616. 10.1016/0092-8674(82)90177-5.
Article CAS PubMed Google Scholar
Yagil G: Paranemic structures of DNA and their role in DNA unwinding. Crit Revs Biochem Mol Biol. 1991, 26: 475-559.
Article CAS Google Scholar
Kowalski D, Eddy MJ: The DNA unwinding element: A novel, cis acting component that facilitates the opening of the E. Coli replication origin. EMBO J. 1989, 8: 4335-4339.
PubMed Central CAS PubMed Google Scholar
Bramhill D, Kornberg A: A model for initiation at origins of DNA replication. Cell. 1988, 54: 915-917. 10.1016/0092-8674(88)90102-X.
Article CAS PubMed Google Scholar
Umek RM, Eddy MJ, Kowalski D: DNA sequences required for unwinding prokaryotic and eukaryotic replication origins. Cancer Cells. 1988, 6: 473-478.
CAS Google Scholar
Borowiec JA, Hurwitz J: Localized melting and structural changes in the SV40 origin of replication induced by T-antigen. EMBO J. 1988, 7: 3149-3158.
PubMed Central CAS PubMed Google Scholar
Tal M, Shimron F, Yagil G: Unwound regions in yeast centromere DNA. J Mol Biol. 1994, 243: 179-189. 10.1006/jmbi.1994.1645.
Article CAS PubMed Google Scholar
Kohwi Y, Kohwi-Shigematsu T: Structural polymorphism of homopurine-homopyrimidine sequences at neutral pH. J Mol Biol. 1993, 231: 1090-1101. 10.1006/jmbi.1993.1354.
Article CAS PubMed Google Scholar
Palecek E, Robert-Nicoud M, Jovin T: Local opening of the DNA double helix in eukaryotic cells, detected by osmium probe and adduct-specific immunofluorescence. J Cell Sci. 1993, 104: 653-661.
CAS PubMed Google Scholar
Palecek E: Local supercoil-stabilized DNA structures. Crit Rev Biochem Mol Biol. 1991, 26: 151-226.
Article CAS PubMed Google Scholar
Benham C, Kohwi-Shigematsu T, Bode J: Stress-induced duplex DNA destabilization in scaffold/matrix attachment regions. J Mol Biol. 1997, 274: 181-196. 10.1006/jmbi.1997.1385.
Article CAS PubMed Google Scholar
SantaLucia J: A unified view of polymer, dumbbell, and oligonucleotides DNA nearest-neighbor thermodynamics. Proc Natl Acad Soc (USA). 1998, 95: 1460-1465. 10.1073/pnas.95.4.1460.
Article CAS Google Scholar
Tautz D, Trick M, Dover GA: Cryptic simplicity in DNA is a major source of genetic variation. Nature. 1986, 322: 652-656.
Article CAS PubMed Google Scholar
Elson D, Chargaff E: Evidence for common regularities in the comparison of pentose nucleic acids. Bioch Bioph Acta. 1955, 17: 362-376. 10.1016/0006-3002(55)90384-X.
Article CAS Google Scholar
Rudner R, Karkas JD, Chargaff E: Separation of B. subtilis DNA into complementary strands III. Direct analysis. Proc Natl Acad Soc (USA). 1968, 63: 921-922.
Article Google Scholar
Baisnee P-F, Hampson S, Baldi P: Why are complementary DNA strands symmetric?. Bioinformatics. 2002, 18: 1021-1033. 10.1093/bioinformatics/18.8.1021.
Article CAS PubMed Google Scholar
Lobry JR: Properties of a general model of DNA evolution under no-strand-bias conditions. J Mol Evol. 1995, 40: 326-330.
Article CAS PubMed Google Scholar
Sueoka N: Intrastrand parity rules of DNA base composition and usage biases of of synonymous codons. J Mol Evol. 1995, 40: 318-325.
Article CAS PubMed Google Scholar
Bell SJ, Forsdyke DR: Deviations from Chargaff's second parity rule correlate with direction of transcription. J Theor Biol. 1999, 197: 63-76. 10.1006/jtbi.1998.0858.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

Dedicated to Erwin Chargaff (1905–2002), a pioneer. I am indebted to Dr. Jaime Prilusky for many helpful advices and to Dr. Shifra Ben-Dor for thoughtful comments to the manuscript. The help of many other members of the Biological Computing Unit of this Institute is gratefully acknowledged.

Author information

Authors and Affiliations

Dept. of Molecular Cell Biology, The Weizmann Institute of Biology, Rehovot, Israel, 76100
Gad Yagil

Authors

Gad Yagil
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Gad Yagil.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yagil, G. The over-representation of binary DNA tracts in seven sequenced chromosomes . BMC Genomics 5, 19 (2004). https://doi.org/10.1186/1471-2164-5-19

Download citation

Received: 18 October 2003
Accepted: 03 March 2004
Published: 03 March 2004
DOI: https://doi.org/10.1186/1471-2164-5-19

The over-representation of binary DNA tracts in seven sequenced chromosomes

Abstract

Background

Results

Conclusions

Background

Results

R.Y tracts in chromosome 22

R.Y tracts in seven chromosomes

K.M tracts in seven chromosomes

W and S tracts in the seven chromosomes

Distribution in genic subregions

Interspersed elements

Discussion

Conclusions

Methods

Expected binary tract frequencies

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us