Pervasive properties of the genomic signature

Background The dinucleotide relative abundance profile can be regarded as a genomic signature because, despite diversity between species, it varies little between 50 kilobase or longer windows on a given genome. Both the causes and the functional significance of this phenomenon could be illuminated by determining if it persists on smaller scales. The profile is computed from the base step "odds ratios" that compare dinucleotide frequencies to those expected under the assumption of stochastic equilibrium (thorough shuffling). Analysis is carried out on 22 sequences, representing 19 species and comprised of about 53 million bases all together, to assess stability of the signature in windows ranging in size from 50 kilobases down to 125 bases. Results Dinucleotide relative abundance distance from the global signature is computed locally for all non-overlapping windows on each sequence. These distances are log-normally distributed with nearly constant variance and with means that tend to zero slower than reciprocal square root of window size. The mean distance within genomes is larger for protist, plant, and human chromosomes, and smaller for archaea, bacteria, and yeast, for any window size. Conclusions The imprint of the global signature is locally pervasive on all scales considered in the sequences (either genomes or chromosomes) that were scanned.


Background
Compositional heterogeneity pervades the genome on all scales [1] and attempts to partition DNA sequences into homogeneous segments give different results depending on the method chosen, even for genomes as small as the 49 kilobases (kb) of bacteriophage lambda [2]. It therefore seems remarkable that dinucleotide relative abundance profiles exhibit local stability in the sense that, when computed for any 50 kb window on a given microbial genome, the profile is about the same as if computed globally from the bulk genomic DNA of the organism [3,4]. The dinucleotide relative abundance profile, previously called the "general design [5]," can be viewed as a "genomic signature" that reflects a "total net response to selective pressure [6]." Yet the 50 kb window spans the complete genome of bacteriophage lambda and hence it would not be surprising to discover local instability in the profile seen through smaller windows.
Bacterial genomes commonly exhibit an approximate balance between purine (A+G) and pyrimidine (C+T) fractions, and between amino (A+C) and keto (G+T) fractions, when the whole leading strand is examined. Locally, however, these fractions fluctuate from their global averages almost as dramatically as the strong (C+G) and weak (A+T) fractions that do not show global balance. In-deed, these compositional biases, extending over hundreds of kilobases, are strongly correlated with the direction of replication in some species [7]. Perhaps the stability of the dinucleotide relative abundance profile in 50 kb and longer windows conceals highly variable behavior on a shorter scale. By analogy, the variability of the profile within 50 kb windows could have as much functional significance as invariance between such windows.
The stability of the dinucleotide relative abundance profile could result from constraints on dinucleotide stacking energy and DNA helicity, context-dependent mutation pressures, and replication and repair mechanisms [3,4,6,8]. To gauge the influence of these or other factors in contributing to total net response, it again seems pertinent to ask whether stability persists on smaller scales than 50 kb. If it is reasonable to suppose that signature variation between species is due to differences in replication machinery, operating through local DNA structures and base step conformational tendencies [9], then we may expect to find stability on the scale of the machinery. For example, the proximity of dinucleotide relative abundance profiles in two samples could reflect the similarity of enzymes [10] that engage them in their replication processes. When these samples come from the same genome, which encodes the same enzymes, a close proximity should be expected.
Replicational constraints that act on all parts of the genome could also manifest themselves in codon usage preference. The genomic signature may be associated with codon usage to the extent that it embodies these constraints. Given that the signature "pervades both coding and noncoding DNA," its uniformity throughout the genome "cannot be explained by preferential codon usage [9]." The converse proposition, that codon usage is explained by dinucleotide relative abundance, is hardly more defensible, since dinucleotide frequencies do not determine trinucleotide frequencies [11]. Yet the replicational constraints could be a common factor underlying both phenomena. If the global signature pervades an open reading frame (ORF), the trinucleotide frequencies will to some extent express the signature as modulated by the local composition. (Codon usage may exhibit more diversity within genomes by virtue of its sensitivity to composition.) This kind of association would be supported by discovery of signature stability in windows smaller than the average ORF and negated by the breakdown of signature stability below some window size that far exceeds the typical ORF length, unless protein coding sequences are less sensitive to local replicational constraints than dinucleotide relative abundance [12].
Genomes are identifiable by their signatures and dissimilarity between signatures is used to estimate the evolu-tionary distance between species [4,6]. If the imprint of the global signature is locally pervasive, down to the scale of the single gene or coding sequence, large deviations on that scale could highlight segments introduced by recent horizontal transfer from another species [13]. So-called filtration methods, based on dissimilarity measures computed from dinucleotide counts, have been employed for the alignment-free computation of evolutionary distances between homologous sequences [14,15]. The "transition matrix method" was a similar technique involving raw counts of amino acid pairs in protein primary sequences [16]. Phylogeny construction based on dinucleotide relative abundance distance ("delta-distance") would seem to have a similar intent although its application has been to whole genomes [9]. Mathematically, however, the use of relative abundance instead of raw abundance (frequencies or counts) is a subtle innovation that compensates for compositional variation. This subtle change has a profound effect and is essential to achieving logical results since whole genome comparisons based on raw abundance often fail to find a close proximity between closely related species.
Can the same improvement be achieved by dinucleotide relative abundance distance calculations in smaller windows? If so, a delta-scan of one bacterial genome could detect outliers bearing the imprint of a host or foreign signature. Whether this is practical depends on how the variability of the intra-genomic delta-distance grows with shrinking window size. Purely statistical considerations will limit the detectability of small scale deviations. From the investigations of Karlin et al [3,4,6,8,9,12,17] it is clear that delta-distances between 50 kb and larger windows show fluctuations that are small compared to distances between closely related species.

Hypothesis formulation
The assessment of bias in dinucleotide relative abundance begins with the "odds ratios" r xy = f xy /f x f y where f x denotes the (normalized) frequency of nucleotide (base) x and f xy is the frequency of dinucleotide (base step) xy in the leading strand. These ratios compare observed dinucleotide frequencies to those expected from the base composition alone under the assumption of statistical independence (i.e., thorough shuffling of the sequence). When r xy is greater (less) than one, xy is over(under)-represented. The symmetrized version r* xy is computed from frequencies of the sequence concatenated with its inverted complement. The numbers {r* xy } comprise the dinucleotide relative abundance profile [3][4][5][6].
The statistical problem is to test the hypotheses that patterns of dinucleotide over-and under-representation in a given genome are invariant. Using symbol f for frequencies in windows, let g be used to represent the global fre-quencies computed for a complete sequence (a genome or chromosome). The hypothesis asserts that r* xy in any window is approximately equal to a constant. This constant is the global signature c* xy = g* xy /g* x g* y .
The local stability of r xy would be an ancillary result under the assumption of a stationary stochastic process, since then the frequencies converge in probability to fixed limits, f x → p x and f xy /f x → p xy as n → ∞. The simplest case would be a homogeneous Markov chain with base step transition probabilities p xy and stationary base composition p x . In this case the differences f x -p x and f xy /f x -p xy tend to zero as 1/√n and it is clear that r xy -p xy /p y will vanish at the same rate [18]. Moreover, the globally computed quotient c xy = g xy /g x g y would be a consistent estimate of p xy /p y .
Thus the separate terms of Σ |r xy -c xy | tend to zero as 1/√n and the same must obviously apply to δ*.
We start however with the understanding that the sequence is fundamentally nonstationary, exhibiting statistically significant variations in base frequencies between non-overlapping windows [1,2]. Locally or globally estimated Markov models may describe it better than assuming that the bases are independent and identically distributed [19] but they fail to reflect the salient features of natural sequences [20]. For example, Robin and Daudin [21] compared the frequency of a specific motif in the genome sequence of Haemophilus influenzae to Markovian predictions and found that the observed frequencies were everywhere higher than predicted.
This point aside, the stationary Markov analogy provides a useful benchmark in assessing local stability of the genomic signature. If nucleotide sequences behaved like Markov chains, then δ*√n would not depend on n. We will examine this scaled delta-distance δ*√n to see if it exhibits any trend. A decreasing ("super-Markov") trend would imply that signature stability emerges as the scale increases and could indicate the breakdown of stability for some window size below 50 kb.  This survey examines 22 sequences from 19 species and  17 genera. The sequences are listed in Table 1 along with serial numbers (SN) and 4-letter abbreviations (Abbr). Most of them have been previously studied and found to show stability of the genomic signature in 50 kb windows [9]. The shortest is the 580 kb complete genome of Mycoplasma genitalium. The longest complete sequence is the 7657 kb human chromosome XXII.

Scope of the investigation
The selected sequences, which are not always complete genomes, fall into two main classes, being typically (1) the chromosome that constitutes the largest single element in a prokaryotic genome or (2) one of the chromosomes in a eukaryotic genome. In the first case, for example, is the Borrelia burgdorferi sequence that excludes 21 identified plasmids. The second case is exemplified by Plasmodium falciparum where chromosomes II and III are selected to the exclusion of the other 12. Since our investigation focuses on scaling properties of the genomic signature, it is appropriate to consider sequences comprised of many 50 kb contigs, assuming that variation between such contigs is small compared to variation between species.
The present sample, which spans a wide range of G+C proportion, is hoped diverse enough that any consistent trends and features in the statistical picture it produces cannot be easily attributed to chance. A broader range of sequences, including mitochondrial and large viral genomes, has been surveyed by Karlin et al [3,4,6,8,9,12,17] using 50 kb and larger windows. This investigation, which applies similar methodology to smaller window sizes, concerns the intra-genomic homogeneity of dinucleotide relative abundance, and inter-sequence distance calculations are beyond its scope.
Our use of nonoverlapping windows is consistent with the methodology employed in prior studies using 50 kb and longer windows. (Overlapping windows with a high percentage overlap would be required to localize and sort out the significance of nonconforming segments that could possibly reflect a foreign signature.) Overlap would introduce statistical dependence between successive windows and such dependence could only complicate the analysis of variance within sequences. Window size was varied by factors of approximately two, the specific values being 50 kb, 25 kb, 10 kb, 5 kb, 2 kb, 1 kb, 500 b, 250 b, and 125 b.

Results and discussion
Increasing trend in mean scaled delta-distance Table 2 shows mean scaled delta-distance by sequence (Abbr) and window size (n). Scaled delta-distance, defined above as δ*√n, is a statistical invariant for any benchmark sequence generated by a Markov chain exhib-iting the same signature as the given sequence. Except for the Borrelia burgdorferi sequence, the mean scaled deltadistance is increasing in window size for every sequence. The lone exception is a drop, for B. burgdorferi, from 3.580 to 3.553, as window size increases from 10 kb to 25 kb.
The essentially increasing trend in δ*√n has an obvious implication for the scalability of standard binning levels used in classifying intra-genomic delta-distance [8]. These levels cannot be re-scaled by the reciprocal square root of window size without admitting that the profiles seen through smaller windows are statistically closer to the global signature. The profiles seen through larger windows obviously tend toward the signature but local fluctuations tend to zero slower than 1/√n (i.e., the convergence rate is "sub-Markov").

Quasi-stable hierarchy of mean scaled delta-distances
The rows of Table 2 are grouped as archaea (Aful, Mjan), protist (Pfa2, Pfa3), yeast (Sc11, Sc15), plant (Ath4), human (Hs22), or bacteria (all 14 others). Rows are averaged in groups and the results are plotted in Figure 1. Granted that the curves for archaea and bacteria are almost indistinguishable, we see that ranking mean scaled delta distances by groups produces about the same result for every window size. The one clear exception is the crossing of the curves for plant (Arabidopsis thaliana, chromosome IV) and human (chromosome XXII) between n = 2 kb and n = 5 kb. Thus the view through 50 kb windows shows that the imprint of the global signature is weaker in the latter; but 2 kb and smaller windows show relatively clearer imprints in the human chromosome.

Normality and homoscedasticity of log delta
The increasing trend in all the curves of Figure 1 suggests that local deviations from the global signature are better described by a multiplicative error process (instead of additive). Therefore we examine the (natural) logarithms of the delta-distances in windows and find that they are close to normally distributed with nearly constant variance over the range of window size. Mean values of -log(δ*) are shown in Table 3 with corresponding standard deviations in Table 4. Tables 3 and 4 defines a normal distribution with mean a variance determined by a sample of observed delta-distances. Each sample is subjected to the Kolmogorov-Smirnov test of fit to the corresponding normal distribution [22]. (We generate a random sample of the same size as the observed sample from a normal distribution with the same parameters. The Kolmogorov-Smirnov two-sample test is then invoked. The null hypothesis is that the observed and simulated samples share a common underlying distribution.) P-values    from the test, which measure support for the null hypothesis of normality, are shown in Table 5. The bottom row of Table 5 counts (by window size) the number of times the test accepts normality of log(δ*) at the 5% significance level.

Each pair of corresponding cells in
Delta-distances tend to fit the log-normal distribution better for window sizes 1 kb and larger. Rejection of log-normality for the smallest window sizes is to some extent a consequence of increasing sample size. The apparent goodness-of-fit actually gets better as window size decreases as illustrated in Figure 2 for three indicated window sizes on the Haemophilus influenzae sequence. (The sample CDF approaches a smooth curve as window size declines and the number of windows increases. The heavy plotted points are from log-normal distribution functions with means and variances of the corresponding samples.)

Sub-Markov convergence rate of mean log delta
Since δ*√n is essentially increasing in n, we infer that the locally computed dinucleotide relative abundance converges to the global signature more slowly than the reciprocal square root of window size. The log-normality of the intragenomic delta distance suggests a log-linear model of the form log(δ*) = αβlog(n) for intercept (α) and slope (β) coefficients that can be estimated for each sequence (or group) by simple linear regression. Table 3, columns 10 and 11, give intercept and slope coefficients of least squares fits to the scatter plots of mean log(δ*) versus log(n). Column 12 gives the mean absolute residual (MAR) to indicate the closeness of the straight line fit. All estimated slope coefficients are greater than the benchmark value -1/2 implied by the Markovian analogy. Figure  3 shows least squares fits to log-log scatter plots obtained by grouping rows of Table 3 (in the same as way described above for going from Table 2 to Figure 1).

Weakly consistent patterns of intragenomic variability in delta-distance
Residual accumulated delta-distance (RADD) is defined as window size times the cumulative sum of terms δ*(t)mean(δ*), t = 1, 2, ..., T where t is the position index (counted in windows from the 5' end) and T is the total number of windows. Here mean(δ*) is just the average of unscaled delta-distances in windows of size n. Since δ*(1) +...+ δ*(T) = T mean(δ*), a plot of RADD versus t always returns to zero. The RADD plot will superficially resemble

Figure 1
Mean scaled delta-distance versus window size for six subsets of the sequences. The mean intra-genomic deltadistance for windows of a given size on each sequence was computed and scaled up by the square root of window size to give the numbers shown in Table 1. Then the table is condensed by grouping its 22 rows into 6 subsets as defined in the text. Each curve shows the group-wise average.

Figure 2
Cumulative distribution functions (CDFs) of log delta-distance computed from windows of three sizes on theHaemophilus influenzae genome. The CDF gives the fraction of windows in which log delta-distance did not exceed the abscissa. The empirical CDFs are drawn with connected line segments and they increase in smoothness with the total number of windows. The discrete points (squares) trace the CDFs of corresponding normal distributions having the same mean and variance as the three sets of observations. random walks obtained by integrating counts of purine minus pyrimidine bases. Such "walking plots" are useful in depicting long range compositional biases concomitant to replication [7] and their self-similarity with respect to re-scaling has been said to instantiate the "fractal geometry of nature [23,24]." Here we use the RADD plot to examine how well deltadistances in smaller windows are smoothed by corresponding larger windows. For example, take a single 50 kb window on the M. genitalium sequence, compute its deltadistance from the global signature, then divide it into 50 contiguous 1 kb windows and repeat the calculation for each piece. Do the 50 delta-distances exhibit random fluctuations about the level indicated by the single larger window? If so, the patterns would be judged consistent, and that result would typify a stationary process. Such consistency is only weakly evidenced In Figure 4, however, as larger scales give rise to emergent trends. The same effect is seen in Figure 5 where the 125 b and 1 kb traces track each other closely but diverge from the other traces in the central region.

Conclusions
The imprint of the global signature is locally pervasive on all scales considered in the sequences that were scanned. No lower bound can yet be placed on the local scale on

Figure 3
Mean log delta-distance versus window size for six subsets of the sequences. The lines show simple linear regressions of the mean log delta-distance on log window size. The plotted points were obtained by grouping rows of Table 3 as described in the text.  which the global signature is reflected. The inter-genomic hierarchy of mean intra-genomic delta-distances is essentially preserved across the range of window sizes. Intra-genomic delta-distance is approximately log-normally distributed (in windows down to 1 kb) and the variance of log-delta is fairly uniform across the set of sequences. Delta-distance tends to zero with increasing window size but the rate of this convergence is significantly slower than for simple random processes.

Methods
The sequences listed in Table 1 were downloaded from the (National Center for Biotechnology Information) GenBank [25] in FASTA format and saved as plain text files. The last two columns of Table 1 provide GenBank accession numbers and approximate dates of the revisions used in this analysis. The sequences, as text files, were processed by routines written in SPlus, Version 4.5. The texts were read in blocks (contigs) of 50 kb (when computing delta-distances in 50 kb and 25 kb windows) or 20 kb (for smaller windows). Counts of overlapping base steps in windows were computed with the last base of one window serving as the first base of the next. Thus the number of base steps in a window is equal to the number of bases (not one less). However, one base step was uncounted at the start of every block. Incomplete windows in the last block were discarded and blocks past end-of-file were ignored. Sometimes complete blocks near end-of-file were omitted. With the exception of Arabidopsis thaliana, chromosome IV, the total sequence lengths in kilobases are listed in the fourth column (kb) of Table 1. All corresponding sample lengths are at least 96% of total length, and most are close to 99%, when texts were read in 20 kb blocks. For A. thaliana, however, the global signature was computed from a scan of the 99% complete sequence; but local delta-distances from the global signature were computed only for the first 50% of the complete sequence due to computing difficulties. The 8750 kb length for this sequence in Table 1 is therefore about half the total bases in the chromosome.