Bacterial genome sequences and equipment for wavelet analysis
Bacterial genome sequences analyzed in this study were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov; Table 1). If we produced the curves for the four indices (GC skew, AT skew, keto excess and purine excess; see below) from the downloaded sequences directly, most of the diagrams would be of the "V" or reverse "V" shape, making the positions for ori and ter obscure (e.g., Bacillus halodurans C-125, Bacillus subtilis, Borrelia burgdorferi, etc). The particular shape depends on the point chosen to initiate the cumulative summation (i.e. the point corresponding to i = 1 in the formulae for the various indices). If this point happens to be too close to either ori or ter, the curve will give an ambiguous estimate of the location of that site. In such cases, we simply chose to start the summation from a different base in the sequence, so that the peak and valley (corresponding to the ori and ter regions) can be clearly identified. The particular starting base for the analysis in each bacterial chromosome is available from the authors on request.
We used wavelet transform methods [27, 28]. in the analysis of these bacterial genomes to detect the genomic features that might be associated with the locations of ori and ter, including AT and GC skews [16, 20]. and purine and keto excesses [24]. Methods based on wavelet transforms generally require powerful visualization tools. In the implementation, we analyzed the genomes for these indices using C++ codes, performed wavelet transformations via Matlab, and made graphics with the Xmgrace Graphic software on MACI-cluster parallel computers.
AT and GC skews
Cumulative AT skew (ATS) was defined as the sum of (A-T)/(A+T) in adjacent windows and was determined by
and, similarly, cumulative GC skew (GCS) was defined as the sum of (G-C)/(G+C) in adjacent windows and was determined by
where n is the window size and N is the chromosome length.
Purine and keto excesses
Purine excess was defined as the sum of all purines (AG) minus the sum of all pyrimidines (TC) encountered in a walk along the sequence up to the point plotted and was determined by
and, similarly, keto excess was defined as the sum of all keto bases (GT) minus that of the amino bases (AC) and was determined by
where N is chromosome length, and B is the number of the particular base (A, C, G or T) occurring at the i th location.
Wavelet transform methods
Wavelet analysis has become a common tool for documenting localized variations of power within a time series, with successful applications in signal and image processing, numerical analysis and statistics. The basic procedure is to adopt a prototype function, called an analyzing wavelet or mother wavelet, and represent the signal using scaled and shifted versions of this function. Because the original function can be represented in terms of a wavelet expansion, data manipulations can be performed using corresponding wavelet coefficients. The wavelet transform is especially useful in detecting singularities in the presence of noise by examining the maxima in the modulus of the wavelet transform. In particular, we sought the abscissa where the maxima converge at fine scales. These maxima indicate positions of high curvature in a smoothed version of the signal and thus will indicate the presence of corners. At coarse scales, noise is unimportant and maxima are easy to identify, although their locations are not precise (the smoothing has "blurred" the signal). At fine scales, the smoothing is less strong, and the locations are more precise. On the other hand, at finer scales the signal-to-noise ratio becomes more dominant, so the maxima are harder to identify. As a result, the technique of following the lines of maxima from coarse to fine scales allows us to retain the advantages of both coarse- and fine-scale analysis.
We employed the continuous real wavelet transform [27] and our analyzing wavelet is the normalized first derivative of a Gaussian function:
where σ is a scaling factor. The real wavelet transform of a function f is
In order to apply this transform to a vector x of length N, x is taken to correspond to samples at the points t0 = 0, t1 = 1/ N, t = 2/N,...,t
N
= 1 - 1/N of a 1-periodic function x(t). The wavelet transform Wx, for a range of scales s, is then calculated in the Fourier domain using the Fast Fourier Transform. The result is a two-dimensional array of values of Wx at positions t and scale s.
Models
Two distinct models were employed: one for the AT and GC skews and one for the purine- and keto-excesses. For AT or GC skew, the signals were modelled with two binomial distributions in each strand. To take the AT skew as an example, the first provides the probability of obtaining a count of k A's and T's in a window of length n as
and the second gives the probability of j of those being A's (and so k-j being T's) as
For purine or keto excess, the raw signals were analyzed, given by the moving differences of the cumulative sum. This would result in a sequence of 1's and -1's, modelled as being drawn from a simple binomial distribution in each strand, with a probability p
i
of drawing a 1 and 1-p
i
of drawing a - 1.
Significance levels and the null hypothesis
In order to determine the significance levels for our data, i.e., the identification of the joins between the two replicores (halves of the chromosome between ori and ter), we used the null hypothesis that the signal was generated according to the above models, but with p
i
= 0.5 and q
i
= 0.5 in each replicore, so that the signals should be homogenous. Artificial signals were created using this hypothesis, with their wavelet transforms calculated. At each scale, this resulted in a sequence of wavelet coefficients, and the distribution of the values in this sequence was used to determine significance levels for the wavelet coefficients computed from the original signal.
Confidence intervals
In order to understand and calibrate the performance of the wavelet estimator, Monte-Carlo simulation was employed. For the keto and purine excess signals, rough estimates of the probabilities p
i
and the locations of the joins between the two replicores were obtained by examining the cumulative sums and locating the maximum and minimum values. For the AT and GC skews, the value of q
i
was assumed to be 0.5, and the probabilities p
i
were estimated as for the previous signals. Then 1000 simulated signals were created with the same properties, according to the models described above, and the wavelet estimator was used to recover the locations of the joins between the two replicores. The distribution of the resulting estimates is then used to deduce confidence intervals for the results from the wavelet estimator applied to the original signal. However, many of the signals were too long to make the creation and estimation of 1000 simulated signals and 1000 runs a realistic task. In this case, the simulations were repeated using shorter signals, with a succession of different lengths, and the results were extrapolated to the length of the original signal.
Using this method, we analyzed 40 bacterial chromosomes and located the putative positions for ori and ter by AT skew, GC skew, keto excess and purine excess. We then tested their statistical significance at 5% level. To increase the accuracy and comparative power, we adopted a 100 bp small moving window size in the GC and AT skew analysis. Finally, we used wavelet transformation to analyze all indices, to find significance regions and confidence intervals at 95% level and to test the significant peak at 5% level.