The source data were the Arabidopsis and rice complete genome sequences in XML format, downloaded from **TAIR** (http://www.arabidopsis.org/) and the **International Rice Genome Sequencing Project** (http://rgp.dna.affrc.go.jp/IRGSP/), respectively. The genomic sequences for each chromosome were extracted from the XML files and stored in FASTA format. Sequences were filtered by masking any characters not present in the set S = {A, C, G, T, a, c, g, t}.
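The masking step can be illustrated with a minimal sketch. The function name `mask_sequence` and the use of `N` as the masking character are assumptions made for illustration; the text does not state which replacement character was used.

```python
import re

# Matches every character that is NOT in S = {A, C, G, T, a, c, g, t}.
_NOT_IN_S = re.compile(r"[^ACGTacgt]")


def mask_sequence(seq: str, mask_char: str = "N") -> str:
    """Replace each character outside S with mask_char (assumed to be 'N')."""
    return _NOT_IN_S.sub(mask_char, seq)


if __name__ == "__main__":
    # Ambiguity codes (R, Y, n) and gap characters are masked; case is kept.
    print(mask_sequence("ACGTRYacgtn-"))
```

Masking rather than deleting keeps the base coordinates unchanged, so positions in the filtered sequence still correspond to positions on the chromosome.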

To compare plants with other organisms, sequence data from *Mus musculus* and *Saccharomyces cerevisiae* were obtained from the **NCBI FTP site** (ftp://ftp.ncbi.nih.gov/genomes/). Centromere positions were collected from the **Saccharomyces Genome Database** (http://www.yeastgenome.org/) and the **UCSC Genome Browser** (http://genome.ucsc.edu/).

### Genome-scale curvature calculation

The distribution of DNA curvature was computed with the CURVATURE program [5]. This program calculates the three-dimensional path of a DNA molecule and estimates the curvature of each segment by computing the radius of the arc that approximates the path of the axis of the DNA fragment. The dinucleotide wedge angles of Bolshoy [4] and the twist angles of Kabsch [33] were used for all calculations. Whole chromosome sequences were used as input, and maps of the curvature distribution along the whole sequence were produced using a window size of 125 bp. DNA curvature was measured in the DNA curvature units (cu) introduced by Trifonov and Ulanovsky [34], which were used in all of the analyses. The scale of these curvature units ranges from 0 (i.e., no curvature) to 1.0, which corresponds to the curvature of DNA wrapped around the nucleosome. For example, a 125 bp segment with a shape close to a half-circle has a curvature value of about 0.34 cu. Such strongly curved regions, with values >0.3 cu, appear infrequently in genomic sequences. Since each chromosome has its own curvature distribution, its average and standard deviation (SD) can be used to define thresholds and identify significant features. In this study, a curvature signal was identified as significant if its maximal curvature value was at least 3 SD above the genomic average.

The output spatial mapping file consists of two columns: the first enumerates the bases (positions along the chromosome in base pairs), and the second holds a floating-point number less than 1 that represents the curvature value at each base pair. Since this "map" file is too large to be plotted directly (the map file of rice chromosome 1, for example, has about 43 million curvature values), low perturbations were removed and high perturbations were emphasized. To achieve this, we used a method that summarizes the curvature signal, as described by the following algorithm.
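The 3 SD significance criterion for curvature values can be sketched as follows. This is only an illustration: the function name `significant_positions`, the use of the population SD, and the flagging of every position above the threshold (rather than only local maxima) are assumptions, not details taken from the CURVATURE program.

```python
import statistics


def significant_positions(curvature, n_sd=3.0):
    """Return (threshold, positions) for values at least n_sd SD above the mean.

    curvature: list of per-base curvature values (the second map-file column).
    Positions are 1-based, matching the base enumeration in the first column.
    """
    mean = statistics.fmean(curvature)
    sd = statistics.pstdev(curvature)  # population SD: an assumed choice
    threshold = mean + n_sd * sd
    positions = [i + 1 for i, c in enumerate(curvature) if c >= threshold]
    return threshold, positions


if __name__ == "__main__":
    # Toy profile: a flat background with one strongly curved position.
    toy = [0.1] * 99 + [0.5]
    threshold, hits = significant_positions(toy)
    print(round(threshold, 4), hits)
```

For a real chromosome with tens of millions of values, the mean and SD would more likely be accumulated while streaming the map file than computed from a list held in memory.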

### Signal processing

Our method slides a window along a given signal so that each window covers only part of the signal and contains a signal fragment with some high and low perturbations. In each window, we determine the extreme points by a simple analysis with *O(n)* time complexity. A point whose value is higher, or lower, than those of both its predecessor and its successor is called a maximal or minimal point, respectively, and is collected as an apex value. Then, in each window, two base lines are defined, one for the positive and one for the negative apex values, and these base lines are used to construct two new coordinates for the signal's peak values. These new coordinates are suitable for exaggerating the difference between low and high perturbations. To describe the method, we focus first on positive values: if the positive peak values are members of the set S_{p}={P_{1}, P_{2}, ..., P_{n}}, the mean value (M_{p}) of the set gives the base line of the positive apexes. Subtracting M_{p} from each member yields a new set of positive apexes, new_S_{p}={P_{1}-M_{p}, P_{2}-M_{p}, ..., P_{n}-M_{p}}. Applying an exponential function {e^{x}| x is a member of new_S_{p}} then emphasizes high apex values and reduces low apex values. This change of coordinates is a type of kernel function, as used in statistical machine learning approaches (such as support vector machines). Through this change, the low perturbations, which have negative values in new_S_{p}, are projected onto small values, whereas the high perturbations, which have positive values, are mapped to exponentially higher values. Negative apex values S_{n}={N_{1}, N_{2}, ..., N_{m}} are analyzed in the same way, except that the exponential function is changed to {-e^{-x}| x is a member of new_S_{n}}. The details of the algorithm are presented below. Figure 8 shows the curvature signals before and after applying the algorithm.

### Algorithm for signal processing

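*//For a given signal S with L sample points in an array S[1...L]*

**Begin**

Tentative window length = *L*/5

For *j*: 0 to 4 do

//Determining maximal and minimal points

For *i*: *j* × *L*/5 + 1 to (*j*+1) × *L*/5 do

If (*S*[*i*] is a positive apex)

Add *S*[*i*] to *S*_{p}

If (*S*[*i*] is a negative apex)

Add *S*[*i*] to *S*_{n}

//Computing mean values

*M*_{p} = mean of *S*_{p}

*M*_{n} = mean of *S*_{n}

//Changing coordinates

For *i*: 1 to *n* do

new_*S*_{p}[*i*] = *P*_{i} - *M*_{p}

For *i*: 1 to *m* do

new_*S*_{n}[*i*] = *N*_{i} - *M*_{n}

//Performing exponential functions

For *i*: 1 to *n* do

*P*_{i} = e^{x}, where x = new_*S*_{p}[*i*]

For *i*: 1 to *m* do

*N*_{i} = -e^{-x}, where x = new_*S*_{n}[*i*]

**End**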
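The procedure above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: local maxima are treated as positive apexes and local minima as negative apexes, apexes exactly on a window boundary are ignored, and the names `find_apexes`, `emphasize` and `process_signal` are hypothetical.

```python
import math


def find_apexes(window):
    """Collect strict local maxima and minima of one window in a single pass."""
    maxima, minima = [], []
    for prev, cur, nxt in zip(window, window[1:], window[2:]):
        if cur > prev and cur > nxt:
            maxima.append(cur)  # positive apex
        elif cur < prev and cur < nxt:
            minima.append(cur)  # negative apex
    return maxima, minima


def emphasize(window):
    """Shift apexes by their mean base line, then apply the exponentials."""
    maxima, minima = find_apexes(window)
    positive, negative = [], []
    if maxima:
        m_p = sum(maxima) / len(maxima)  # base line of positive apexes
        positive = [math.exp(p - m_p) for p in maxima]
    if minima:
        m_n = sum(minima) / len(minima)  # base line of negative apexes
        negative = [-math.exp(-(n - m_n)) for n in minima]
    return positive, negative


def process_signal(signal, n_windows=5):
    """Split the signal into n_windows consecutive windows and process each."""
    size = len(signal) // n_windows
    results = []
    for j in range(n_windows):
        # The last window absorbs any remainder of the integer division.
        end = (j + 1) * size if j < n_windows - 1 else len(signal)
        results.append(emphasize(signal[j * size:end]))
    return results


if __name__ == "__main__":
    print(emphasize([0, 2, 0, 4, 0]))
```

In the toy call, the two maxima (2 and 4) have a base line of 3, so the lower peak is mapped to e^{-1} and the higher one to e^{1}, which shows how peaks below the base line are damped and peaks above it are amplified.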

### Computation of tandem repeats and CpG islands

Tandem repeats across whole chromosomes were first detected using the Tandem Repeats Finder (TRF) program, version 4.0 [35]. Tandem Repeats Finder is an application for finding tandem repeats in DNA sequences that employs a stochastic model of repeats and associated statistical detection criteria.

We scanned the genomic sequences for CpG islands (CGI) using the Takai and Jones algorithm [26], which is optimized for searching for CGI in whole genomes. Its search criteria are GC content ≥ 55%, ObsCpG/ExpCpG ≥ 0.65, and length ≥ 500 bp. Based on this algorithm, we used eight iterative steps to scan all possible CGI in each genome, as follows:

1. Set a window of 125 bases at the start position of a sequence and calculate the GC content (%) and ObsCpG/ExpCpG in the first window. Here, ObsCpG/ExpCpG = N_{CpG}/(N_{C} × N_{G}) × N, where N_{CpG}, N_{C}, N_{G}, and N are, respectively, the number of CpG dinucleotides, C nucleotides, G nucleotides, and all nucleotides (A, C, G, and T) in the sequence being evaluated. Shift the window 1 base at a time until the window meets the criteria for a CGI.
2. Once a seed window (i.e., a window that meets the criteria) is found, move the window 150 bases forward and evaluate the new window.
3. Repeat step 2 until the window does not meet the criteria.
4. Shift the last window 1 base at a time towards the 5' end until it meets the criteria again.
5. Evaluate the whole segment (i.e., from the start position of the seed window to the end position of the current window). If it does not meet the criteria, trim 1 base from each side until it does.
6. Connect two individual CGI fragments if they are separated by fewer than 100 bases.
7. Repeat step 5 to evaluate the new sequence segment until it meets the criteria.
8. Reset the start position to immediately after the CGI identified at step 7 and go to step 1.
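The quantities evaluated at each step of the CGI scan can be sketched as follows. This covers only the per-window test, not the full eight-step scan; the function names and the `check_length` switch (used here to skip the length criterion when testing a single window) are assumptions made for illustration.

```python
def cpg_stats(seq: str):
    """Return (GC content in %, ObsCpG/ExpCpG) for one window or segment."""
    s = seq.upper()
    n = len(s)
    n_c = s.count("C")
    n_g = s.count("G")
    n_cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_percent = 100.0 * (n_c + n_g) / n if n else 0.0
    # ObsCpG/ExpCpG = N_CpG / (N_C x N_G) x N; defined as 0 if C or G is absent
    obs_exp = n_cpg * n / (n_c * n_g) if n_c and n_g else 0.0
    return gc_percent, obs_exp


def meets_criteria(seq: str, min_gc=55.0, min_obs_exp=0.65,
                   min_len=500, check_length=True) -> bool:
    """Apply the thresholds: GC >= 55%, Obs/Exp >= 0.65, length >= 500 bp."""
    gc_percent, obs_exp = cpg_stats(seq)
    if check_length and len(seq) < min_len:
        return False
    return gc_percent >= min_gc and obs_exp >= min_obs_exp


if __name__ == "__main__":
    print(cpg_stats("CGCGATAT"))
    print(meets_criteria("CG" * 250))
```

For the toy window `CGCGATAT`, N = 8, N_C = N_G = 2 and N_CpG = 2, giving a GC content of 50% and an ObsCpG/ExpCpG ratio of 2 × 8 / (2 × 2) = 4.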

### Statistical analysis

The statistical significance of the features described above was assessed by measuring the average and distribution of curvature along the genome, as well as the CpG and repeat numbers in non-overlapping windows along the genome. These distributions were used to calculate the SD, and significant features were selected by setting a threshold at 3 SD above the average. Features with values above this threshold were collected as significant (Table 2). The z-score, calculated by subtracting the average from the peak value and dividing by the SD, gives a measure of the statistical distance between the observed feature and the genomic average, and can be expressed as a probability.
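As a hedged illustration of the z-score and its conversion to a probability, the sketch below assumes that the feature values are approximately normally distributed and reports a one-sided (upper-tail) probability. Both choices are assumptions; the text does not specify how the probability was derived.

```python
import math


def z_score(peak: float, mean: float, sd: float) -> float:
    """Distance of a peak from the average, in units of SD."""
    return (peak - mean) / sd


def upper_tail_probability(z: float) -> float:
    """P(Z >= z) for a standard normal variable (normality is assumed)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    z = z_score(peak=0.25, mean=0.10, sd=0.05)
    print(z, upper_tail_probability(z))
```

Under this assumption, a feature exactly at the 3 SD threshold corresponds to an upper-tail probability of roughly 0.00135.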

A modified Markov-chain permutation process was used to obtain permuted chromosomes that conserve the dinucleotide and trinucleotide distributions: the chromosome DNA sequence was split into all dimers (in 2 phases) and all trimers (in 3 phases), and the set of dinucleotides or trinucleotides was shuffled. The permuted chromosomes obtained in this manner were subjected to the same analysis as the natural chromosomes. No statistically significant features were identifiable in these permuted cases. In Additional file 2 ("Markov-plots"), Figure S1 presents an overlay of curvature plots for a natural and a trinucleotide-permuted chromosome.
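One possible reading of the permutation step is sketched below: the sequence is cut into consecutive k-mers starting at a given phase, the k-mers are shuffled, and the sequence is reassembled. The function name `shuffle_kmers`, the handling of the leftover bases before the phase offset and after the last full k-mer (kept in place), and the seeded random generator are all assumptions; the authors' modified Markov-chain procedure may differ in detail.

```python
import random


def shuffle_kmers(seq: str, k: int, phase: int = 0, rng=None) -> str:
    """Shuffle the in-phase k-mers of seq (k=2: dimers, k=3: trimers).

    phase: offset of the first k-mer (0..k-1). Bases before the offset and
    any incomplete k-mer at the end stay where they are (assumed behaviour).
    """
    rng = rng or random.Random()
    head, body = seq[:phase], seq[phase:]
    n_full = len(body) // k
    kmers = [body[i * k:(i + 1) * k] for i in range(n_full)]
    tail = body[n_full * k:]
    rng.shuffle(kmers)  # permute the k-mers without altering their contents
    return head + "".join(kmers) + tail


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    original = "ACGTACGTAC"
    for phase in range(3):  # trimers in each of the 3 phases
        print(phase, shuffle_kmers(original, k=3, phase=phase, rng=rng))
```

This shuffle preserves the base composition and the multiset of in-phase k-mers exactly, while the junctions between shuffled k-mers are randomized.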

### Integrated plotting of the curvature, repeats and CpG islands

The integrated results of the three analyses described above, for the 12 chromosomes of rice and the 5 chromosomes of Arabidopsis, were drawn as individual and combined plots based on different parameters, using the freely distributed **Gnuplot** program (http://www.gnuplot.info/). Perl scripts for extracting the relevant data from the source result files and for generating the plots were developed in-house. Gnuplot parameters were set automatically and the final plots were saved in PNG format. The source code of all Perl scripts is freely available upon request.