Strand bias structure in mouse DNA gives a glimpse of how chromatin structure affects gene expression
© Evans. 2008
Received: 16 August 2007
Accepted: 14 January 2008
Published: 14 January 2008
On a single strand of genomic DNA the number of As is usually about equal to the number of Ts (and similarly for Gs and Cs), but deviations have been noted for transcribed regions and origins of replication.
The mouse genome is shown to have a segmented structure defined by strand bias. Transcription is known to cause a strand bias and numerous analyses are presented to show that the strand bias in question is not caused by transcription. However, these strand bias segments influence the position of genes and their unspliced length. The position of genes within the strand bias structure affects the probability that a gene is switched on and its expression level. Transcription has a highly directional flow within this structure and the peak volume of transcription is around 20 kb from the A-rich/T-rich segment boundary on the T-rich side, directed away from the boundary. The A-rich/T-rich boundaries are SATB1 binding regions, whereas the T-rich/A-rich boundary regions are not.
The direct cause of the strand bias structure may be DNA replication. The strand bias segments represent a further biological feature, the chromatin structure, which in turn influences the ease of transcription.
Because of the Watson-Crick structure of DNA - A paired with T and C with G - the number of As must equal the number of Ts when the bases on both strands are counted. Although this equality does not have to be true for a single strand, Chargaff's second law refers to the equality of A/T and C/G bases on a single strand, and broadly speaking eukaryote genomes are free of intrastrand bias.
Early work on strand bias analysed prokaryote and viral genomes where strand biases have been observed and associated with origins of replication: the leading strand is found to be G-rich and T-rich, with the G-C bias often being found to be more consistent than the A-T bias [3–6].
Strand bias has been discovered at transcription start sites in plants and fungi, in animals [8, 9], and at splice sites. Strand bias has been found for long regions of DNA around actual and putative origins of replication. An analysis of nearby divergent genes concluded that both replication and transcription effects were important for strand bias in a range of eukaryotes, a result confirmed by an analysis of the bias in large vertebrate genes. Strand bias for transcribed regions has been ascribed to transcription coupled repair, but some categories of SNPs do not follow the pattern. There is a weak (~0.3) correlation between expression of human genes and strand bias. In human genes, the strand bias has been shown to be confined to non-coding regions and accentuated at boundary regions. By reversing the argument, strand bias can be used to find transcribed regions: this method predicts many more transcribed regions.
This paper has some similarities with a very recent paper by Huvet et al., which also finds that genes are more numerous and more highly expressed near origins of replication (identified by strand bias markers), with transcription and replication usually being in the same direction. However, their domains are much larger than our segments. They use data on replication timing to support the interpretation of DNA replication: this paper uses an H-rule analysis to support the interpretation of chromatin organisation. Their mathematical model is to look for "N-shaped" skew patterns: ours is to look for segments of alternating bias. Hence they do not find an equivalent to the T-rich/A-rich boundary.
The present work has its origins in a number of peculiarities in the data. Firstly, the strand bias around the transcription start site is highly variable; secondly, an average bias can be seen in the data for hundreds of thousands of bases upstream and downstream of the start site; thirdly, in a large random piece of DNA, say 500 kb (whether or not from a transcribed region), there is a large negative correlation between (A-T) and (G-C). These results can occur because there are long-range correlations between bases. These peculiarities have led to the hypothesis that the genome is composed of strand bias segments, and this paper demonstrates this. Given that there is an effect whereby transcription causes strand bias, there is a burden of proof to discharge to show that there is also a line of causality from strand bias to the placement of genes and their expression: we therefore give numerous arguments to make this point.
Although this paper emphasises that the strand bias discussed here is not caused by transcription, the main result is that there is a strand bias structure to the genome and this structure affects the placement of genes and the probability of their expression. This suggests that the strand bias structure also reflects some aspect of the chromatin structure (which in turn makes some positions advantageous for transcription): direct evidence for this is presented.
The text gives results for mouse and A:T boundaries. Similar results have been obtained for human but these have not been shown. The C:G results mirror the A:T results in nearly all respects but this has not been fully explored.
[Table: summary of strand bias segments. A) Number of segments: segments defined by algorithm. B) Median absolute AT-bias: segments defined by algorithm; segments of random position. C) Median absolute AT-skew: segments defined by algorithm; segments of random position.]
Segment lengths in the autosomal chromosomes are similar to each other with median segment lengths ranging from 62137 (chromosome 7) to 79967 (chromosome 11): the segments on the sex chromosomes are comparatively short, with median lengths 56283 for the X and 65702 for the Y. There is a relationship between AT-percentage of a segment and its length: the correlation between AT-percentage and log of the length is -0.22. Dividing segments into two according to the median AT-percentage (60%) gives a median length of 56375 for the AT-rich half and 83216 for the AT-poor half. Each segment will be A+ on one strand and T+ on the other. Because of this symmetry our later results are not a consequence of the distribution of length of segments.
Similar analyses can be made where genes are known to be on one side of the boundary and not the other and with a given direction. These analyses all give the result that the strand bias profile on both sides of the boundary is similar to the original average shown in Figure 2a, and that when an average over all segments is made, transcription gives only a small modification of the pattern - details not shown.
Another way of analysing the bias coming from transcription is to remove the bias from everywhere except for the genes. A "hybrid genome" has been constructed by taking one of the shuffled genomes from the previous section and then copying back over this genome the actual sequences of the coding genes from TSS (Transcription Start Site) to TES (Transcription End Site) including both introns and exons in their real positions. Because of the length of introns about a third of the real genome is preserved in the hybrid genome. We then ask if the results are consistent with the hypothesis that all the strand bias in these regions comes from transcription and there is no strand bias outside these regions. The algorithm finds about a third of the segments for this genome as for the real one, 8229 as against 23482: the algorithm is searching for strand bias on a much larger scale than transcription generates. There is a difference in the profile of strand bias between the real and hybrid genomes which is proved by a statistical test - see Figure 3b and key for details.
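The construction of the hybrid genome can be illustrated with a minimal sketch. It assumes a simple base-level shuffle of the whole sequence (the shuffling procedure actually used is described elsewhere in the paper and may differ), and the function name `build_hybrid` and the half-open `(start, end)` gene intervals are hypothetical choices made for this example.

```python
import random


def build_hybrid(real_seq, genes, seed=0):
    """Shuffle a sequence, then copy the real gene spans back in place.

    real_seq: chromosome sequence as a string.
    genes:    iterable of (start, end) half-open intervals running from
              TSS to TES, so introns and exons are both restored.
    The base-level shuffle is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    bases = list(real_seq)
    rng.shuffle(bases)  # destroys positional strand bias everywhere
    for start, end in genes:
        # Restore the real sequence inside each transcribed region.
        bases[start:end] = real_seq[start:end]
    return "".join(bases)


if __name__ == "__main__":
    real = "AAAAATTTTTCCCCCGGGGG" * 10
    hybrid = build_hybrid(real, [(20, 60), (120, 150)])
    print(hybrid[20:60] == real[20:60])  # gene regions are preserved
```

Any strand bias that the segment-finding algorithm detects in such a sequence can only come from the restored gene regions, which is the point of the comparison.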
Strand bias switch has been found within long vertebrate genes: this gives direct proof that transcription is not the only cause of strand bias.
The lines of argument given above prove that the segment bias is not caused by transcription. It is therefore not a circular question to ask how transcription fits into the structure defined by the segment bias. The next series of analyses discuss this question.
These results show a very strong bias, but few genes run from one kind of boundary to the other. The pattern is more pronounced for genes with CpG islands and for long genes (details not shown). In this context, the length of the gene is the number of bases from the TSS to the TES, that is the length of the raw unspliced mRNA.
[Table: Number of genes - comparisons with hybrid genome - upstream versus downstream. Entries: Number of TSSs upstream; Number of TSSs downstream; iv = ii + iii; v = iii/iv; A - B.]
[Table: Number of genes - comparisons with hybrid genome - third quarter comparison. Entries: Number of TSSs Quarters 1, 2 and 4; Number of TSSs Quarter 3; iv = ii + iii; v = iii/iv; A - B.]
[Table: Expression levels on different segments. Entry: level of expressed gene.]
The proportion of DNA that is transcribed "with the flow" of the strand bias has been calculated as follows. As a gene may cross several segment boundaries, the number of bases on the T+ strand and the number on the A+ strand were counted for each gene. The number of bases was then totalled by strand. The result is that the number of transcribed bases on the T+ strand is 77% of all transcribed bases. If the number of bases is weighted by the average expression level of the gene then the proportion rises to 82%. If transcription were the cause of the bias one would expect a value close to 100%.
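The counting described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the paper: segments are given as `(start, end, fwd_bias)` with `fwd_bias` recording whether the forward strand is A+ or T+, genes as `(start, end, strand, weight)`, and a gene base is counted as lying on the T+ strand when the strand on which the gene is annotated is T+ in that segment. All names are hypothetical.

```python
def with_flow_fraction(segments, genes):
    """Weighted fraction of transcribed bases that lie on the T+ strand.

    segments: list of (start, end, fwd_bias); fwd_bias is "A" or "T" and
              describes the forward strand of that segment.
    genes:    list of (start, end, strand, weight); strand is "+" or "-".
              Use weight = 1.0 for a plain base count, or the gene's
              average expression level for the weighted version.
    """
    t_plus = 0.0
    total = 0.0
    for g_start, g_end, strand, weight in genes:
        for s_start, s_end, fwd_bias in segments:
            overlap = min(g_end, s_end) - max(g_start, s_start)
            if overlap <= 0:
                continue  # gene does not touch this segment
            # The reverse strand carries the opposite bias to the forward one.
            if strand == "+":
                gene_bias = fwd_bias
            else:
                gene_bias = "T" if fwd_bias == "A" else "A"
            total += weight * overlap
            if gene_bias == "T":
                t_plus += weight * overlap
    return t_plus / total if total else 0.0


if __name__ == "__main__":
    segs = [(0, 100, "A"), (100, 200, "T")]
    # One gene straddling the boundary: half its bases are on each strand type.
    print(with_flow_fraction(segs, [(50, 150, "+", 1.0)]))
```

The nested loop is quadratic and is only meant to make the bookkeeping explicit; a genome-scale run would use sorted intervals.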
Touchon et al.  is one of a number of papers (compare ) to report an average strand bias when sequences are aligned by the transcription start site or end site: for example their AT-skew measure, (A-T)/(A+T), jumps to about 5% at the TSS. The main argument that these are transcription caused biases was the comparison with the near absence of average bias in the upstream region. When allowance is made for the different measures of the bias the result is similar to Figure 4. However, the same figure shows that the strand bias discussed here is different in kind from the transcription associated strand bias.
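For reference, the two skew measures whose formulas appear in this paper, the AT-skew (A-T)/(A+T) and the four-base measure ((A-T)+(C-G))/(A+C+G+T) given in the abbreviations, can be computed from a sequence as in the sketch below. The function names are hypothetical, and returning zero for a sequence with no countable bases is a choice made for this example.

```python
def at_skew(seq):
    """AT-skew, (A - T) / (A + T), for one strand of a sequence."""
    a, t = seq.count("A"), seq.count("T")
    return (a - t) / (a + t) if (a + t) else 0.0


def four_base_skew(seq):
    """((A - T) + (C - G)) / (A + C + G + T) for one strand of a sequence."""
    a, c = seq.count("A"), seq.count("C")
    g, t = seq.count("G"), seq.count("T")
    n = a + c + g + t
    return ((a - t) + (c - g)) / n if n else 0.0


if __name__ == "__main__":
    print(at_skew("AAAT"))         # three As and one T
    print(four_base_skew("AACG"))  # C and G cancel, leaving the A excess
```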
In many cases one can get a better predictor of the strand bias of individual genes, merely by using the knowledge of the position of the gene with respect to the segment boundaries defined here. For each base take the nearest A+/T+ or T+/A+ boundary and associate with this base the average AT-bias for that position using the line from Figure 2a or 2b. The predictor is the average of these scores over the length of the gene. Far away from the boundary, this predictor is not useful and the analysis excludes those genes (about a quarter of the whole) which extend beyond 100 k bases from either type of boundary. The correlation between the ACGT-skew and this predictor is 0.24 (n = 14195). If one considers only genes longer than 10 kb the correlation is 0.43 on 9040 genes - see Figure 16b - which is a better correlation than Majewski's result on 374 genes. If the bias is measured by the average AT-bias over the whole length of the gene from TSS to TES, the correlation with the predictor is 0.31 (n = 18479) and 0.48 (n = 9040) for long genes. A predictor based on the red and green lines of Figure 6 and its T+/A+ equivalent performs slightly better than the one described.
The direct cause of the strand bias observed in this paper is not known, but an appealing theory is that the strand bias comes from the mechanism of DNA replication and that the A+/T+ boundaries are origins of replication. There are several reasons to think this may be so. Strand asymmetries of this type have been observed at origins of replication in bacterial and viral genomes, i.e. the leading strand is G+ and often T+ [3–6], and as these references explain there is an asymmetry between the strands in the mutation/repair processes which gives a physical explanation of the strand bias. This process can be expected to affect almost all of the genome. Touchon et al. examined the region 100 kb either side of a number of human origins of replication and found this effect in six out of nine examples. Although this statistic is inconclusive, the same research group has developed the argument in [19, 29] and the theory remains attractive.
The finding that 82% of transcription is with the flow of the strand bias adds weight to this suggestion. In almost all prokaryotes studied there is a bias in that the direction of transcription is the same as that of replication. A possible reason for this is to avoid a molecular collision between the replication and transcription machinery. A simple gene count does not suggest a very strong bias, e.g. 55%:45% for E. coli, but the bias is stronger when the volume of expression is taken into account. However, influences such as essentiality or transcription interruption are involved for E. coli as well as expression levels, so that less than 100% of gene expression "with the flow" is plausible when considering the relationship between replication and transcription for mouse. Experimental work has shown that transcription against the flow of the replication machinery is associated with replication fork pause and with chromosome recombination, which would be generally detrimental to the organism. I am grateful to Sascha Ott for this line of argument (personal communication, 2006).
Estimates for the size of replicons (the region of DNA controlled by one origin of replication) fall into two groups: those agreeing with the traditional view that replicons are comparatively small (around 50 kb to 300 kb, around 100 kb for animals, common sizes between 50 kb and 100 kb), and those quoting a larger size (mammalian average up to 500 kb, 1 Mb–2 Mb [11, 29], mean around 1.2 Mb). In our analyses, a replicon extends from one T+/A+ boundary to the next and has a median size of 160 kb and a mean size of 220 kb. The modal value is around 100 kb (Figure 1b). Our analyses use the unknown parameter s = 50k: this is a plausible value (because the results are consistent and there is a symmetry in the positions of TSS and TES of genes, Figures 8 and 9). The value of s is uncertain, but these results will be upper bounds and they argue against the larger estimates in the literature.
Another model for the relationship with DNA replication is that the direct cause of strand bias is transcription, but the placement of genes and direction of transcription is controlled by the need to keep transcription and replication in the same direction. This model has been proposed  for prokaryotes (which have only one or very few origins of replication) and where the genome is much more compact, transcription is less associated with a single gene, and all processes are in the germ-line. In the present context, this model is ruled out by the difference between the bias at segment boundaries and TSSs (Figure 4).
We have shown the mouse genome has a strand bias structure consisting of segments of alternating bias. These segments are much larger than coding genes. These segments influence the placement of genes, their length, the probability that a gene is expressed, and the size of the expression level. These effects are not caused by transcription even though transcription itself causes a strand bias effect. Although the direct cause of the bias may be DNA replication, the strand bias in question represents a further biological structure, such as the spatial organisation of the chromatin. The H-rule analysis gives direct evidence for this proposal.
A region may be mostly T+ but contain an A+ sub-region. This region might be defined to be one T+ segment, or one A+ segment and two T+ segments. In order to choose between these possibilities, we use a parameter, s bases, called the characteristic scale, to indicate the size of the feature of interest. At an A+/T+ segment boundary, there should be more As than Ts in the window upstream of the boundary and more Ts than As in the window downstream of the boundary. The simplest operational definition would be the position where the sum of these counts is a local maximum, but this definition would depend on the exact distribution of bases around the far edge of the window as much as at the near edge. Exponentially weighted moving averages have been used to soften the effect of the window boundary at the far edge from the candidate segment boundary. To prevent the size of the bias being an artefact of the AT% of the region, the average bias is defined as the weighted bias divided by the weighted count of the number of A and T bases. The absolute value of the average bias is required to be greater than a threshold value in both upstream and downstream windows, thus allowing an element of statistical significance to be included. The condition that T+/A+ and A+/T+ boundaries alternate has been imposed by removing all but the most extreme of consecutive boundaries of the same type. We have experimented with other ways of selecting the boundaries and obtained results similar in kind, but the adopted procedure has the advantage of not imposing a hard limit on the segment size.
The following equations give a precise description of the method. The exponential weighting factor w is defined by 1 - w = 2/(s + 1). With this value of w, a window of size s contributes 85% of the sum of the weights in an infinite window. However, to minimise a small artefact coming from the finite size of the window, a larger window, N, has been chosen as N = 2s. This means that any segment boundary must be at least N bases from each end of the chromosome. For each base i in a chromosome, where N ≤ i ≤ G - N and G is the chromosome length, we calculate a window score, S_L[i], for the window extending N bases to the left, and likewise a score S_R[i + 1] for the window extending N bases to the right. This window score is defined by the following steps:
Let j be any position in the chromosome, then define m[j] and c[j] as:
m[j] = 1, if the base at position j is A; m[j] = -1, if the base is T; and m[j] = 0 for other possibilities; and
c[j] = 1, if the base at position j is A or T; and c[j] = 0 for other possibilities.
The window score in each window is then defined as the average bias, that is, the exponentially weighted sum of m[j] over the window (B_L[i] for the left window, B_R[i + 1] for the right) divided by the corresponding weighted sum of c[j] (C_L[i] and C_R[i + 1]):
S_L[i] = B_L[i]/C_L[i] and S_R[i + 1] = B_R[i + 1]/C_R[i + 1] (3)
The threshold values Z_L[i] and Z_R[i + 1] for the two windows depend on a parameter r, where r = 2. The value of r gives a measure of statistical control.
Candidate A+/T+ boundaries are then chosen as those positions i where
S_L[i] > Z_L[i] and S_R[i + 1] < -Z_R[i + 1] (5)
and candidate T+/A+ boundaries as those positions i where
S_L[i] < -Z_L[i] and S_R[i + 1] > Z_R[i + 1] (6)
For these positions we define a measure:
D[i] = S_L[i] - S_R[i + 1] (7)
As a convenience in the computations, if any candidate positions of the same type are within 100 bases of each other we immediately choose the one with the more extreme value of D[i]. The A+/T+ and T+/A+ candidate positions are then ordered by position. For each group of consecutive A+/T+ boundaries the one with the greatest (most positive) value of D[i] is selected and for each group of consecutive T+/A+ boundaries the one with the least (most negative) value of D[i] is chosen. The resulting boundary positions define the strand bias segments.
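The procedure can be sketched in a few lines of Python for a toy sequence. Several details here are assumptions rather than the paper's definitions: the weighted sums are taken as geometric sums with weight w^k at distance k from the candidate boundary, positions are 0-based, and the threshold is given the placeholder form Z = r/sqrt(C), since the threshold equation is not reproduced in the text above. The 100-base pre-merge is omitted because the final grouping step covers it in this small example. All function names are hypothetical.

```python
import math


def window_score(seq, positions, w):
    """Average bias S = B / C over `positions`, nearest base first.

    B sums +1 for A and -1 for T, C counts A and T bases; both are
    weighted by w**k at distance k (assumed geometric weighting).
    """
    b = c = 0.0
    weight = 1.0
    for j in positions:
        base = seq[j]
        if base == "A":
            b += weight
            c += weight
        elif base == "T":
            b -= weight
            c += weight
        weight *= w
    return (b / c if c else 0.0), c


def find_boundaries(seq, s, r=2.0):
    """Return a list of (position, kind, D) for alternating boundaries."""
    n = 2 * s                      # window size N = 2s
    w = 1.0 - 2.0 / (s + 1)        # from 1 - w = 2 / (s + 1)
    candidates = []
    for i in range(n - 1, len(seq) - n):
        s_l, c_l = window_score(seq, range(i, i - n, -1), w)
        s_r, c_r = window_score(seq, range(i + 1, i + 1 + n), w)
        if not c_l or not c_r:
            continue
        z_l = r / math.sqrt(c_l)   # placeholder threshold (assumption)
        z_r = r / math.sqrt(c_r)
        d = s_l - s_r
        if s_l > z_l and s_r < -z_r:
            candidates.append((i, "A+/T+", d))
        elif s_l < -z_l and s_r > z_r:
            candidates.append((i, "T+/A+", d))
    # Within each run of consecutive same-type candidates keep the most
    # extreme D: most positive for A+/T+, most negative for T+/A+.
    boundaries = []
    for cand in candidates:
        if boundaries and boundaries[-1][1] == cand[1]:
            if abs(cand[2]) > abs(boundaries[-1][2]):
                boundaries[-1] = cand
        else:
            boundaries.append(cand)
    return boundaries


if __name__ == "__main__":
    toy = "A" * 100 + "T" * 100 + "A" * 100
    for pos, kind, d in find_boundaries(toy, s=20):
        print(pos, kind, round(d, 3))
```

This direct version recomputes every window from scratch and is only practical for short sequences; at s = 50k a running update of the weighted sums would be needed.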
We are interested in large scale effects. The following values of the parameters have been used for the results presented in this paper: s = 50k bases, 1 - w = 2/(s + 1) and window size N = 100k bases. A wide range of values has been analysed and found to give similar results. As the scale is increased, the algorithm picks out fewer but more extreme examples of segments which are longer and show greater bias.
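As a quick check on these parameter values, the share of the total weight that falls inside a window can be computed directly. The sketch assumes geometric weights w^k at distance k from the boundary, so that the first n terms carry a fraction 1 - w^n of the infinite sum; `weight_fraction` is a hypothetical helper name.

```python
def weight_fraction(s, window):
    """Share of the infinite geometric weight sum carried by the first
    `window` terms, with w defined by 1 - w = 2 / (s + 1)."""
    w = 1 - 2 / (s + 1)
    return 1 - w ** window


if __name__ == "__main__":
    s = 50_000
    print(round(weight_fraction(s, s), 3))      # window of size s
    print(round(weight_fraction(s, 2 * s), 3))  # window of size N = 2s
```

Under this assumption a window of size s carries about 0.865 of the weight, in line with the roughly 85% quoted in the method, and the window N = 2s carries about 0.982, which shows why the finite-window artefact is small.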
Although the expression level of a gene is affected by a large number of variables (age of the organism, the position within the organism, phase of the cell cycle, environmental stress, etc.) and is highly variable, it is useful to consider average expression levels. Three variables have been used: a) the probability of expression, (number of experiments in which a gene is expressed divided by number of experiments), b) the average expression level if it is expressed (sum of the gene's expression levels over all experiments divided by number of experiments in which it is expressed), and c) its average expression level (sum of gene's expression levels divided by total number of experiments): for an individual gene a × b = c. These have been estimated from the data deposited with GEO : a microarray chip was chosen (i.e. a GEO platform with a GPL number) and every corresponding GSM file (i.e. set of results) was used which had rows for all probes and columns for probe-id, expression-level and either present-absent-call or detection-probability. The column present-absent-call was used if available, otherwise the detection-probability was converted to a call using a threshold of 0.04. The expression level for each chip has been recalibrated by setting the expression level for absent probes to zero, and normalising the total expression level of the present probes on the chip to unity. The platform was chosen to be an Affymetrix chip and the probes have been associated with an ENSEMBL gene using the match with the probe-id given by ENSEMBL . Where several probes have been matched to a gene, the average value for the probes has been used: where one probe has been matched to several genes, the call for the probe has been given to each gene but the expression level for the probe has been shared amongst the genes.
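The three expression variables and the per-chip recalibration can be illustrated with a short sketch. It assumes the data have already been reduced to one expression level and one present/absent call per experiment; the function names are hypothetical and the mapping of probes to genes is left out.

```python
def normalise_chip(levels, present):
    """Set absent probes to zero and scale present probes to sum to one."""
    zeroed = [lv if p else 0.0 for lv, p in zip(levels, present)]
    total = sum(zeroed)
    return [lv / total for lv in zeroed] if total else zeroed


def expression_summary(levels, present):
    """Return (a, b, c) for one gene across experiments.

    a: probability of expression,
    b: average level over experiments in which the gene is expressed,
    c: average level over all experiments, so that a * b = c.
    """
    n = len(levels)
    expressed = [lv for lv, p in zip(levels, present) if p]
    a = len(expressed) / n
    b = sum(expressed) / len(expressed) if expressed else 0.0
    c = sum(expressed) / n
    return a, b, c


if __name__ == "__main__":
    print(normalise_chip([2.0, 6.0, 5.0], [True, True, False]))
    a, b, c = expression_summary([0.2, 0.0, 0.4, 0.0],
                                 [True, False, True, False])
    print(a, round(b, 3), round(c, 3), abs(a * b - c) < 1e-12)
```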
The data for the chromosomal sequence, the list of genes and their TSSs and TESs has been taken from ENSEMBL, which means that for each gene the transcribed unit has been taken to be the union of all alternative transcripts. The analysis includes all protein coding genes but excludes mitochondrial genes.
The mouse analysis is based on sequence assembly NCBIM36 and GEO platform GPL339, where 1744 GSM files had sufficient data to be used. This platform has 22690 probe-sets. Information on mouse genes was taken from ENSEMBL 45.
ACGT-skew = ((A-T)+(C-G))/(A+C+G+T). TSS = Transcription Start Site. TES = Transcription End Site. The A+ strand is the strand with more As than Ts, and the T+ strand is defined similarly. A DNA segment may be called the A+ segment or T+ segment, if it is clear which strand is being referred to.
I am grateful to Sascha Ott of Warwick University and Annika Hansen formerly of University College London for useful discussions, to Birkbeck College for the use of its facilities as an honorary Research Associate and to the Wellcome Trust for payment of the publication fee. I thank the referees for their comments and in particular for their advice on the kind of analysis that would be convincing.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.