### Definition of strand bias segments

A region may be mostly T+ but contain an A+ sub-region. Such a region could be described either as one T+ segment or as one A+ segment flanked by two T+ segments. To choose between these possibilities, we use a parameter of *s* bases, called the characteristic scale, which sets the size of the features of interest. At an A+/T+ segment boundary, there should be more *A*s than *T*s in the window upstream of the boundary and more *T*s than *A*s in the window downstream of the boundary. The simplest operational definition would be the position where the sum of these two excesses is a local maximum, but this definition would depend on the exact distribution of bases around the far edge of the window as much as at the near edge. Exponentially weighted moving averages are therefore used to soften the effect of the window edge that lies furthest from the candidate segment boundary. To prevent the size of the bias being an artefact of the AT% of the region, the average bias is defined as the weighted bias divided by the weighted count of A and T bases. The absolute value of the average bias is required to be greater than a threshold value in both the upstream and downstream windows, which introduces an element of statistical significance. The condition that T+/A+ and A+/T+ boundaries alternate is imposed by removing all but the most extreme of consecutive boundaries of the same type. We have experimented with other ways of selecting the boundaries and obtained results similar in kind, but the adopted procedure has the advantage of not imposing a hard limit on the segment size.

The following equations give a precise description of the method. The exponential weighting factor *w* is defined by 1 - *w* = 2/(*s* + 1). With this value of *w*, a window of size *s* contributes about 85% of the sum of the weights in an infinite window. However, to minimise a small artefact arising from the finite size of the window, a larger window, *N* = 2*s*, has been chosen. This means that any segment boundary must be at least *N* bases from each end of the chromosome. For each base *i* in a chromosome, where *N* ≤ *i* ≤ *G* - *N* and *G* is the chromosome length, we calculate a window score *S*_{L}[*i*] for the window extending *N* bases to the left, and likewise a score *S*_{R}[*i* + 1] for the window extending *N* bases to the right. This window score is defined by the following steps:

Let *j* be any position in the chromosome, then define *m*[*j*] and *c*[*j*] as:

*m*[*j*] = 1, if the base at position *j* is A; *m*[*j*] = -1, if the base is T; and *m*[*j*] = 0 for other possibilities; and

*c*[*j*] = 1, if the base at position *j* is A or T; and *c*[*j*] = 0 for other possibilities.
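As an illustration, the two indicator sequences can be computed in a few lines. This is a minimal sketch: the function name `encode_bases` is hypothetical, and treating lower-case (soft-masked) bases like upper-case ones is an assumption not stated in the text.

```python
def encode_bases(seq):
    """Return the lists (m, c) for a DNA string.

    m[j] = +1 for A, -1 for T, 0 otherwise
    c[j] = 1 for A or T, 0 otherwise
    """
    seq = seq.upper()  # assumption: soft-masked bases count like upper-case ones
    m_map = {"A": 1, "T": -1}
    m = [m_map.get(base, 0) for base in seq]
    c = [1 if base in ("A", "T") else 0 for base in seq]
    return m, c


m, c = encode_bases("ATGCNTA")
print(m)  # [1, -1, 0, 0, 0, -1, 1]
print(c)  # [1, 1, 0, 0, 0, 1, 1]
```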

Variables are defined in pairs, with suffix *L* referring to the left-hand window and suffix *R* to the right-hand window. The weighted bias in each window is defined to be:

and the weighted count of the number of A and T bases is defined to be:
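Written out under the definitions above, the two weighted sums would take a form such as the following. This is a reconstruction offered as an assumption: the index convention (weight *w*^k, with k = 0 at the base nearest the candidate boundary and k running over the *N* bases of the window) is inferred from the description of the exponential weighting and is not copied from the original displayed equations. The tags (1) and (2) are chosen only to match the numbering of the later equations.

```latex
B_L[i] = \sum_{k=0}^{N-1} w^{k}\, m[i-k], \qquad
B_R[i+1] = \sum_{k=0}^{N-1} w^{k}\, m[i+1+k] \tag{1}

C_L[i] = \sum_{k=0}^{N-1} w^{k}\, c[i-k], \qquad
C_R[i+1] = \sum_{k=0}^{N-1} w^{k}\, c[i+1+k] \tag{2}
```

With 1-based positions, the condition *N* ≤ *i* ≤ *G* - *N* keeps every index in these sums inside the chromosome.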

The window score in each window is then defined as the average bias:

$$S_L[i] = \frac{B_L[i]}{C_L[i]} \quad\text{and}\quad S_R[i+1] = \frac{B_R[i+1]}{C_R[i+1]} \tag{3}$$

A threshold for each window is defined by:

where *r* = 2. The value of *r* gives a measure of statistical control.

Candidate A+/T+ boundaries are then chosen as those positions *i* where

$$S_L[i] > Z_L[i] \quad\text{and}\quad S_R[i+1] < -Z_R[i+1] \tag{5}$$

and candidate T+/A+ boundaries as those positions *i* where

$$S_L[i] < -Z_L[i] \quad\text{and}\quad S_R[i+1] > Z_R[i+1] \tag{6}$$

For these positions we define a measure:

$$D[i] = S_L[i] - S_R[i+1] \tag{7}$$
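Conditions (5) and (6) and the measure (7) translate into a simple classification step. In the sketch below the scores and thresholds are taken as precomputed inputs, so no particular form of the threshold is assumed. The function name `candidate_boundaries` and the dictionary layout (keyed by position, with right-hand values stored under *i* + 1) are illustrative choices, not part of the published method.

```python
def candidate_boundaries(S_L, S_R, Z_L, Z_R):
    """Classify positions by conditions (5) and (6) and attach D[i] from (7).

    S_L, Z_L : dicts keyed by position i
    S_R, Z_R : dicts keyed by position i + 1
    Returns a list of (i, kind, D) tuples ordered by position.
    """
    candidates = []
    for i in sorted(S_L):
        if i + 1 not in S_R:
            continue  # no right-hand window available for this position
        sl, sr = S_L[i], S_R[i + 1]
        zl, zr = Z_L[i], Z_R[i + 1]
        if sl > zl and sr < -zr:
            kind = "A+/T+"   # condition (5)
        elif sl < -zl and sr > zr:
            kind = "T+/A+"   # condition (6)
        else:
            continue
        candidates.append((i, kind, sl - sr))  # D[i] = S_L[i] - S_R[i + 1]
    return candidates


S_L = {10: 0.5, 11: -0.5, 12: 0.0625}
S_R = {11: -0.25, 12: 0.25, 13: -0.5}
Z_L = {10: 0.125, 11: 0.125, 12: 0.125}
Z_R = {11: 0.125, 12: 0.125, 13: 0.125}
print(candidate_boundaries(S_L, S_R, Z_L, Z_R))
# [(10, 'A+/T+', 0.75), (11, 'T+/A+', -0.75)]
```

Position 12 is dropped because its left-hand score does not exceed the threshold in either direction. By construction *D* is positive for A+/T+ candidates and negative for T+/A+ candidates.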

As a computational convenience, if any candidate positions of the same type are within 100 bases of each other, we immediately choose the one with the more extreme value of *D*[*i*]. The A+/T+ and T+/A+ candidate positions are then ordered by position. For each group of consecutive A+/T+ boundaries, the one with the greatest (most positive) value of *D*[*i*] is selected, and for each group of consecutive T+/A+ boundaries, the one with the least (most negative) value of *D*[*i*] is selected. The resulting boundary positions define the strand bias segments.
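The alternation step can be sketched as follows. This is a minimal illustration, assuming candidates are supplied as `(position, kind, D)` tuples; the name `select_alternating` is hypothetical. The 100-base pre-merge is omitted here because it is described as a computational convenience.

```python
from itertools import groupby


def select_alternating(candidates):
    """Keep one boundary per run of consecutive same-type candidates.

    candidates : iterable of (position, kind, D), kind is "A+/T+" or "T+/A+"
    Returns the selected boundaries ordered by position; types alternate.
    """
    ordered = sorted(candidates, key=lambda cand: cand[0])
    selected = []
    # groupby collects consecutive candidates that share the same type.
    for kind, run in groupby(ordered, key=lambda cand: cand[1]):
        run = list(run)
        if kind == "A+/T+":
            best = max(run, key=lambda cand: cand[2])  # most positive D
        else:
            best = min(run, key=lambda cand: cand[2])  # most negative D
        selected.append(best)
    return selected


cands = [
    (1000, "A+/T+", 0.4),
    (1500, "A+/T+", 0.6),
    (4000, "T+/A+", -0.3),
    (4200, "T+/A+", -0.5),
    (9000, "A+/T+", 0.2),
]
print(select_alternating(cands))
# [(1500, 'A+/T+', 0.6), (4200, 'T+/A+', -0.5), (9000, 'A+/T+', 0.2)]
```

Because each run is collapsed to a single boundary, no limit is placed on the distance between successive boundaries, which matches the remark that the procedure imposes no hard limit on segment size.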

We are interested in large-scale effects. The following parameter values have been used for the results presented in this paper: *s* = 50k bases, 1 - *w* = 2/(*s* + 1) and window size *N* = 100k bases. A wide range of values has been analysed and found to give similar results. As the scale is increased, the algorithm picks out fewer but more extreme segments, which are longer and show greater bias.
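The effect of these parameter choices on the weighting can be checked numerically. The sketch below uses the closed form of a geometric series, so it assumes the weights are *w*^k for k = 0, 1, 2, ...; the function name `weight_fraction` is hypothetical.

```python
def weight_fraction(s, n_bases):
    """Fraction of the infinite weight sum carried by the nearest n_bases.

    The weights w**k sum to (1 - w**n) / (1 - w) over the first n bases and
    to 1 / (1 - w) over an infinite window, so the ratio is 1 - w**n.
    """
    w = 1.0 - 2.0 / (s + 1)  # from 1 - w = 2/(s + 1)
    return 1.0 - w ** n_bases


s = 50_000
print(round(weight_fraction(s, s), 3))      # share of weight within s bases
print(round(weight_fraction(s, 2 * s), 3))  # share of weight within N = 2s bases
```

For *s* = 50 000 the first value is close to 1 - e^-2 (about 0.86), in line with the figure of roughly 85% quoted for a window of size *s*, and the second is close to 1 - e^-4 (about 0.98). This shows why truncating the window at *N* = 2*s* leaves only a small residual artefact.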

### Data sources

Although the expression level of a gene is affected by a large number of variables (age of the organism, position within the organism, phase of the cell cycle, environmental stress, etc.) and is highly variable, it is useful to consider average expression levels. Three variables have been used: a) the probability of expression (number of experiments in which the gene is expressed divided by the total number of experiments); b) the average expression level when expressed (sum of the gene's expression levels over all experiments divided by the number of experiments in which it is expressed); and c) the average expression level (sum of the gene's expression levels divided by the total number of experiments). For an individual gene, *a* × *b* = *c*. These have been estimated from the data deposited with GEO [47]: a microarray chip was chosen (i.e. a GEO platform with a GPL number), and every corresponding GSM file (i.e. set of results) was used that had rows for all probes and columns for probe-id, expression-level and either present-absent-call or detection-probability. The present-absent-call column was used if available; otherwise the detection-probability was converted to a call using a threshold of 0.04. The expression level for each chip has been recalibrated by setting the expression level of absent probes to zero and normalising the total expression level of the present probes on the chip to unity. The platform was chosen to be an Affymetrix chip, and the probes have been associated with ENSEMBL genes using the probe-id matches given by ENSEMBL [48]. Where several probes matched one gene, the average value over the probes has been used; where one probe matched several genes, the call for the probe has been given to each gene but the expression level of the probe has been shared amongst the genes.
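The three summary variables can be illustrated for a single gene as follows. This is a sketch under stated assumptions: the inputs are taken to be per-experiment levels that have already been normalised per chip, together with the matching present/absent calls, and the function name `expression_summary` is hypothetical.

```python
def expression_summary(levels, present):
    """Return (a, b, c) for one gene across a non-empty set of experiments.

    levels  : expression level per experiment (already normalised per chip)
    present : matching list of booleans from the present/absent call
    """
    n = len(levels)
    expressed = [lvl for lvl, call in zip(levels, present) if call]
    total = sum(expressed)          # absent calls contribute zero
    a = len(expressed) / n          # probability of expression
    b = total / len(expressed) if expressed else 0.0  # mean level when expressed
    c = total / n                   # mean level over all experiments
    return a, b, c


a, b, c = expression_summary([0.5, 0.25, 0.0, 0.0], [True, True, False, False])
print(a, b, c)  # 0.5 0.375 0.1875
```

In this toy example the gene is called present in two of four experiments, and the identity *a* × *b* = *c* holds (0.5 × 0.375 = 0.1875).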

The chromosomal sequence, the list of genes and their TSSs and TESs have been taken from ENSEMBL, which means that for each gene the transcribed unit has been taken to be the union of all alternative transcripts. The analysis includes all protein-coding genes but excludes mitochondrial genes.

The mouse analysis is based on sequence assembly NCBIM36 and GEO platform GPL339, where 1744 GSM files had sufficient data to be used. This platform has 22690 probe-sets. Information on mouse genes was taken from ENSEMBL 45.