Scoring scheme for gene regulatory relationships
The gene expression data of m genes with n samples is represented as an m × n matrix, where rows represent genes and columns represent various samples such as experimental conditions or time points in a biological process. Each element of the matrix represents the expression level of a particular gene in a particular sample. Two genes with similar expression patterns tend to be co-expressed at different time points. Figure 1 shows an example of the gene expression data for yeast genes during the yeast cell cycle, obtained from the Yeast Cell Cycle Analysis Project [6].
The gene expression matrix is analyzed for similarity between gene expressions at different time points. Three metrics are often used to measure the similarity of genes: Pearson correlation coefficient [7], Euclidean distance [8] and Spearman correlation [9]. To evaluate the regulatory relation between two genes, we modified the Pearson correlation coefficient. R 1(X, Y, i, p) in Equation 1 represents the correlation between gene X at time point i and gene Y at time point i + p. p is the time span of the gene regulation.
In Equation 1, N is the total number of time points contained in the time span, X
k
and Y
k
are the expression levels of genes X and Y at time k, and
and
are the average gene expression levels at all time points of the time span. The R1 score is in the range of [-1, 1]. Among the total i × p candidate regulations, the regulation with the maximum absolute value of R 1(X, Y, i, p) is selected as the regulatory relation between genes X and Y. If the expression level of gene X increases before that of Y increases, X is a candidate activator of gene Y; if the expression level of gene X increases before that of Y decreases, X is a candidate inhibitor of Y.
The modified Pearson correlation coefficient R 1 is useful for finding gene regulatory relations with a significant change in expression levels. But, it cannot distinguish gene regulatory relations with the same correlation but different gene expression levels (see Figure 2 for an example). Therefore, we use an additional score R 2, which is the Euclidean distance of the expression levels of the two genes. The score R 2 for the gene regulatory relation between X and Y is computed by Equation 2.
where
In Equation 2,
and
are the average gene expression levels at all time points in the time span. X
max
is the maximum gene expression value of gene X.
and
are the maximum and minimum gene expression value of gene Y, respectively.
Both R 1 and R 2 scores are intended to represent a relation between a regulator gene X and its potential receiver gene Y. The regulatory relation between X and Y is not symmetric, so R 1(X, Y) ≠ R 1(Y, X) and R 2(X,Y) ≠ R 2(Y,X). An interesting observation from the actual data is that two genes with R 2 score < 3 tend to have an inductive relation, whereas those with R 2 scores > 6 tend to have an inhibitory relation. For example, in the dataset of 30 yeast genes, 89.2% of the activations have R 2 scores < 3, and 91.4% of the inhibitions have R 2 scores > 6. In an extended dataset of 6,177 yeast genes, 80.4% of the activations have R 2 scores < 3, and 92.1% of the inhibitions have R 2 scores > 6. Hence, we consider a gene regulation with R 2 score < 3 as an activation, and that with score > 6 as an inhibition.
Inferring gene regulatory relationships
In the microarray data for gene expression, we use the log-ratio (in base 2) of the red and green intensities. Thus, genes with mRNA abundance have positive log-ratios whereas those with absence of mRNA have negative log-ratios. From the gene expression profiles, we identify the gene regulations and include them in the regulation list. In the regulation list, +A(t) indicates that gene A is up-regulated at time t, and -A(t) indicates that gene A is down-regulated at time t. The symbol '→' represents a directional relationship between genes. There are four possible gene regulatory relations between two genes:
1. +A(t1) → +B(t2): up-regulation of A at time t1 is followed by up-regulation of B at time t2 (t2 >t1).
2. -A(t1) → +B(t2): down-regulation of A at time t1 is followed by up-regulation of B at time t2 (t2 >t1).
3. +A(t1) -> -B(t2): up-regulation of A at time t1 is followed by down-regulation of B at time t2 (t2 >t1).
4. -A(t1) → -B(t2): down-regulation of A at time t1 is followed by down-regulation of B at time t2 (t2 >t1).
The regulatory relation of gene A with gene B is determined by the sign of the R 1 score of the genes. A relation with a positive R 1 score implies that gene A activates gene B whereas a regulation with a negative R 1 score implies that A inhibits B. The R 1 score of each gene regulation is iteratively calculated using Algorithm 1. For genes A and B, the regulation with the largest absolute R1 score is chosen for the regulation between the genes and represented as R 1(A, B, t1, p). Algorithm 1 provides the top-level description of the method for inferring gene regulations and constructing a list with the regulations.
Algorithm 1
Construct a regulation list
1: Compute R 1(A, B, t1, p) between gene A at time point t1 and gene B at time point t1 + p for all pairs of genes.
2: Select the regulation with the largest absolute value of R 1(A, B, t1, p).
3: If 0 <p < 6, classify the regulation into one of the four types, +A(t1) → + B(t1+p), - A(t1 ) → + B(t1+p), +A(t1) → -B(t1+p), -A(t1) → - B(t1+p), and add it to the regulation list.
4: If p = 0, two genes are co-expressed or co-inhibited, and such gene regulation is not added to the regulation list.
5: If the new gene regulation is already in the regulation list, merge it with the previous regulation.
6: Go to step 2 to find the next gene regulation until no more regulation found.
After we construct a regulation list, we compute the R 2 score for the gene pairs in the regulation list. Some local maximum or minimum values are ignored when computing the R 2 score in a long time span. For example, all the three time spans shown in Figure 3A include at least 10 time points (18 time points in alpha-factor, 24 in cdc15, and 10 in cdc28). Expression levels of gene CLB1 has several wave peaks in the time span of CDC15, but only the maximum value in sub-timespan 2 of CDC15 will be selected when computing the R 2 score in the time span of CDC15. Time spans are divided into smaller sub-timespans as follows, and the R 2 score is not computed for sub-timespans with less than 6 time points.
1. A time point with the minimum expression level of the regulator gene becomes a splitting point of the time span.
2. Each sub-timespan starts with at least 3 consecutive time points that have a positive slope of a curve representing gene expression levels, and ends with at least 3 consecutive time points with a negative slope.
3. Each sub-timespan encompasses at least 6 time points, including the start and ending time points.
For example, the time span CDC 15 of Figure 3A is the longest one in the gene regulatory relation +CLB1(T)-> +SWI5(T+1), and split into 3 sub-timespans. The first sub-timespan of CDC15 has less than 6 time points, so the R 2 score is not computed for the first sub-timespan. The R 2 scores for the second and third sub-timespans are 0.24 and 0.38, respectively. Figure 3B is another example of a gene regulatory relation +CLB6(T) -> -CLB1(T+1). The time span CDC15 is split into 4 sub-timespans, and the third sub-timespan has only 3 time points. So, the R 2 score is computed for the remaining three sub-timespans, which are 27.02, 20.34, and 9.87, respectively.
Visualization of gene regulatory networks
All gene regulations identified in the previous step are visualized as a 2-dimensional gene regulatory network, in which a node represents a gene. Edge types and edge labels of the network represent gene regulatory relations. Arrows represent inductive interactions (relations+A(t1) -> +B(t2) and-A(t1) -> +B(t2)) and blocked arrows represent inhibitory interactions (relations +A(t1) -> -B(t2) and -A(t1) -> -B(t2)). The regulator gene, type of regulation (+ for induction and - for inhibition), and time delay of the regulation are annotated as edge labels. Each edge is labeled with R/s/T to indicate a regulator gene R, sign s of the log-ratio of the expression level of R, and the time delay T of the regulation.
For visualization of gene regulatory networks, two layout algorithms have been developed: grid layout and layered layout. As described in Algorithm 2, the grid layout algorithm positions all nodes at grid points. The node with the highest degree will be placed at the center grid point (node S in Figure 4A). Then, we position all nodes connected to the center node at adjacent grid points. Nodes with a higher degree are positioned earlier than those with a lower degree in the east, north, west, south, northeast, southwest, and southeast grid point of the current node (nodes 1-8 of Figure 4). Other nodes connected to the positioned nodes are placed in the same manner.
The layered layout algorithm, described in Algorithm 3, puts all nodes to horizontal layers. The node with the maximum degree is assigned to the top layer, and the nodes connected to the node are put in the next layer. If a layer has two nodes connected to each other, it makes a new layer above the layer and moves the node with a smaller degree to the new layer. The layered layout usually takes more time than the grid layout.
Algorithm 2
Grid layout
1: Find the node with the highest degree, and place it at the center grid point.
2: If there is a tie, select a node with a higher out-degree. Position all nodes connected to the center node in the adjacent grids. Nodes with a higher degree are positioned earlier than those with a lower degree in the east, north, west, south, northeast, northwest, southwest, and south east grid of the center node. If more than 8 nodes are connected to the center node, put the 9th node to the east of the 1st node, the 10th node to the north of the 2nd node, and so on.
3: Repeat step 2 for nodes connected to nodes that are already positioned. If an adjacent grid is occupied, move to the next possible position until it is available.
4: If there are disconnected nodes, repeat step 1 for the nodes and put them to the right of the previous subgraph.
Algorithm 3
Layered layout
1: Put the node with the maximum degree at layer 1. If there is a tie, select a node with a higher out-degree.
2: Assign the nodes connected the nodes at layer i to layer i+1.
3: Repeat steps 1 and 2 for the remaining nodes.
4: If two nodes at the same layer are connected to each other, make a new layer between the layer and the upper layer and move the node with a smaller degree to the new layer. Nodes with 0 out-degrees are also moved to the new layer.
5: Order the nodes in each layer by the Barycenter method [10].
6: If there are disconnected nodes, repeat steps 1-5 for the nodes and put them to the right of the previous subgraph.