### Investigation of the relationship between regulatory sequence motifs and expression profiles

Kernel canonical correlation analysis (kernel CCA) is a nonlinear version of CCA in which the kernel trick is used to find nonlinearly correlated features from two datasets [15–17]. CCA is a classical multivariate statistical method for finding linearly correlated features from a pair of datasets [36]. Given a pair of multivariate variables **x** and **y**, CCA finds a pair of linear transformations such that the correlation coefficient between the extracted features is maximized. However, if the relationship between the variables is nonlinear, CCA does not always extract useful features.
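For reference, the linear CCA objective can be written compactly. The projection vectors $\mathbf{a}$ and $\mathbf{b}$ and the covariance matrices $\Sigma$ are notation assumed for this sketch:

$$
\rho = \max_{\mathbf{a},\,\mathbf{b}} \operatorname{corr}\left(\mathbf{a}^{\top}\mathbf{x},\, \mathbf{b}^{\top}\mathbf{y}\right)
     = \max_{\mathbf{a},\,\mathbf{b}} \frac{\mathbf{a}^{\top}\Sigma_{xy}\,\mathbf{b}}{\sqrt{\mathbf{a}^{\top}\Sigma_{xx}\,\mathbf{a}}\;\sqrt{\mathbf{b}^{\top}\Sigma_{yy}\,\mathbf{b}}}
$$

Here $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariance matrices of **x** and **y**, and $\Sigma_{xy}$ is their cross-covariance.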

Kernel CCA overcomes this limitation by first projecting the data into a higher-dimensional feature space. Whereas CCA is limited to linear features, kernel CCA can capture nonlinear relationships. Kernel CCA has been used in several applications, including text retrieval and biological data analysis [15, 37].

Figure 1 illustrates the basic scheme of the kernel CCA for our integrated analysis of DNA sequence motif and gene expression data. Using kernel CCA, we tried to find maximally correlated features between the gene expression and the sequence motifs. Here, a gene set $\mathbf{X}$ is represented by two separate profiles in terms of its transcriptional behaviour and upstream sequences, $\mathbf{x}_{\mathrm{exp}}$ and $\mathbf{x}_{\mathrm{seq}}$. These are composed of the expression profile, $\mathbf{x}_{\mathrm{exp}} = (e_1, e_2, \ldots, e_N)$, and the sequence profile, $\mathbf{x}_{\mathrm{seq}} = (m_1, m_2, \ldots, m_M)$, of each gene. Here $e_i$ ($1 \le i \le N$) is the expression value of the gene in the $i$-th sample or experimental condition from the microarray dataset, and $m_j$ ($1 \le j \le M$) denotes the occurrence frequency of the $j$-th sequence motif in the upstream region of the gene. For the detection of correlated features between the two datasets, $\mathbf{x}_{\mathrm{exp}}$ and $\mathbf{x}_{\mathrm{seq}}$ are first mapped to Hilbert space, $H$, by function $\varphi$. That is, each $\mathbf{x}$ is projected into two directions, $f_{\mathrm{exp}}$ and $f_{\mathrm{seq}}$, in Hilbert space according to its representation:
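Assuming the usual kernel CCA construction, in which each feature is the projection of a mapped profile onto its direction, this can be written as

$$
u_{\mathrm{exp}} = \left\langle f_{\mathrm{exp}},\, \varphi(\mathbf{x}_{\mathrm{exp}}) \right\rangle,
\qquad
u_{\mathrm{seq}} = \left\langle f_{\mathrm{seq}},\, \varphi(\mathbf{x}_{\mathrm{seq}}) \right\rangle
$$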

where $\langle \cdot, \cdot \rangle$ denotes the dot product. Kernel CCA looks for maximally correlated features between $\mathbf{x}_{\mathrm{exp}}$ and $\mathbf{x}_{\mathrm{seq}}$:
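In the standard regularized formulation, which is taken as an assumption here, the objective is

$$
\max_{f_{\mathrm{exp}},\, f_{\mathrm{seq}}}
\frac{\operatorname{cov}\left(u_{\mathrm{exp}},\, u_{\mathrm{seq}}\right)}
{\sqrt{\left(\operatorname{var}(u_{\mathrm{exp}}) + \lambda_{\mathrm{exp}} \lVert f_{\mathrm{exp}} \rVert^{2}\right)
       \left(\operatorname{var}(u_{\mathrm{seq}}) + \lambda_{\mathrm{seq}} \lVert f_{\mathrm{seq}} \rVert^{2}\right)}}
$$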

where $\lambda_{\mathrm{exp}}$ and $\lambda_{\mathrm{seq}}$ are regularization parameters, $\operatorname{var}(\cdot)$ denotes the variance, and $\operatorname{cov}(\cdot,\cdot)$ denotes the covariance between two variables. The kernel CCA solution is obtained by solving a generalized eigenvalue problem:
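One widely used form of this problem, which approximates the regularized objective and is assumed here because it matches the symbols defined in the text, is the following, with $\rho$ denoting the eigenvalue (the canonical correlation):

$$
\begin{pmatrix}
\mathbf{0} & \mathbf{K}_{\mathrm{exp}}\mathbf{K}_{\mathrm{seq}} \\
\mathbf{K}_{\mathrm{seq}}\mathbf{K}_{\mathrm{exp}} & \mathbf{0}
\end{pmatrix}
\begin{pmatrix} \alpha_{\mathrm{exp}} \\ \alpha_{\mathrm{seq}} \end{pmatrix}
= \rho
\begin{pmatrix}
(\mathbf{K}_{\mathrm{exp}} + \lambda_{\mathrm{exp}}\mathbf{I})^{2} & \mathbf{0} \\
\mathbf{0} & (\mathbf{K}_{\mathrm{seq}} + \lambda_{\mathrm{seq}}\mathbf{I})^{2}
\end{pmatrix}
\begin{pmatrix} \alpha_{\mathrm{exp}} \\ \alpha_{\mathrm{seq}} \end{pmatrix}
$$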

where $\mathbf{I}$ denotes the identity matrix, $\mathbf{K}_{\mathrm{exp}}$ is the kernel matrix for expression profiles, and $\mathbf{K}_{\mathrm{seq}}$ is the kernel matrix for sequence motifs. Given $\alpha_{\mathrm{exp}}$ and $\alpha_{\mathrm{seq}}$ as the solution of the above generalized eigenvalue problem with the largest eigenvalue, canonical correlation scores (CC scores) for $\mathbf{x}_{\mathrm{seq}}$ and $\mathbf{x}_{\mathrm{exp}}$ are estimated by $u_{\mathrm{seq}} = \mathbf{K}_{\mathrm{seq}}\alpha_{\mathrm{seq}}$ and $u_{\mathrm{exp}} = \mathbf{K}_{\mathrm{exp}}\alpha_{\mathrm{exp}}$, respectively. The CC scores are based on the low-dimensional mapping of genes in terms of the two separate representations and can be used to show the salient correlation between the two. Once we obtain the $\alpha$ vectors, the weights of the motif and expression profiles, $\mathbf{W}_{\mathrm{seq}}$ and $\mathbf{W}_{\mathrm{exp}}$, are obtained as follows:
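Assuming each direction is a linear combination of the gene profiles, which holds exactly for a linear kernel, the weights can be written as below. The matrices $\mathbf{X}_{\mathrm{seq}}$ and $\mathbf{X}_{\mathrm{exp}}$, which stack the profiles of all genes as rows, are notation assumed for this sketch:

$$
\mathbf{W}_{\mathrm{seq}} = \mathbf{X}_{\mathrm{seq}}^{\top}\,\alpha_{\mathrm{seq}},
\qquad
\mathbf{W}_{\mathrm{exp}} = \mathbf{X}_{\mathrm{exp}}^{\top}\,\alpha_{\mathrm{exp}}
$$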

A high weight for a specific sequence motif means that the motif is strongly correlated with the expression patterns of genes whose upstream regions include the motif and whose CC scores are high. If the weight of a motif has a high absolute value, the motif is more likely to play a regulatory role in the specific biological process. The kernel CCA was implemented in Matlab.
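The procedure can be illustrated with a minimal NumPy sketch. It assumes the squared-regularizer form of the generalized eigenvalue problem, centered kernel matrices, linear kernels, and weights of the form $\mathbf{X}^{\top}\alpha$. The function names, toy data, and regularization values are illustrative assumptions; this is not the Matlab implementation used in the study.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space (subtracts the feature mean)."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H

def kernel_cca(K_exp, K_seq, lam_exp=0.1, lam_seq=0.1):
    """Leading canonical pair: returns (rho, alpha_exp, alpha_seq, u_exp, u_seq)."""
    n = K_exp.shape[0]
    K_exp, K_seq = center_kernel(K_exp), center_kernel(K_seq)
    R_exp = K_exp + lam_exp * np.eye(n)
    R_seq = K_seq + lam_seq * np.eye(n)
    # Substituting beta = R alpha turns the generalized eigenvalue problem into
    # an SVD of T = R_exp^{-1} K_exp K_seq R_seq^{-1}; singular values are rho.
    A = np.linalg.solve(R_exp, K_exp @ K_seq)
    T = np.linalg.solve(R_seq, A.T).T  # right-multiply by R_seq^{-1} (symmetric)
    U, s, Vt = np.linalg.svd(T)
    alpha_exp = np.linalg.solve(R_exp, U[:, 0])
    alpha_seq = np.linalg.solve(R_seq, Vt[0])
    return s[0], alpha_exp, alpha_seq, K_exp @ alpha_exp, K_seq @ alpha_seq

# Toy data (hypothetical): 30 "genes" sharing one hidden signal z.
rng = np.random.default_rng(0)
z = rng.normal(size=30)
X_exp = np.column_stack([z + 0.1 * rng.normal(size=30), rng.normal(size=30)])
X_seq = np.column_stack([rng.normal(size=30), -z + 0.1 * rng.normal(size=30)])

# Linear kernels; a polynomial kernel could be substituted here.
rho, a_exp, a_seq, u_exp, u_seq = kernel_cca(X_exp @ X_exp.T, X_seq @ X_seq.T)

# Linear-kernel weights: column-centered data matrix transposed times alpha.
W_exp = (X_exp - X_exp.mean(axis=0)).T @ a_exp
W_seq = (X_seq - X_seq.mean(axis=0)).T @ a_seq
```

Solving through an SVD rather than a generic generalized eigensolver is one possible design choice; it keeps the computation symmetric and returns the paired vectors for both datasets in one step.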

### Preparation of the gene sequence datasets

The sequence data was used in two ways. In the first case, we used the sequences of a total of 42 known motifs (Table 1) extracted by Pilpel [9]. We then scanned the upstream regions of ORFs for the presence of these motifs using the AlignACE program [3]. The sequence profile was represented by the occurrence of these motifs in the promoters of each gene in the genome.

In the second case, we analyzed the relationship between the expression profiles and the raw upstream sequences. We extracted ~1 kb of upstream sequence for each gene and calculated the frequency of all possible *l*-mers in each sequence. For *l* = 5, there are 1,024 (= $4^{5}$) possible base combinations, so each gene is described by 1,024 frequencies. The sequence profile was encoded as the frequencies of these *l*-mers.
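The *l*-mer encoding can be sketched in a few lines of Python. The function name `lmer_profile` is hypothetical, and the sketch assumes raw counts of overlapping windows on a single strand, skipping windows that contain ambiguous bases; the text does not specify these details.

```python
from itertools import product

def lmer_profile(sequence, l=5, alphabet="ACGT"):
    """Count every overlapping l-mer in `sequence`.

    Returns a list of len(alphabet)**l counts in lexicographic l-mer order.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=l))}
    counts = [0] * len(index)
    seq = sequence.upper()
    for start in range(len(seq) - l + 1):
        i = index.get(seq[start:start + l])
        if i is not None:  # skip windows with ambiguous bases such as N
            counts[i] += 1
    return counts

# Example: a short made-up upstream fragment.
profile = lmer_profile("ACGTACGTAC", l=5)
```

Dividing each count by the number of valid windows would give relative frequencies instead of raw counts, if that normalization is preferred.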

We applied a kernel with degree parameter *d* to the sequence data. When *d* = 1, it is the linear kernel, and when *d* > 1, it is the polynomial kernel.
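A kernel of this kind can be sketched as follows, assuming the homogeneous polynomial form $k(\mathbf{x}, \mathbf{y}) = \langle \mathbf{x}, \mathbf{y} \rangle^{d}$, which reduces exactly to the linear kernel at *d* = 1. Whether the kernel used in the study included an additive constant is not stated here, so this form and the function names are assumptions.

```python
def polynomial_kernel(x, y, d=1):
    """Homogeneous polynomial kernel <x, y>**d; d=1 gives the linear kernel."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot ** d

def kernel_matrix(profiles, d=1):
    """Gram matrix over a list of equal-length sequence profiles."""
    return [[polynomial_kernel(p, q, d) for q in profiles] for p in profiles]

# Example with two tiny made-up profiles.
K = kernel_matrix([[1, 0], [0, 2]], d=2)
```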