Assessment of clusters of transcription factor binding sites in relationship to human promoter, CpG islands and gene expression

Background Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. Also unclear is the identification of TFs that have TFBS concentrated in promoters and to what level this occurs. This study hopes to answer some of these questions. Results We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group occupied 11 percent. After reducing the effect of CpG islands (CGI) against the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups including those strongly correlated with CGI and those not correlated with CGI. Conclusion Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions.

Although TFBS can be predicted by computer searches on DNA sequences, false positives (FP) are often produced. Several computer programs use position weight matrices (PWMs) [1] to predict TFBS in silico, including MatInspector [2], MATCH [3], and TFBS perl modules [4]. PWMs represent positional base preferences or frequencies constructed by a set of experimentally determined TFBS, and typically correspond to a single TF. The transcription factor database, known as TRANSFAC, is a widely used collection of PWMs [5]. TRANSFAC provides several PWMs for single TFs with different quality levels. Computer programs predict TFBS from DNA sequences, which are the same or similar to known TFBS. The low information contents in the matrices leads to many false positives, due to the weak preference or shortness of the site length (6-30 bp). Various strategies have been proposed to allow correct identification of true positives (TPs) from predicted TFBS. One approach is to employ information from conserved regions in DNA sequences between different species, known as phylogenetic footprinting. Bayes block aligner (BBA) is a tool used to extract conserved regions from an alignment of two DNA sequences [6]. It was demonstrated that it could identify binding sites of musclespecific transcription factors [6]. Another approach is to identify multiple TFBS that form a structural cluster on a DNA sequence coordinate. This seems a reasonable technique because the density of predicted TFBS in promoter sequences is reported to be higher than non-promoter sequences, especially in the region 300 bp upstream from the transcription start site [7]. Genes are regulated by interactions with multiple functional TFs in metazoans [8]. Therefore, many promoter prediction programs, such as promoterscan [9], TSSG, and TSSW [10], have been developed based on the density of TFBS. The identification of genuine TFBS by searching clusters of predicted TFBS has been successful; however, these studies were evaluated with only specific genes and TF sets, such as those found in Yeast [11], Drosophila (early developmental enhancer) [12][13][14], liver [15], LSF and muscle specific regulatory regions [16,17]. It is unknown whether this method is applicable to other species, or genes. Although many vertebrate promoter sequences have CpG islands (CGI), the relationship between clusters of predicted TFBS and CGI is often underestimated [18]. Another strategy for the identification of putative TFBS includes a combinatorial approach that uses both phylogenetic footprinting and cluster analysis [12,15,19]. The program rVISTA utilizes information from conserved regions between human and mouse, in addition to clusters of TFBS predicted by the MATCH (BIOBASE) program [19]. This approach was evaluated using several known TFs (AP-1, NFAT, and GATA-3) and genes from the cytokine gene cluster. It remains unclear whether the properties used for clusters of TFBS are general and can be applied to other TFs or regulatory regions. Several reports have described methods for determining the statistical significance of predicted TFBS [11,12,17,[20][21][22]. These studies assume the use of appropriate PWMs to identify clustered TFBS. To determine if a particular cluster is genuinely related to the promoter, it is important to assess clusters of predicted TFBS for each individual PWM. This is done using real non-promoter sequences for the appropriate selection of the PWM and for the interpretation of clusters of predicted TFBS. Most of these studies use specific sets of coregulated genes to identify common predicted TFBS clusters, and therefore cannot be applied directly to the study of general properties of promoters.
In this study, we developed a measure that evaluates the degree of concentration of predicted TFBS to clarify whether predicted TFBS have a tendency to cluster in human promoter sequences rather than in non-promoter sequences for each PWM. We identified some PWMs in which predicted TFBS clusters occur more significantly in promoter than non-promoter sequences and vice versa. Using partial correlations among three properties (promoters, CGI and clusters of predicted TFBS), we identified two PWM groups, (1) those in which TFBS cluster in promoters as a result of the presence of CpG islands, and (2) those in which TFBS cluster in promoters independent of CpG islands. We show that transcription factors corresponding to the latter PWM group tend to be tissue-specific. In summary, this analysis is useful for the interpretation of predicted TFBS in regulatory regions.

Divergent preferences of TFBS for promoter sequences
We determined whether predicted TFBS formed clusters in human promoter sequences or in non-promoter sequences for each PWM using the cluster score described in the Method section. The higher the cluster score (derived from a logarithm of the p-value), the stronger the cluster of predicted TFBS is related to the promoter sequence. The threshold T, used to determine whether a cluster of predicted TFBS is found on a sequence, was calculated simultaneously. Since a prediction for the presence of a TFBS was performed for each PWM, an assessment for TFBS clusters was performed using the cluster score for each PWM. As a result, a number of PWMs do not tend to have clusters of TFBS in the promoter sequence. We observed a divergence of cluster scores. Of the 199 vertebrate PWMs in TRANSFAC, 94 (47%) PWMs had significantly high cluster scores, while 22 (11%) PWMs had significantly low cluster scores. The remaining 83 (42%) PWMs did not show significant cluster scores. A p-value of 1.0% was used to identify the above PWM set with Bonferroni correction for multiple testing ([23] Section 3.8). Figure 1 shows a histogram of cluster scores. Although these results were derived from genes on chromosome 20, the results from other chromosomes were similar as described in the following subsection. PWMs with high cluster scores are shown in Table 1. Some of the PWMs have thresholds T (of accumulated score C) equal to or less than 1.0. This indicates that the occurrence of single predicted TFBS is more discriminative than clusters. Sequence logo [24] of the top three PWMs are depicted in

Cluster scores for different datasets
To assess the robustness of the cluster score we compared cluster scores for different datasets from chromosomes 20, 21 and 22, respectively. Fig. 3 shows the correlation of cluster scores between chromosomes 20 and 21 (a), and between chromosomes 20 and 22 (b). Some PWMs, the matches of which were detected on less than 50 subsequences, are not shown. The correlation coefficient points were 0.91 (a) and 0.93 (b).

Correlations among promoter sequences, CpG islands, and clusters
About half of the human coding genes have a compositional bias for CGI over transcription start sites [25]. It is possible that some of the predicted TFBS clusters might be the consequence of the existence of a CGI. To investigate this possibility, we computed partial correlation coefficients of three categories of promoters, CGI and predicted TFBS clusters. In general, a partial correlation coefficient measures the correlation between any pair of variables when other, specified variables, have been held constant. For example, a partial correlation coefficient r IC.P is the correlation between I and C while controlling for P, where I denotes CGI, C denotes accumulated score (strength of predicted TFBS clusters, see Methods) and P denotes promoters. If we calculate simple correlation coefficients, r PI = 0.69, r IC ranged from -0.25 to 0.57 for various PWMs, and r PC ranged from -0.24 to 0.53 for various PWMs. These correlation coefficients are apparent ones. The partial correlation coefficients provide essential information and pure correlations, without the effect of the third variable. Fig. 4 shows a plot of r IC.P against r PC.I for various PWMs. For most of the PWMs, r PC.I is positive, although not particularly high (<0.3). This implies a correlation between clusters of these PWM matches and promoter sequences, separate to the effect of CGI. For the PWMs in the right circle in Fig. 4, r PC.I is high and r IC.P is approximately zero, where the cluster is more correlated with the promoter than the CGI. Some PWMs have a negative r PC.I , implying the absence of promoter sequences for these PWM matches. For the PWMs in the top circle in Fig. 4, r IC.P is high and r PC.I is approximately zero, suggesting that the correlation between promoters and clusters for these PWM matches is attributable to the presence of the CGI. While these promoters and clusters do not correlate directly, they appear to correlate because both are associated with CGI.
Using these two values, we identified two PWM sets, (1) a CGI-related set (37 PWMs, Table 2) in which TFBS clusters are correlated with CGI (independent of promoter), and (2) a CGI-independent set (54 PWMs, Table 3) in which clusters of TFBS are correlated with promoters (independent of CGI). These sets were used for the following analysis.

Correlation between clusters of predicted TFBS and gene expression
Since all widely expressed, or housekeeping, genes have CGI [25], it is possible that clusters of PWM matches for CGI-independent sets are associated with tissue specific promoters. For this reason we examined the relationship between clusters of PWM matches and the tissue specificity of the associated genes using published gene expression data ( [26,27]). The two resources used for this analysis are not consistent. Some genes annotated as housekeeping genes in one resource are referred to as tissue specific in another resource. We refer to these genes as mixed annotated genes. Genes with associated expression data were analysed and of these 72 were identified among the gene set covering the three chromosomes used in this study. They included 12 housekeeping genes, 9 mixed annotated genes, and 51 tissue specific genes. With the CGI-independent PWM sets we detected promoters with clusters of PWM matches. These clusters have significantly high Z-scores (see Methods) based on the accumulated score C in randomly generated DNA sequences (as control) with the same dinucleotide frequency of each promoter sequence. Table 4 shows the 40 genes detected, the DCC score (described below), their tissue specificity and start_p score. These genes are sorted according to the DCC score indicating the extent of association with CGI-independent PWMs over CGI-related PWMs. Results show that tissue specific genes tend to have high DCC scores and that transcription factors corresponding to CGI-independent PWMs are related to tissue specific genes. If we extract A histogram of cluster scores for PWMs 20 genes which have DCC scores equal to or higher than -0.03, 18 (90%) of these are tissue specific genes. The 72 genes with known gene expressions data included 51 tissue specific genes (71%). The p-value of the event of extraction was 0.04 under cumulative hypergeometric distribution. The p-value of the ranking of the two groups (11 housekeeping and 29 tissue-specific) in Table 4 was 0.01 by Wilcoxon rank test. Note that DCC is not correlated with the CGI score (start_p).

Discussion
Clusters of TFBS are an important property of regulatory regions [7,8,19,28]. To determine if this is a general tendency for PWM matches and all protein coding genes, we have developed a measure that evaluates the correlation between predicted TFBS concentrations and promoter sequences. We then examined the correlation for individual PWMs using an unbiased sequence set. Our results show that not all TFBS are clustered in promoter Sequence logos Figure 2 Sequence logos. (a) Top three PWMs from Table 1, (b) representative PWMs from Table 2, (c) representative PWMs from Table 3.
sequences. We found that TFBS clusters corresponding to 47% of PWMs are positively correlated with promoter sequences, and that TFBS clusters corresponding to around 11% of PWMs are negatively correlated with promoter sequences.
It is important to ascertain the relationship between cluster scores of PWMs and CGI, because CGI are a prominent feature of promoter sequences. The consensus sequences of the top-ranked PWMs (Table 1) are, 'ANNGACGCTNN' (WHN_B), 'TTTCSCGC' (E2F1DP1_Q6), 'NSGGGGGGGGMCN' (MAZR_01), and 'GGGGAGGG' (MAZ_Q6), where S represents C or G, M represents A or C, and N represents any bases. The sequence logos of the PWMs are depicted in Fig. 2-(a). The G+C % of base composition of each matrix is 70%, 56%, 91%, and 86%, respectively. The sequences with high cluster scores appear to be GC-rich. Larsen et al found that 57% of human genes are associated with CGI, that all housekeeping genes have CGI covering transcription start sites (TSS), and that 40% of tissue specific genes have CGI [25]. Therefore, the association of PWM-PCP with CGI may be significant, and CGI-related PWMs may play important roles in housekeeping regulation.
To evaluate the relationship between PWM-PCP and CGI, we calculated the partial correlation coefficient for each PWM. In general, if a correlation coefficient r XY is not small and r XY.Z (defined in Methods) ≈ 0, the probable hypotheses concerning cause and effect will be either 1) the correlation of X and Y is a consequence of Z, or 2) Z intervenes between X and Y. For the PWMs in the top circle in Fig. 4, r IC.P is high and r PC.I is approximately zero. This suggests that the correlation between promoters and TFBS clusters is attributable to the presence of the CGI and that while they do not directly correlate they appear to because both independently correlate with CGI. The characteristic PWMs in Table 2 Fig. 4 have an r IC.P of approximately zero and a high r PC.I r PC.I value showing that the cluster is correlated with promoters independent of CGI. The predicted TFBS clusters corresponding to these PWMs could not be explained by the presence of CGI. Some of these PWMs have thresholds T less than 1.0 indicating that even the single occurrence of a predicted TFBS is more discriminative than clusters. Particular examples with high recorded r PC.I values and values for r IC.P < 0.1 are TFIII_Q6 (M000706), MYOGNF1_01 (M00056) and CREL_01 (M00053). Sequence logos are depicted in Fig. 2-(c). TFIII_Q6 is a matrix associated with a general transcription factor II-I with the consensus sequence RGAGGKAGG, where the K represents G or T. The matrix TFIII_Q6 contains many 'G', and 'C' is allowed only the fourth position with low frequency. MYOGNF1_01 is a matrix associated with myogenin, nuclear factor 1 or related factors, and is therefore involved in the regulation of differentiation. CREL_01 is a matrix associated with the C-Rel proto-oncogene protein (C-Rel protein). An understanding of the function of these factors is important to this study. The PWM groups described above may be involved in tissue-specific gene regulation. If all housekeeping genes have CGI [25] then genes without CGI can be assumed to be tissue-specific or rarely expressed. Thus, genes with a cluster of predicted TFBS not associated with CGI might be associated with tissue-specific regulation. Further analysis of extractions of tissue specific genes, shown in Results, supports the hypothesis. Results from this analysis provide a solution to the promoter prediction problem. Hannenhalli et al. used additional information, including profiles of TF binding sites, for promoter prediction based on CGI [29], with no significant improvement to prediction performance. The report using 7 manually selected PWMs confirmed that CGI are the most dominant feature. Our results show that Sp1 and ATF have a strong correlation with CGI; a result consistent with their result that information including both PWM did not improve prediction accuracy. This observation is consistent with other PWMs. More stringent selection of PWMs is required for an improved accuracy of promoter prediction. One strategy is to utilise the CGI-independent PWMs identified in this study. Another problem is exemplified by the under-representation of Oct-1 (M00138) in the (-600:600) region of the human promoter and the absence of positional preferences [29].
This under-representation was not expected but is observed in 10% of known PWMs. OCT1_04 (M00138) is not in the high quality list of TRANSFAC, OCT1_01 (M00135) and OCT1_C (M00210) was found to have minus cluster scores (-0.63 and -2.89) in our table (additional file 1).
It is noteworthy that Fig. 5 shows TFBS (AP2_Q6) in nonpromoters to occur randomly under a certain distribution. This distribution can be modelled by a binomial probability distribution. A model of Poisson distribution, which is an approximation of binomial probability distribution for a certain condition, was proposed in [11] as the probability distribution of TFBS density. Although we have not tested the goodness-of-fit, our observation does not contradict the Poisson distribution model. To assess the robustness of the cluster score, we compared cluster scores for different datasets from chromosomes 20, 21 and 22 (Fig. 3). The correlation coefficients were 0.91 (a) and 0.93 (b), proving that the significance would be similar if we utilized the whole human genome dataset in the analysis. The scale of the figures between the Y-axis and X-axis are different because of the different number of sequences taken from each chromosome.

Conclusions
We have developed a measure that statistically evaluates the degree of concentration of predicted TFBS in promoter sequences. Using this strategy to analyse various PWMs we have determined that predicted TFBS tend to cluster in human promoter sequences rather than in non-promoter sequences. Our results show that local concentrations of predicted TFBS in human promoter sequences are not a general characteristic of PWMs. Only a portion of identified PWM matches corresponded to TFBS occurring in clusters in promoter sequences. By computing partial correlation coefficients, we identified PWM sets associated with CGI and others that are independent of CGI. Transcription factors and binding sites associated with CGIindependent PWMs are likely to be involved in tissue-specific gene regulation. Indeed, using the CGI-related/ dependent PWM sets, we extracted tissue-specific genes with high accuracy by detecting clusters of predicted TFBS. These results will be useful to interpret predicted transcription factor binding sites and to further understand the role of their formation into clusters. Ultimately, these findings will further elucidate the various functions of promoters, genes and transcription factors.

Data
DNA sequences from the fully sequenced chromosomes (chromosomes 20, 21 and 22) were taken from the November, 2002 GenBank freeze (build 31) and assembled by NCBI, in accordance with the annotation of the UCSC genome browser [30]. RefSeq [31] genes were used as they have been reviewed by NCBI staff, are well studied, and are unlikely to be spurious. Some genes in the human genome have alternative promoters [32], complicating our analysis. For this reason, overlapping genes identified using the UCSC annotation were discarded. This check of RefSeq genes reduced the number of genes in the analysis from 527 to 373 for chromosome 20, 224 to 142 for chromosome 21, and 449 to 294 for chromosome 22. The resultant gene set U consists of 809 genes.
Title: Distribution of accumulated score C for promoters and non-promoters for AP2_Q6 Figure 5 Title: Distribution of accumulated score C for promoters and non-promoters for AP2_Q6 To increase the accuracy of the annotation of transcriptional start sites, we modified the annotation of RefSeq according to DBTSS (version 2, Mar 2002) [33], a database of transcriptional start sites for 5' end mRNA sequences. Suzuki et al. reported that a certain portion of sequences in DBTSS were longer (extended) toward 5' end of mRNA sequences than those in RefSeq [33]. We describe how the modification improved the first gene set U. Fig. 6 shows the composition of different gene collections. The RefSeq database is updated daily by increasing the number of entries and correcting others. For illustration purpose, two versions of RefSeq are shown in Fig. 6.    [34] identified 30 genes in which the 5' end sequences could not be detected. Due to the uncertainty of TSS these genes were not used in this study. Forty-one DBTSS mRNA sequences from were shorter than corresponding sequences in the new RefSeq with regard to 5' end sequences. It is assumed that these RefSeq sequences were corrected following the old RefSeq release independent of DBTSS and were used as they were. Finally, we modified the exon annotation of 202 genes according to DBTSS.
We extracted promoter sequences at relative positions (-600:-1) from the TSS, and intron subsequences 600 bp in length from genome sequences. Only intron sequences were used for the non-promoter sequence data sets as exon sequences are known to have preferences in their oligomer statistics, such as G+C % and codon bias [35]. The first intron was not included in the data set as although regulatory elements are rare in introns, intron 1 occasionally contains regulatory elements such as enhancers. We A Venn diagram of three gene sets (DBTSS, old RefSeq, and new RefSeq) Figure 6 A Venn diagram of three gene sets (DBTSS, old RefSeq, and new RefSeq). Gene sets from A to G (Bold alphabet) consist of genes in the regions bounded by the thick lines. D consists of Dn (genes whose 5' end sequences were not extended from the old RefSeq sequences with DBTSS data) and Dex (genes whose 5' end sequences were extended). G consists of Gn (genes whose 5' end sequences were not extended from the old RefSeq sequences with DBTSS data) and Gex (genes whose 5' end sequences were extended). Namely D = {Dn,Dex} and G = {Gn,Gex}. Genes in chromosomes 20, 21, 22 were denoted by }. Gene sets C u , E u ,F u , and are parts of C, E, F, Gn and Gex, respectively. Some of the numbers of the sets are given in [33], that is, |{D,G}| = 7889, |{Dex,Gex}| = 2683 and |{Dex,Gex}| / |{D,G}| = 0.34, where |{D,G}| denotes the number of genes in set {D,G}. investigated the frequency of enhancers in human introns by searching NCBI PubMed [36] with the keywords 'human', 'first intron', and 'enhancer'. This search yielded 194 papers. Replacing the keyword 'first intron' with 'second', 'third', 'fourth intron', 'fifth' or 'last intron' yielded 40, 15, 1, 1 and 6 papers, respectively. Replacing 'enhancer' with 'silencer' resulted in 281, 6, 3, 0, 0 and 0 papers, respectively. Removal of intron 1 from the data set greatly reduces the overall occurrence of regulatory elements in human intron sequences and allows our statistical analysis to be performed without significant interference from intronic regulatory sequences. Intergenic sequences are left out of the non-promoter dataset due the unknown occurrence of regulatory sequences.

Prediction of TFBS
Each promoter or non-promoter sequence was scanned by the MATCH program using 423 matrices in TRNASFAC version 6.3 (a transcription factor database) with options including 'vertebrate', 'minimize false negatives' (in cutoff selection) and 'use high quality matrices only'. As Kel et al. described, the cut-off was determined so that the false negative rate is 10% [3]. The option 'use high quality matrices only' uses approximately 70% of matrice [3]. Any PWM in the 'high quality' PWMs meet the criteria; When the PWM is used with a cut-off value which allows a false negative rate of 50%, then the match rate dropped below 1 match/kb in exon2 sequences [3]. If more than one matrix was matched to same transcription factor (prefix of "Identifier"), we selected a representative matrix with the highest quality and smallest suffix number according to the TRANSFAC definition. After scanning the sequences by MATCH, we set consecutive sampling windows (600 bp) in introns and promoter sequences, and then recorded corresponding TFBS predictions. To prevent double counting of palindromic binding sites, two matches for the same matrix at the same position was regarded as a single match and the match with the higher score was taken. Before MATCH ran, repeat sequences were masked to 'N' according to the annotation by Repeat-Masker in the UCSC genome browser. From the above analysis we extracted 361, 129, and 278 promoter sequences from chromosomes 20, 21 and 22, respectively. The promoter sequences identified contained repeat sequences (e.g. ALU, L1) and simple repeats with low complexity, as observed in intron sequences. These sequences account for about 20% of all bases. To balance the rate of repeats between promoters and introns, we discarded intron sequences with high rates of repeats, so that the average rate of repeats in the intron samples was at the same level as in promoter sequences. The number of 600 bp intron sequences included in the analysis was 6,589 (chromosome 20), 4,324 (chromosome 21) and 4,531 (chromosome 22).

Accumulated scores of TFBS
When predicted TFBS occur many times in a sequence there is a high probability that it contains functional regulatory regions or promoters [7,8,19,28]. We tested this hypothesis for individual PWMs. The degree of concentration of predicted TFBS in a sequence was defined as the accumulated score C, which is a summation of the MATCH score for PWMs in the subsequence and is calculated for each PMW and corresponding sequence. C is assumed to be almost proportional to the frequency of predicted TFBS for the corresponding PWM. Many sequences generate different C values although some are identical. We then generated a series of C j (j = 1 ... n) values for a PWM, where n is the number of different C values. Fig. 5 shows the histogram of C for promoters and non-promoters respectively, using the TRANSFAC matrix of identifier 'AP2_Q6' as an example. Since C reflects the number of predicted TFBS found in a sequence, the figure shows the density of predicted TFBS in a sequence. This result is similar to the density plot described by Pestridge and Burks [7], although our figure (Fig. 5) is not a plot of predicted TFBS density for mixed PWMs, but instead is a plot of predicted TFBS density for individual PWMs. Also, the X-axis in our plot does not indicate the number of predicted TFBS but instead indicates the accumulated score C. The Y-axis is smoothed by averaging for the width of 5 in C value.

Cluster score and statistical significance for a PWM
Significance values for an individual PWM from a series of C j can be determined from a contingency table. Table 5 shows a contingency table for the number of promoters and non-promoters above and below the threshold C j for a given PWM. From this table, χ 2 value for a given C j is defined as where described in [37]. From the χ 2 value, we computed the probability P that the χ 2 value or greater is obtained by chance. The probability P was calculated from the χ 2 .
Since P is calculated for many PWMs, we must deal with the problem of multiple testing. Using the Bonferroni correction [23], Pn was calculated using the formula 1 -P n = (1 -P) n , approximately P n = P × n for small P × n. The n is the number of PWMs. When we determine the set of significant PWMs, Pn were compared with the significance level (i.e. 0.01). We also defined the statistical significance Q j as Q j = -log 10 (P n ) if Rprom >Rnonprom and Q j = + log 10 (P n ) otherwise, where Rprom = A1/A (a rate of sequences in promoters where clusters found), Rnonprom = B1/B (a rate of sequences in non-promoters where clusters found). Although the P is an indicator of the differ-ence between the occurrence of promoters and nonpromoters, the probability P itself does not represent the preferences of PWMs for promoters. To represent the preference of predicted TFBS for or against promoters, we add signs for statistical significance Q j . Positive Q j indicates that predicted TFBS tend to appear frequently in promoters, while negative Q j indicates that predicted TFBS tend to avoid promoters.
We studied how statistical significance Q j varies with the threshold of C j. Fig. 7 shows the presence of a peak of Q j when we change the threshold. We define the cluster score S of a PWM in such a way that the significance is the maximum, namely We simultaneously define a unique threshold T of the PWM by Title: significant score Q j of matrix AP2_Q6 for different thresholds Figure 7 Title: significant score Q j of matrix AP2_Q6 for different thresholds. T argmax |Q For the all-vertebrate TRANSFAC PWMs, we determined thresholds T and calculated significance scores (or cluster scores) S. The highest scoring PWMs are listed in Table 1.

Correlations among promoter sequences, CpG islands, and clusters
For every 600 bp sequence, three numerical features (promoter, CGI, and clusters) were annotated. CGI were identified using the CpGProD program [38] for original long sequences (not for short sequences of 600 bp). Regions larger than 500 bp with a G+C % equal to or greater than 50% and 'observed CpG / expected CpG' equal to or greater than 0.60 were classified as CGI [38,39]. The CpG-ProD program outputs 'start_p' scores for the predicted CGI. This score indicates the probability that the region is a CGI located over a transcription start site (start CGI). Short 600 bp sequences sampled from long sequences containing CGI were annotated as CGI if the overlapping CGI region was longer than 300 bp. The accumulated score C was used for cluster annotation. From sequences with feature annotation, the correlation coefficients between every two of the three features were computed for each PWM by the statistical language R [40]. We use P to denote whether the sequences is promoter or not, namely P = {1,0}, and I to the denote 'start_p' score for CGI calculated using the CpGProD program [38]. A partial correlation coefficient for each PWM was calculated using the subsequences. For example, a partial correlation coefficient r PC.I is the correlation between P and C while controlling for I, defined by where r PI is a correlation coefficient between variable P and I, r PC is a correlation coefficient between variable P and C and r CI is a correlation coefficient between variable C and I. A partial correlation coefficient differs from a correlation coefficient. If the correlation between P and C depends entirely on the common cause I, then when I is constant, the correlation between P and C should be zero. The partial correlation r PC.I expresses such a relationship. Even when I varies, r PC.I is expected to be zero in such a situation, while the correlation coefficient r PC may not be zero. See chapter 16.4 in [41] for details. Fig. 4 shows a plot of r IC.P against r PC.I for various PWMs. Using these two values, we identified two PWM sets including, (1) a CGI-related set consisting of 37 PWMs in which the clusters are correlated with CGI independent of promoters, and (2) a CGI-independent set consisting of 54 PWMs, in which the clusters are correlated with promoters independent of CGI. The CGI-related set requires that r IC.P >r PC.I and that the partial correlation coefficient (PCC) r IC.P is significantly high (p < 0.01) under the hypothesis that r IC.P is zero (see below). The CGI-independent set requires that r IC.P <r PC.I and that r PC.I is likewise significantly high. To calculate the statistical significance of PCC, a PCC r was subjected to z-transformation defined as The values of z are supposed to be normally distributed and the expected variance is where n is the sample size [41]. The CGI-related PWMs and CGI-independent PWMs are listed in Tables 2 and 3.

Gene expression data
To examine the relationship between clusters of predicted TFBS and the tissue specificity of the genes where clusters were found, we generated a list of genes with expression data. This list includes 535 housekeeping or maintenance genes expressed in 11 human adult and foetal tissues from [26], and 451 housekeeping or maintenance genes available at HugeIndex database http://www.hugein dex.org [27]. We then identified 581 non-redundant housekeeping genes and 'tissue-selective' genes, which are predominantly, but not exclusively, expressed in one tissue type. Tissue-selective genes were expressed in brain (589 genes), kidney (129 genes), liver (271 genes), lung (68 genes), muscle (302 genes), prostate (45 genes) and vulva (95 genes) [27]. These genes corresponded to 2,069 RefSeq entries. Seventy-two of these genes were identified in our gene set covering chromosome 20, 21 and 22 and were used for further analysis.

Tissue specific gene detection based on clusters of predicted TFBS
Using the two PWM sets described above, we searched clusters of predicted TFBS in promoter sequences. We calculated C and statistical significance of C as follows. For each promoter sequence, 30 random 600 bp sequences were generated under the first order Markov model, which is based on dinucleotide frequency and can identify promoter CpG bias and G+C%. The MATCH program was run with given matrix thresholds. The accumulated score C of PWMs was computed in every random sequence. A mean and a variance of C for each PWM were estimated from the random sequences. Then, for the given promoter sequence S p and the PWM (M), we can run MATCH and calculate C and its significance score (Zscore), namely If the Z-score value was above three, the promoter sequence was taken and considered together with the PWM. Table 4 lists genes identified with clusters of predicted TFBS with significant C values for the CGI-related/ dependent PWM set. The computation of p-value of cumulative hypergeometric distribution was performed using the AS R77 algorithm [42].