TC-motifs at the TATA-box expected position in plant genes: a novel class of motifs involved in the transcription regulation

Background The TATA-box and TATA-variants are regulatory elements involved in the formation of a transcription initiation complex. Both have been conserved throughout evolution in a restricted region close to the Transcription Start Site (TSS). However, less than half of the genes in model organisms studied so far have been found to contain either one of these elements. Indeed different core-promoter elements are involved in the recruitment of the TATA-box-binding protein. Here we assessed the possibility of identifying novel functional motifs in plant genes, sharing the TATA-box topological constraints. Results We developed an ab-initio approach considering the preferential location of motifs relative to the TSS. We identified motifs observed at the TATA-box expected location and conserved in both Arabidopsis thaliana and Oryza sativa promoters. We identified TC-elements within non-TA-rich promoters 30 bases upstream of the TSS. As with the TATA-box and TATA-variant sequences, it was possible to construct a unique distance graph with the TC-element sequences. The structural and functional features of TC-element-containing genes were distinct from those of TATA-box- or TATA-variant-containing genes. Arabidopsis thaliana transcriptome analysis revealed that TATA-box-containing genes were generally those showing relatively high levels of expression and that TC-element-containing genes were generally those expressed in specific conditions. Conclusions Our observations suggest that the TC-elements might constitute a class of novel regulatory elements participating towards the complex modulation of gene expression in plants.


Background
Over the past genomics has markedly changed our view on core-promoter organization [1]. For instance, the TATA-box can no longer be regarded as a general feature of polymerase II core-promoters [2]. Indeed, only a small fraction of eukaryotic genes actually harbour a TATA-box: less than 20% of genes in both human [3] and yeast [4]. While TATA-box-containing promoters support the direct binding of TATA-box-Binding Proteins (TBPs) and thereby also the formation of the preinitiation complex, TATA-less promoters are recognized by multiple TBP-related proteins and other TBP-associated Transcription Factors (TFs) involved in the recruitment of TBP [5]. Indeed some TATA-variants and other alternative elements allow the initiation of transcription and participate towards defining distinct patterns of expression [4,[6][7][8][9].
Several core-promoter elements or general Transcription Factor Binding Sites (TFBSs) have been previously identified in eukaryotes. They are characterized by a strong positional preference relative to the Transcriptional Start Site (TSS) as for instance the TATA-box in the [-30, -25] area [8], the Initiator element, (Inr), around the TSS [10], the downstream promoter element in the [+28, +33] area [11], or the IIB recognition element immediately upstream of certain TATA-boxes [12]. The position of binding sites of proteins belonging to the transcription complex is important for the functioning of promoters since it determines both the TSS location [13] and the transcription direction [5]. Thus, a strong positional conservation of a novel regulatory element would strongly indicate a functional role. This concept has led to a generation of tools that in contrast to previous TFBS predictors [14,15] are based on the positional densities of oligonucleotides rather than on their frequency of occurrence. These tools have been used to characterize core-promoter elements in several model genomes including plants [16][17][18][19][20][21][22].
All together, the core-promoter elements listed above seem unable to account for the transcription of all the RNA-polymerase-II transcribed genes. Less conserved core-promoter elements present in small gene sets have been described in previous studies at the gene level. For instance, in the human cytosolic phospholipase A2alpha gene, an AAGGAG motif in the [-35, -30] area binds TBP and is critical for basal transcriptional activity [23]. In other studies, a TBP has been shown to bind to a TAAGAGA element in the [-23, -17] region of the hepatitis B virus S gene [24]. These experimental observations suggest that core-promoter elements specific to small sets of genes remain to be disclosed. A study of large-scale structural properties of DNA in promoters indicated that the instability of DNA around -30 relative to the TSS necessary for transcription may be due to as yet unidentified motifs other than the TATA-box [25].
It is therefore clear from and despite this growing amount of data that the code embedded within corepromoter sequences has not yet been fully deciphered. In this work, we used an in silico hypothesis-driven approach to predict novel elements potentially recognized by the transcriptional complex. We explored the bioinformatics-based evidence that sequences other than the TATA-box and TATA-variants but located in the same region relative to the TSS may be functional corepromoter elements. We therefore searched for short sequences exhibiting similar positional constraints to those of the TATA-box and identified pyrimidine-rich elements distinct from the pyrimidine tract [21,26] as candidate elements for about 18% of the plant genes. To determine their potential functional role, we investigated any association between such identified TC-elements and specific features of the genes containing them, as has been previously shown for the TATA-box.

Results
Less than 39% of A. thaliana promoters contain a TATAbox or a TATA-variant Our approach was based on three steps. First, we searched for all 6 base long motifs with a statistically significant preferential position within the 300 nucleotides upstream of the TSS and called these motifs the Preferentially Located Motifs (PLMs). This method was first described by FitzGerald et al. [27] for the analysis of human promoters and was then applied by others to plant promoters [19,21]. Our results correlated well with previous studies in terms of the spatial representation of the PLMs within the plant promoters. Second, for each PLM, we precisely defined: (i) the preferential position relative to the TSS, i.e. the top of the peak, (ii) the functional window, i.e. the peak width, derived from the peak boundaries, (iii) the Score of Maximal Square relative to the base line (SMS, see Methods section) ranking the PLMs in terms of their topological constraint and (iv) the list of genes containing the studied PLM within the functional window. Third, to increase the chance of identifying functional PLMs, we searched for their conservation in 14927 A. thaliana genes and 18012 O. sativa genes with experimentally supported TSSs. A well-annotated genomic sequence is available for both of these species that diverged approximately 150 million years ago [28,29]. As previously described within the 50 nucleotides upstream of the TSS of both A. thaliana [19,21] and O. sativa [26] we found PLMs made up exclusively or almost exclusively of: (i) T and A nucleotides and exhibiting strong topological constraints, i.e. a sharp peak, and (ii) T and C nucleotides and exhibiting low topological constraints, i.e. a wide peak. Among the T and A rich motifs, we found the canonical TATA-box defined by the TATAWA consensus (W for A or T) and TATA-box variants [8]. We wondered whether by analysing the whole promoter set, other motifs with the same strong topological constraints could be missed. It is reasonable to predict that the presence of a PLM characterized by a wide functional window overlapping the TATA-box expected area might hide the strong topological constraints of a PLM specific to a small promoter set. To address this issue, we built up promoter sets by successively subtracting from the whole set of promoters, those characterized by different classes of PLMs, as described below. We firstly considered the conservation of PLMs at the genome level and then their conservation at the orthologues level.
TATAWA is a particularly well-conserved PLM since it is found in the same promoter region in both plants and animals [30]. Confirming previous results, we found this PLM in both A. thaliana and O. sativa genomes in a preferential position 32 bases upstream of the TSS and strictly located within the [-39, -26] region ( Figure  1). A total of 2606 (17.5%) and 2601 (14.4%) promoters in A. thaliana and O. sativa respectively contained a TATAWA within the [-39, -26] functional window. In contrast to the motif counting method highly prone to false positives, our method may slightly underestimate the number of functional motifs. Indeed, as our aim was to characterize and compare different sets of genes each containing defined PLMs (TATA-box or other) within their promoters, we chose only "clean" sets, i.e. those containing only one regulatory element to the detriment of completeness.
Several T and A rich motifs partially matching the TATAWA sequence have been shown at the TATA-box expected position [31,32]. We observed PLMs with the same positional constraints as TATAWA, i.e. showing a sharp peak in the [-39, -26] region. These PLMs may be (i) functional motifs hereafter named as TATAΔ-PLMs where Δ stands for variant; or (ii) motifs being PLMs due to their overlap with the TATAWA sequence. This overlap represents an inherent drawback of the method since motifs shorter than 6 nucleotides (TAT for instance) as well as 6 base long motifs (NNTATA for instance) may appear as PLMs. TATA-variant and TATA-box are distinct regulatory elements that need differentiating [33]. For this reason, we searched for TATAΔ-PLMs in sets of promoters not containing the canonical TATA-box in the [-39, -26] region. In this way, we were able to ascertain that the PLMs found within the promoter sets were bona fide elements under topological constraint. In A. thaliana and O. sativa respectively 12321 (82.5%) and 15411 (85.6%) promoters were found without a TATAWA in the [-39, -26] region. Relative to the TATAWA sequence, 32 possible motifs diverged at one position. Out of these 32 motifs, only 11 were found to represent PLMs (TATAΔ1-PLMs) conserved in both species (Figure 2, column 2). Consistent with in vivo experimental results [34], T to A or A to T substitutions were the main differences observed among the TATAΔ1-PLMs. Motifs with a base C or G substitution at the central positions were not considered as PLMs and were counter-selected. Motifs with a C or G substitution at the T1 or A6 positions were, however, considered as TATAΔ1-PLMs. There   thaliana promoters. The TATA-box PLM sequence (first column), TATAΔ1-PLM sequences (second column) and TATAΔ2-PLM sequences (third column) are organized in an oriented graph. In red the motifs observed in more than 350 promoters; in orange the motifs observed in more than 200 promoters and in black the motifs observed in up to 200 sequences. Each edge between two PLMs reflects the presence of one substitution leading from one PLM to another one. Numbers of A. thaliana promoters containing a given motif are indicated above the sequences. The TATAWA motifs are seeds (in bold).
The number of genes containing the TA-PLMs decreased from the TATA-box to the TATAΔ2-PLMs ( Figure 2). Interestingly, even though most of the TATAΔ-PLMs were found to be T and A rich as expected, 3 out of the 4 TATAΔ2-PLMs (CTTAAA, TACTTA and TCTTAA) contained a base C. This suggested that some T and C rich PLMs, hereafter referred to as TC-PLMs, could have been missed in the analysis of the whole promoter set. Indeed, it has been shown that, in plants, the region overlapping the TSS is characteristically rich in T and C nucleotides described as CpT microsatellites or TC-microsatellites [35,36]. We observed these microsatellites during the analyses of the whole promoter set within a number of different 6base-long PLMs. They are characterized by a wide functional window centred downstream of the TSS often extending several hundred bases. We sometimes observed a small secondary peak located around 30 bases upstream of the TSS which disturbed the broad distribution of some of the TC-microsatellites (Figure 3. A). These TC-microsatellites may hide TC [-39,-26] -PLMs, i.e. TC-PLMs with high topological constraints similar to those of TA-PLMs present within specific sub-sets of genes.
Conserved TC [-39,-26] -elements are observed in almost 18% of A. thaliana promoters We hypothesised that TC [-39,-26] -PLMs with narrow functional windows could be functional alternatives to TA-PLMs. This being the case we expected to be able to detect these PLMs more easily in promoter sets within which they predominate or within promoters not containing TA-PLMs. We therefore selected a TA-less promoter set by excluding the TA-PLM-containing promoters from the whole promoter dataset. Out of the 16 possible dinucleotides, only 3 were found to be conserved PLMs and to shared the TATA-box topological constraints: TA, AT and AA. A total of 1745 A. thaliana promoters (11.7%) and 5089 O. sativa promoters (28.2%) contained no TATA-box or conserved TATAΔ-PLM or the 3 conserved dinucleotide-PLMs.
We examined the possibility of these TA-less promoter sets representing sets of poorly annotated promoters due to an incorrect prediction of the TSS position. In both TA-less promoter sets, three arguments supported the validity of the TSS position prediction. First, the observed TC [-39,-26] -PLMs were conserved between A. thaliana and O. sativa and in both plants they exhibited the same strict topological constraints (see below). Second, the distributions of several motifs known as TFBSs supported by experimental analyses [37,38], such as the Inr or the GGCCC element, shared the same topological constraints in both the whole and the TA-less promoter sets (data not shown). Third, as expected, the GC-compositional strand bias or GC-skew expected in plant promoters was observed at the TSS location in both promoter sets (data not shown).
We therefore searched in the TA-less promoter sets for the presence of 6-base-long conserved PLMs. Out of the 4096 possible motifs, we found 29 conserved PLMs exhibiting a strict functional window in the [-39, -26] area relative to the TSS. These PLMs were present in 2645 A. thaliana (17.7%) and in 2331 O. sativa (12.9%) promoters. In agreement with our hypothesis, the 29 PLMs were TC [-39,-26] -PLMs, i.e. were comprised of T and C bases only. Figure 3 illustrates how the TC [-39,-26] -PLMs can be clearly distinguished in the TA-less promoter set while they were missed in the whole promoter set. Indeed, using the whole promoter set, it is only possible to find the TTCTTC-PLM, which is characterized by a wide functional window and by a preferential position 29 bases downstream of the TSS ( Figure  3.A). On the contrary, using the TA-less promoter set, we observed both the wide functional window PLM and a distinct TTCTTC-PLM characterized by a sharp peak centred 33 bases upstream of the TSS, i.e. sharing the TA-motif topological constraints (Figure 3.B). We propose that TA-PLMs may be recognized and thus be functional in a T and C rich environment whereas the . The y-axis percentage depends on the promoter number in each data set. TTCTTC is preferentially positioned at +29 (SMS 10) and -33 (SMS 8) in the whole promoter set and in the TA-less promoter set respectively. In both cases, the size of the window used to scan the promoters was 2 bases. same might not be true for TC [-39,-26] -PLM. As a consequence, large pyrimidine-rich regions could have been preferentially maintained during evolution in promoters with a TA-PLM whereas they may have been counterselected, at least upstream of the TSS, in promoters with a functional TC [-39,-26] -PLM. It is important to note that only TC [-39,-26] -PLMs exhibited the TATA-box topological constraints in the TA-less promoter set. Altogether, our results (i) confirmed the importance of TC-microsatellites (TC {n} or TTC {n} for instance - Figure  3.A) in plant promoters and (ii) predicted the existence of a novel class of functional elements, the TC [-39,-26] -PLMs characterized by a sharp peak in the [-39, -26] region and observed in almost 18% of A. thaliana promoters.
We investigated the putative presence of TC [-39,-26] -PLMs in other eukaryotic genomes. We analyzed 15802 Homo sapiens promoters and 15833 Mus musculus promoters with an experimentally supported TSS [39]. We observed neither TC [-39,-26]  Concerning the TA-PLMs, the graph root or seed is the TATAWA-PLM. For a given TA-PLM, the number of promoters containing it depends on both the distance from the seed, i.e. the number of substitutions between the given PLM and the seed-PLM and the existence or not of a more divergent sequence. We applied the same approach to the conserved TC [-39,-26] -PLMs and found that they could be organized into a unique, closed and oriented graph (see Methods section). Three seeds were suggested: TCTTCT, TTTCTT and TTCTTC ( Figure 4). Similar to that observed for TA-PLMs, these three TC [-39,-26] -PLMs were those most frequently observed in promoters with the number of promoters containing the other TC [-39,-26] -PLMs depending on both the distance from these seeds and the existence or not of a more divergent sequence. In conclusion, all [-39,-26] -PLMs can be detected independently in different promoters and may be organized into different groups with apparent evolutionary links between the given PLM and the related seed. Gene Ontology (GO) annotations [41] have been used previously in the prediction of the functional role of  TFBS [42]. Here we analysed the GO annotations of the four classes of genes defined above, according to the four promoter sets. As far as biological processes are concerned, the most conspicuous observation was a preferred association between genes containing a TATAbox and responses to different stimuli (Table 1, Biological Process). This result was expected since in human and in yeast the TATA-box-containing genes are more frequently involved in specific biological processes [4,9]. Interestingly, we also found that TC [-39,-26] -PLM-containing genes are more frequently involved in protein metabolism than any other gene class (Table 1, Biological Process), i.e. a basic biological process. Neither the TATAΔ-containing genes nor the [-39,-26] -PLM-less genes showed any significant bias for any biological process categories. We also analysed the partitioning of genes between the different GO cellular components. Again, the TATA-box-containing genes presented the most biased partitioning. The products of these genes are more often constituents of the cell wall, the cytosol and the ribosomes and less frequently constituents of the chloroplasts, the plastids, the Golgi apparatus and the mitochondria compared to the products of all the other genes (Table 1, Cellular Components). Products of the TC [-39,-26] -PLM-containing genes exhibited little biased partitioning between the GO cellular component categories. Indeed, only the cell wall component was significantly less represented. It seems biologically sound to have found a positive correlation between the response to different stimuli and the cell wall GO categories in the TATA-box-containing genes and a negative correlation between the protein metabolism and the cell wall categories for TC [-39,-26] -PLM-containing genes. These observations suggest that TC [-39,-26] -PLMs might have a functional role in transcriptional control that could frequently differ from and perhaps even oppose the TATA-box functional role. It is of particular interest that our observations support the hypothesis that small variations in the TATA-box sequence are linked to large changes in gene expression [6,43]. Unfortunately, due to the relatively small number of promoters containing a unique TC [-39,-26] -PLM, it was deemed impossible to perform the same comparisons between the TC [-39,-26] -PLM seeds alone and their predicted variants as that performed for the TA-motifs. It should be noted that by analysing all the genes containing a TC [-39,-26] -PLM alone we decreased the power of our statistical tests. We were however able to demonstrate a significant difference between the TATA-box and the TC [-39,-26] -PLM sets concerning protein metabolism and cell wall categories (p. values of 3e -5 and 1e -9 respectively).
Only the canonical TATA-box is spatially linked to the Initiator element Recent progress in mammalian genomics has defined a more functional image of core-promoters. Two functional categories of core-promoters have been described [44]. The single peak promoters have a tightly defined TSS position within a few base pairs while in broad peak promoters several TSSs may be found within small clusters spanning tens of base pairs. A TATA-box is more likely to be found within single peak promoters than in broad peak promoters [44]. In genome-wide studies, only the most upstream TSSs are often considered and are defined by the available transcript sequences. There are three main categories of core-promoters described in human [9], i.e. TATA-box-containing promoters with an Inr, TATA-box-containing promoters without Inr and TATA-less promoters. We observed the same relative representation of these three categories in Arabidopsis. Consistent with mammalian studies [44], recent evidence suggests that A. thaliana genes containing a TATA-box tend to have a sharper dominant peak of TSS compared to that of other genes [45].
The distance between TATA-box and Inr is important for accurate transcription initiation [46]. We considered the distance between the Inr [21], called here YR-TSS, and the conserved PLMs characterized using our approach. As expected, the YR-TSS is a conserved PLM whose functional window is one base upstream of the TSS in both plants (data not shown). We first analysed the 1496 A. thaliana promoters containing a TATA-box and we represented the distances between each TATAWA in the [-39, -26] region and each dinucleotide CA, TA, TG and CG up to 70 bases downstream of the TATAWA (Figure 6.A). The dinucleotide CA, but not TA, TG or CG showed a strong preferred distance of 30 to 33 bases from TATAWA. This distance is consistent with the preferential position of both PLMs and with mammalian results [47]. Furthermore, none of the YR motifs showed a preferential distance either from TATAΔ-or from TC [-39,-26] -PLMs ( Figure 6.B and 6.C). Finally, our results provide evidence of a link between the canonical TATA-box and the CA-TSS: 30 to 33 bases preferentially separate the CA from the first T of TATAWA.
Genes containing the TC [-39,-26] -element are generally long In human, the presence of a TATA-box in promoters is often associated with a compact gene structure [6]. In A. thaliana, we observed the same bias towards a more compact structure of TATA-box-containing genes ( Table 2). In contrast, the TC [-39,-26] -PLM-containing genes and the [-39,-26] -PLM-less genes were overall relatively longer while the gene structure of TATAΔ-PLMcontaining genes showed no bias. When present, the bias in gene size was mainly explained by the 5'UTR and by the length of the coding sequence. However, to a lesser extent, the enrichment in compact genes amongst those containing a TATA-box may be explained by the shorter cumulative length of introns, and by the higher percentage of intronless genes ( Table 2).
Based on the counting of Expressed Sequence Tags (ESTs) [4] or on microarray data [6], gene size and gene expression levels have been shown previously to be inversely correlated in both human and yeast. This being the case, our observations could be due to the fact that, in general, TATA-box genes have higher expression levels than other A. thaliana genes [19,21]. The differences in gene sizes we observed are consistent with a lower mean expression of TATAΔ-PLM-than of TATA-box-containing genes. Both TC [-39,-26] -PLM-containing and [-39,-26] -PLM-less genes could be expressed at lower levels. TATA-box-containing genes have been shown to generally display relatively high specificity of expression compared to housekeeping genes that frequently show high expression levels ( Table 1). Therefore, the question remains as to understand whether the presence of regulatory elements in the [-39, -26] region of promoters is linked directly to gene function per se or to the transcription level that in turn is associated with gene function. To resolve this we analysed a large set of transcriptome data, distinguishing the two different components of transcription: specificity and level.

Specific role played by each class of regulatory element found in the [-39, -26] region on different components of transcription
In eukaryotic genomes, various TFBSs have been linked to specific gene expression patterns [4,6,9,48]. We   [49] and available through the Complete Arabidopsis Transcriptome database: CATdb [50]. The dataset included 1044 hybridizations based on 522 different samples covering numerous developmental stages, biotic and abiotic stresses and mutants. We used normalized data on which positive hybridizations have previously been determined (see Methods section in Aubourg et al., [51]). For any one gene, the relative number of positive hybridizations was considered as a measure of the range of expression, i.e. of the specificity. To measure the global level of expression we computed the median of the signal intensities. The relationship between the median signal intensity and the percentage of hybridizations was clearly not linear and suggested the existence of different classes of genes with respect to transcriptional regulation. We thus clustered the transcriptome data into four classes ( Figure 7 and Table 3, first rows). First, genes were separated in two classes depending on their expression level above or below the distribution model line, respectively HE for those with High levels of Expression and LE for those with Low levels of Expression. Second, two classes were defined relating to the specificity of hybridizations and delimited by the two inflections in the cloud of points. The first class clustered genes positively hybridized in less than 15% of microarray hybridizations (SR for Small Range) and the second class clustered those with more than 85% of hybridizations (WR for Wide Range). Interestingly, each of these four transcriptional clusters was predominantly made up of genes containing one of a specific class of [-39,-26] -PLMs. Thus, (i) genes within the transcriptional SR group, i.e. those that hybridized in specific conditions predominantly Structural gene features have been assigned by querying the FLAGdb ++ database [59]. For median length data, we performed two one-sided Wilcoxon tests allowing the identification of enrichment in wide (bold) or in compact gene structures (underlined) in a set of genes compared with all the other genes, i.e. genes within the whole gene set minus genes within the considered gene set. For intron-less gene percentages, we performed two one-sided Fisher exact tests allowing the identification of higher (bold) or lower (underlined) percentages in a gene set in comparison with all the other genes. NS indicates a non-significant difference. P-values in parenthesis are less than 5% with the Bonferroni correction. Both the first intron and 3'UTR lengths are never biased (data not shown).  (Table 3, last two rows). The most Highly Expressed genes (HE+ genes) were predominantly those containing TATAWA whereas the most Lowly Expressed genes (LE+ genes) were mainly [-39,-26] -PLM-less genes. We propose that in the presence of a TATA-box, the recognition of the TSS depends directly on a small number of TBPs and/or of TBP-related protein(s). The relaxed recognition by TBPs in TATAΔ-containing promoters might be responsible for a bias towards the large range of expression. In the absence of TATA-box or TATAΔ, the recruitment of the TBP might be mediated by different TBP-associated TFs depending on the gene. Interaction of the different TBP-associated TFs with other specific factors might therefore explain the relatively higher specificity of transcription observed for the TC [-39,-26] -PLM gene set.

The class of regulatory element in the [-39, -26] region is not conserved in higher plant orthologous gene pairs
With the increasing availability of transcriptome data, the search for conserved TFBSs within promoters of genes exhibiting similar transcriptional patterns has gathered much attention. Surprisingly, the conservation of corepromoter elements, widely considered as being those most conserved, has not received such large interest. One recent study addressing this question did cast doubt on this widely accepted notion [52]. Concordantly, in several studies on orthologous relationships between large gene families, we observed no significant conservation of the presence of a TATA-box between plant orthologues (data not shown). Even single copy genes present as orthologues in both A. thaliana and O. sativa are no more conserved than random pairs of genes [53]. The co-occurrence of different elements in the [-39, -26] promoter region could, at least in part, explain an apparent non-conservation at the gene level despite the observed conservation at the genomic one. This prompted us to reexamine the question comparing core-promoter elements in the [-39, -26] region in A. thaliana and O. sativa. We analysed the conservation of the three classes of [-39,-26] -PLMs in 5805 pairs of orthologous genes in A. thaliana and O. sativa characterized by an experimentally supported TSS. While the conservation of TATAWA appeared relatively low (17%), it was significantly higher than that expected by chance (6%) ( Table 4). This was not the case for TATAΔ-or TC [-39,-26] -PLMs, which exhibited no conservation between orthologous genes (Table 4). Our results showing that the TATA-box is more involved at the transcriptional level and that both TATAΔ-and TC [-39,-26] -PLMs are more involved in specificity are in line with finding. All together our observations are in accordance with the fact that gene transcription levels are positively correlated between orthologous genes in A. thaliana and O. sativa The 8 expression categories are described in the Figure 7. HE, High Expression; LE, Low Expression; SR Small Range of expression; WR, Wide Range of expression; HE + Highest Expressions; LE + , Lowest Expressions. The whole gene set contains the 11161 genes for which we have both a known TSS and CATMA-expression data. We performed two one-sided Fisher exact tests allowing the identification of higher (bold) or lower (underlined) percentages in a gene set in comparison with all other genes, i.e. genes within the whole gene set minus genes within the considered gene set. NS indicates non-significant difference. P-values in parenthesis are less than 5% with the Bonferroni correction. [53] and that the correlation between transcriptional range and intensity is rather weak [51]. Nevertheless, TATAΔ-and TC [-39,-26] -PLMs are conserved, both at the sequence level per se and at the occurrence level within the whole genome.

Discussion and conclusions
In A. thaliana and O. sativa, we identified a novel class of putative TFBS involving TC-elements that are preferentially located within the same core-promoter region as TATA-boxes. We showed that these TC-elements are structurally distinct from the previously described TC-microsatellites observed in the 5'UTR of plant genes. Nevertheless, the presence of both TC-elements and TC-microsatellites in higher plants but not in vertebrates suggests an evolutionary link between these two promoter elements.
Previous reports have described the presence of a pyrimidine rich element, named the Y-patch, in plant corepromoters [21,26], and its tendency to be associated with both the TATA-box and the Inr motif [45]. Ypatch motifs and TC-microsatellites are two names used to globally describe T and C rich sequences widely surrounding the TSS. In promoters, these two elements are characterized by two non identical but overlapping groups of motifs. Our results show that the TC [-39,-26] -PLM exhibits specific characteristics distinguishing it from the other Y-patch motifs. Y-patch motifs show frequent occurrence in plant promoters, are present in a wide area around the TSS (see Figure 1) and may be extended from a 6-to a 10-base-long element without decreasing the score associated to their local overrepresentation (SMS). Indeed the TC [-39,-26] -PLMs were only observed in a sub-set of 18% of A. thaliana promoters at the TATA-box expected position, were associated with a sharp functional window, were 6-base-long and could not be extended without decreasing their SMS.
Specific promoter features were frequently observed when genes were classified into groups, i.e. the TC-element-containing gene group, the TATA-box-containing gene group or the TATAΔ-PLMs-containing gene group. First, our data indicate that the TATA-box preferentially associates with a CA-TSS. Neither the TATAΔ-nor the TC-elements exhibited this apparent functional association. In addition, no association between the TATA-box and any of the three other YR-TSS was observed. Second, TC-element-containing genes were predominately large and involved in protein metabolism while TATA-box-containing genes were predominantly compact and involved in response to stress and stimulus. Indeed, gene function, specificity and level of expression are linked features. Using an original approach we have been able to distinguish the possible involvement of different [-39,-26] -elements in the control of either the level or the specificity of expression. A global analysis of CATMA-transcriptome data indicated preferential links between the presence of: (i) a TATA-box and high gene expression; (ii) TC-elements and high specificity of gene expression; and (iii) TATAΔ-PLMs and broad expression patterns, as in housekeeping genes. All together our observations suggest that TC-elements might be considered as a novel class of plant promoter elements linked directly or indirectly to the regulation of gene expression.
The presence of all three elements, i.e. TATA-boxes, TC-and TATAΔ-elements, have been observed in two plants that diverged about 150 millions of year ago [29]. We observed a global conservation, i.e. conservation in the relative number of genes containing one of the three motifs in the two genomes. Nevertheless, only the TATA-box showed a low level of conservation between orthologous gene pairs while conservation of either TCor TATAΔ-elements between orthologous gene pairs was no higher than that found between any gene pair. On the one hand, the low level of conservation at the orthologous gene level is in line with the rapid evolution of core regulatory motifs after gene duplication in A. thaliana [43] and with the notion that A. thaliana and O. sativa may have independently evolved novel TSSs [54]. On the other hand, the conservation of the relative number of genes with either a TATA-box, a TC-or a TATAΔ-element suggests a global conservation of the interspecies variability, noise level and evolvability of gene expression; three processes involving TATA In the 5805 pairs of orthologous genes between A. thaliana and O. sativa, we searched for the number of genes containing a given [-39,-26]  sequences [55][56][57]. The three classes of core-promoter elements observed in the [-39, -26] region are each present in about 20% of the promoters. In this study, we used "clean" classes of promoters, i.e. those containing only one of the three classes of elements. However in several other promoters, more than one element from a given class may be present. Therefore many functional combinations are expected. Experimental data have shown that the TATA-box is involved in the direct binding of TBP [5] and at least some TC-and TATAΔelements might be involved in the recognition by other TFs recruiting TBP to form the transcription initiation complex at the right position. Consistent with this notion, both TC-and TATAΔ-elements are linked to the specificity of expression that could be mediated through different TFs under the control of various signals. An alternative would be that TATA-boxes, TATAΔ-elements and TC-elements are all responsible for DNA instability around -30 relative to the TSS but at different levels [25] promoting a continuous range of control on gene transcription. Either way, we believe that our observations call for further biological analysis and that our predictions will be useful for future assessment of the relationships between promoter architecture and gene expression. In this respect, future studies in model plants should be based on a better description of the different organization of core-promoters and include a more precise definition of the location of alternative TSSs based on new sequencing technologies.

Promoter sets
Arabidopsis thaliana and Oryza sativa promoters were built from transcripts, Expressed Sequence Tags and full-length cDNA (The Arabidopsis Information Resource -TAIR R.6 [41] and The Institute for Genomic Research -TIGR R.3 [58] respectively). We used FLAGdb ++ [59], an integrative database of plant model genomes, to define the transcriptional units by aligning all the available transcripts to gene models, excluding pseudogenes. We excluded promoters with a 5'UTR smaller than 50 bases in order to avoid truncated UTRs. Our promoter sets included 14927 A. thaliana and 18012 O. sativa sequences extending 1 kb upstream of the predicted TSS and containing the whole 5'UTR. The 15802 Homo sapiens and the 15833 Mus musculus promoters were retrieved from the DBTSS [39]. To maintain a consistent approach to that used to construct the plant promoter set, we selected only the TSSs located the most upstream of genes.

Identification of Preferentially Located Motifs
For each motif analysed, we extracted all occurrences within a promoter set. The motif position corresponds to the position of the first base of the sequence. Motif distributions were determined using a one-base-long sliding-window with a one-base shift. For each window, we counted the number of promoters containing the motif rather than the number of occurrences to avoid favouring repeated motifs. The promoter sequences were then divided into two regions. First, the [-1000, -300] region was used to learn the distribution model using a simple linear regression and determine a 99% confidence interval. Second, within the [-300, 500] region, we searched for non-evenly distributed motifs, i.e. those exhibiting a peak above the confidence interval. We increased the sliding-window size from 1 to 100 bases step-by-step when a second peak was detected or when the learning area contained at least one window within which no motif was detected, in order to avoid non-accurate learning of the distribution model. For each PLM we recorded: (i) the motif distribution, (ii) the window size, (iii) the preferential position, i.e. the position of the peak top in the distribution, (iv) the peak boundaries allowing the functional window to be located, (v) the list of promoters containing the motif and (vi) the Score of Maximal Square relative to the base line or SMS, i.e. the ratio (peak top minus base line)/(upper bound of the confidence interval minus base line). We only considered PLMs characterized by an SMS greater than 1.

Sequence distance graph and seed motifs
Out of all [-39,-26] -PLMs, we searched for those differing by a single substitution and represented the links by means of a graph. For both TA-PLM and TC-PLM graphs we identified the seed(s), i.e. PLM(s) only connected in the graph to PLMs less present in promoters than itself. The graph is thus oriented relative to the seed(s). The TA-PLM graph comprises the TATA-box and all the 15 TATAΔ-PLMs. The TC [-39,-26] -PLM graph was first constructed with the PLMs observed in more than 200 promoters. Then, remaining PLMs were added considering only links with the PLMs making up the graph and not those with the remaining PLMs.

Extended motif analyses
PLM relevance is ranked by the SMS value. For each PLM, we analyzed the SMS of all motifs extended by one base upstream or downstream of the initial PLM sequence. We referred to these motifs as extendedmotifs. Selected extended-motifs were only those that (i) were a PLM, (ii) shared the initial PLM constraints, i.e. a sharp peak in the [-39, -26] region and (iii) were characterized by an SMS 1.1-fold higher than that of the initial PLM, i.e. exhibited a stronger topological constraint.

Gene structure and gene function
All information about gene structure was obtained from FLAGdb ++ [59] including: (i) the median of the lengths of frames, coding sequences (CDSs), 3'UTRs and 5'UTRs, first introns and cumulated introns; and (ii) the percentage of intron-less genes.
Information about gene function was obtained from the TAIR Gene Ontology (GO) categories [41,60]. We used the GO Biological Function and GO Cellular Component categories. For a given gene set, we calculated the percentage of genes within that category relative to the remaining A. thaliana genes with an experimentally supported TSS. Finally, we only considered previously annotated genes, i.e. we excluded the "unknown" and the "other" categories.

Gene expression
The transcriptome data used in this work were obtained using the CATMA v2 microarray [61]. They include 522 hybridized samples extracted from 40 different projects covering 12 organ types. Expression data were downloaded from CATdb [50] containing all the transcriptome data generated by the CATMA-URGV platform. They had also been deposited in either the NCBI Gene Expression Omnibus [62] or the European Bioinformatics Institute ArrayExpress [63] repositories. Out of the 14927 A. thaliana genes with an experimentally supported TSS, we analyzed the expression data of 11161 genes specifically relating to one CATMA probe. Each gene was spotted according to (i) the percentage of samples in which it had been detected, called the hybridization percentage and (ii) its median expression level. We performed a third order linear regression allowing us to cluster 4371 genes with a High level of Expression (HE) and 6790 with a Low level of Expression level (LE). Different hybridization ranges were also defined according to two inflections in the linear regression curve. The Small Range (SR) class included 4510 genes characterized by a hybridisation percentage of less than 15% and the Wide Range (WR) class included 1241 genes characterized by a hybridisation percentage of over 85%. Second, we built a 60% confidence interval allowing the identification, within each class, of the highest and the lowest expression levels (respectively HE+ and LE+).

Orthologous genes
We selected those A. thaliana and O. sativa orthologous gene pairs being Bidirectional Best Hits [64] with BLASTp [65] resulting in 5805 orthologous pairs of genes having an experimentally supported TSS. The presence of a PLM in both genomes led to the distinction between the presence of this PLM within (i) A. thaliana promoters but not in their respective O. sativa orthologues; (b) O. sativa promoters but not in their respective A. thaliana orthologues; or (c) both orthologous genes. For each PLM-class, we performed comparisons between expected and observed conservation within orthologous gene pairs. The c/(a+c) ratio indicates the level of observed conservation in O. sativa with respect to A. thaliana. The expected conservation value is given by the ratio b/(5805-a-c), i.e. the presence of a given PLM-class in O. sativa orthologous genes from the PLM-class-less A. thaliana genes.

Statistical analysis
Statistical analyses were performed with the R statistical software [66]. We used R for (i) the regression analysis leading to PLM identification ( Figure 1) and the characterization of expression categories (Figure 7), and for (ii) motif distributions (Figures 3 and 6). We performed Fisher exact one-sided tests using the Bonferroni correction to compare percentages between two independent samples. We searched for statistical decreases or increases in the percentages of one set of genes (i) in GO annotation categories (Table 1), (ii) in intron-less ( Table 2, last row), (iii) in expression categories (Table  3) and (iv) with a given motif (Table 4). Structural gene features, even after log modification, cannot be assumed to be normally distributed. We therefore performed Wilcoxon-Mann-Whitney one-sided tests using the Bonferroni correction to compare two independent structural gene feature distributions. We considered the hypothesis that structural data might result in higher or lower values for a given gene set compared to all other genes, i.e. the 14927 genes minus the considered gene set ( Table 2, first rows). Each p value less than 5% with the Bonferroni correction was considered significant.
Additional file 1: Effect on the score of an extension by pyrimidines of TC [-39,-26] -PLMs. First column, motifs that do not exhibit an increase in their SMS (in parentheses) when extended. Second column, motifs that, when extended exhibit an increase in their original SMS due to the generation of an other 6-base-long motif (underlined sequence) characterized by a higher SMS (underlined SMS). Note that the SMS of the 7-or 8-base-long motifs are all lower than the SMS of the underlined 6-base-long motifs. Third column, motifs that can be extended in 7-or 9base-long motifs. Note that the two long extensions are made of three repeats of TCT.