Skip to main content


The word landscape of the non-coding segments of the Arabidopsis thaliana genome



Genome sequences can be conceptualized as arrangements of motifs or words. The frequencies and positional distributions of these words within particular non-coding genomic segments provide important insights into how the words function in processes such as mRNA stability and regulation of gene expression.


Using an enumerative word discovery approach, we investigated the frequencies and positional distributions of all 65,536 different 8-letter words in the genome of Arabidopsis thaliana. Focusing on promoter regions, introns, and 3' and 5' untranslated regions (3'UTRs and 5'UTRs), we compared word frequencies in these segments to genome-wide frequencies. The statistically interesting words in each segment were clustered with similar words to generate motif logos. We investigated whether words were clustered at particular locations or were distributed randomly within each genomic segment, and we classified the words using gene expression information from public repositories. Finally, we investigated whether particular sets of words appeared together more frequently than others.


Our studies provide a detailed view of the word composition of several segments of the non-coding portion of the Arabidopsis genome. Each segment contains a unique word-based signature. The respective signatures consist of the sets of enriched words, 'unwords', and word pairs within a segment, as well as the preferential locations and functional classifications for the signature words. Additionally, the positional distributions of enriched words within the segments highlight possible functional elements, and the co-associations of words in promoter regions likely represent the formation of higher order regulatory modules. This work is an important step toward fully cataloguing the functional elements of the Arabidopsis genome.


All genomes are composed of nucleotides, which are represented abstractly as letters (Adenine (A), Guanine (G), Cytosine (C), and Thymine (T)). Strings of such letters can be conceptualized as words, which provide the blueprints for organisms. Each word is found a specific number of times in a particular genome. Note that the expected frequency of a word is inversely related to the word's length. Some nucleotides appear more frequently than others (e.g. A/T in Arabidopsis), giving each genome a distinct (G+C)% content and biasing expected word frequencies. Higher order frequencies (dinucleotide and trinucleotide) also show distinct biases beyond those expected for single nucleotide frequencies [1].

Distinct selective pressures shape words positioned in different genomic regions. For example, a word in an open reading frame (ORF) has a direct influence on the primary amino acid sequence of a protein and hence is under strong selective pressure. In contrast, words in introns are likely to be under more relaxed selective constraints, unless they are important for gene functions, for example by providing docking sites for splicing factors [2] or for enzymes involved in the post-transcriptional processing of a transcript [3, 4]. The gene sections corresponding to the 5' and 3' untranslated regions (5'UTRs and 3'UTRs, respectively) are also likely to be under less selective constraints than the ORFs, yet signatures of strong selection in UTRs have been described (reviewed in [5]). The constant formation of DNA microsatellites through slippage by the replication machinery, and the action of viruses and transposons, also complicate the word landscape, especially in regions with lower selective constraints (such as introns, UTRs and intergenic regions) [6, 7].

This manuscript describes the results of a genome-wide analysis to discover putative regulatory words. Within this context, we define the cis-regulatory apparatus as all the DNA segments that are located proximal to a gene, and that also contribute to the gene's expression. It is the function of transcription factors, miRNAs, or other molecules that interact with DNA, to interpret the words (sequence code) hardwired in the cis-regulatory apparatus and to 'execute' them, thereby generating signals to the basal transcription machinery that result in changes to the rate of RNA production by the corresponding DNA-dependent RNA polymerases. When located upstream of the transcription start site (TSS), the cis-regulatory apparatus is often referred to as the promoter of a gene.

Promoters are typically divided into three regions: core, proximal and distal. The core promoter, a region at location [+1;-100] relative to the TSS, performs a central role in the formation of pre-initiation transcriptional complexes. Immediately upstream of the core promoter is the proximal promoter, which is located at position [-101;-1000] relative to the TSS and serves as a docking site for transcription factors. The distal promoter is located at [-1001;-3000] relative to the TSS and contains the regulatory elements that are commonly known as enhancers and silencers. The participation of a particular DNA segment in the regulation of gene expression can only be demonstrated experimentally. Thus, understanding the rules at play in deciphering the transcriptional regulatory code remains one of the most significant challenges in biology today.

Although most regulatory elements are present in the UTRs and upstream regions, due to their proximity to the TSS, studies have shown the presence of regulatory elements in introns, and, to a much lesser extent, in coding regions [2, 816]. Building on this knowledge, a segment-based analysis was performed that is focused on non-coding regions within the open reading frames (i.e., introns) and flanking non-coding regions (i.e., UTRs and upstream regions). The coding regions were omitted from this analysis because they are under other selection pressures corresponding to the amino acid sequences of the proteins they produce, and thus they are subjected to biases other than regulation.

Arabidopsis thaliana provides an ideal reference organism to investigate the word landscape of a plant genome, and to relate said landscape to important biological features. The Arabidopsis genome consists of 125 Mbp arranged into five chromosomes [17, 18]. The genome is well annotated and regions corresponding to introns, 3'UTRs, 5' UTRs, and intergenic genomic spaces are all available from The Arabidopsis Information Resource (TAIR, [19].

Many studies have characterized Arabidopsis DNA sequence motifs that participate in the regulation of particular genes (e.g., [2023]), and public databases such as AthaMap [24] and AGRIS [25] provide comprehensive collections of cis-regulatory elements likely to participate in the regulation of gene expression. However, a systematic analysis of all the words present in the Arabidopsis genome is still lacking.

To analyze the different segments of the Arabidopsis genome, an enumerative word discovery approach was used to detect statistically overrepresented words. Similar approaches have been successfully applied over the last decade in the area of motif discovery [2637]. In a 2005 study, Tompa et al. [38] showed that enumerative methods outperformed heuristic methods in many cases. They are particularly applicable in this research, because they allow the study of the entire 'word landscape' of a genomic data set.

Our approach scans the sequences and produces a set of words and word frequencies. This information is employed by a Markov model to compute expected word frequencies. Words with unexpectedly high frequencies are putative functional elements, and thus they are further characterized by comparing word frequencies and positions to gene induction or suppression using the method of Geisler et al. [39]. Additionally, clusters of similar words are formed and used to create motifs for putative transcription factor binding sites. Sequences that contain the same functional elements are grouped together into putative 'nodes' of regulatory networks. Words that co-occur often are identified as putative transcription factor binding modules.

Results and Discussion

Distribution of 8-letter words in the Arabidopsis genome

To determine the word distributions in the segments of the Arabidopsis thaliana genome that contribute to the cis-regulatory apparatus, a comprehensive analysis of 8-letter words in the entire genome was conducted and compared with segments corresponding to non-coding regions. Words of length 6-16 were examined and the complete results have been made available via AGRIS[25, 40]. This article reports findings for words of length eight because they correspond to the typical DNA sequence length recognized by transcription factors (usually 6-8 bp [38, 41]). Furthermore, 8-mers are long enough that there is enough diversity of word choices (~64,000) to reduce false positive results, while retaining sufficient word counts to be statistically informative.

The genome was sub-divided into segments comprising the 3' UTRs, 5'UTRs, promoters and introns (Table 1). The promoter segment was further dissected into the core promoter, corresponding to [-100; +1]; proximal promoter [-1000; -101]; and distal promoter [-3000; -1001]. The general properties of the six genome segments are shown in Table 1. As in a similar study, which was aimed at discovering regulatory elements involved in human DNA-repair pathways [26], word-based genomic signatures were created for each segment. Specifically, the following were identified for each of the genome segments: (1) the set of overrepresented words (signature words), (2) words missing from the sequences (unwords), (3) word-based clusters, (4) word co-occurrences and (5) functional categorizations of the signature words. The results are detailed in the remainder of this section.

Table 1 Segment characteristics for Arabidopsis thaliana

Overrepresented Words

All 8-letter words present in the segments were identified and scored using observed:expected frequency ratios (O/E). Specifically, each word was scored and ranked by using the function S* ln(S/E S ), where S is the number of sequences that contained the word, 'ln' is the natural logarithm, and E S is the number of sequences in which the word was expected to occur. Words discovered in the whole genome were analyzed using the O* ln(O/E O ) score, with O referring to the overall occurrence of a word across the entire genome and E O representing the expected occurrence of that word. The 25 top-ranked words, corresponding to ~0.04% of all possible words, which also corresponds to ~0.04% of the discovered words, were taken as an exemplary subset of the results and further examined (see Table 2, 3, 4, 5, 6, 7, &8 and Additional file 1, 2, 3, 4, 5, 6, &7).

Table 2 The top 25 words in 3'UTRs
Table 3 The top 25 words in 5'UTRs
Table 4 The top 25 words in Introns
Table 5 The top 25 words in Core Promoters
Table 6 The top 25 words in Proximal Promoters
Table 7 The top 25 words in Distal Promoters
Table 8 The top 25 words in the entire genome

A detailed analysis of the words identified a minimal overlap between the sets of overrepresented words for the different segments. Specifically, considering the list of top 25 words discovered in any of the six segments (and in the genome wide analysis), 175 words were unique to one specific set, 15 words occurred uniquely in two sets, 7 in three sets, 4 in four sets and none in five sets. Only two words (ATTTTTTA, and AATATATT) were shared in six out of seven sets (neither word was present in the 5'UTR set). Note that the word AATATATT has a significant similarity to the sequence of the TATA-box, a regulatory element that is (1) often found in core promoters and (2) known to contribute to the correct positioning of the core transcriptional machinery [42]. It is conceivable that the absence of AATATATT in the 5'UTR set prevents the initiation of transcription at incorrect sites.

The large differences between the various sets of words provide evidence for the existence of segment-specific signatures. Of additional interest is the uniqueness of the word-based genomic signatures in comparison to the signature for the entire Arabidopsis genome. Clearly, the segments' signatures distinguish them from each other and from the entire genome.

In addition to uniquely characterizing each segment, the top words discovered in each data set also have a strong probability of being functional regulatory elements. This argument was strengthened by a functional analysis, which is described later in this section.

Missing Words

Another interesting component of our word-based signature is the set of words NOT contained within the different segments (see Table 9, 10, 11, &12 and Additional file 8, 9, 10, &11), referred to as unwords [43] or nullomers [44, 45]. The absence of words in particular segments is an interesting phenomenon and may represent negative selection pressure or increased mutation rates specific to these words, or structural constraints on DNA [44]. Thus, the missing word sets, which show unwords and their associated scores, serve as important 'fingerprints' for the segments.

Table 9 Words not detected in the 3'UTRs
Table 10 Words not detected in the 5'UTRs
Table 11 Words not detected in the Introns
Table 12 Words not detected in the Core Promoters

Word-based Clusters

Any biologically required sequence experiences evolutionary pressure (in this case purifying selection) resulting in a narrowing of the range of allowable sequence mutations. Often, a word and various mutations of the word exhibit the same functionality. To incorporate this into our analysis, clusters were built around each of the top overrepresented words, forming groups of words that are similar to each 'seed word.' Word similarity was measured through the Hamming distance metric, which models single point mutations. A Hamming distance of 1 was used to form the clusters. Each cluster is depicted via a sequence logo, providing a visual motif of the characteristics of the cluster.

Selected clusters and the corresponding sequence logos are shown in Additional file 12. Two representative motifs are presented for each segment. Motifs for each segment were chosen in order to provide a variety of examples of putative binding sites for the non-coding segments.

The presented motifs correspond to well-known regulatory elements and complex motifs, which represent sets of putative regulatory elements. Of particular interest in Additional file 12 are the word-based clusters for the core promoters (in the left column) which correspond to the TATA-box. Also known as the Goldberg-Hogness box [46], the TATA-box is a well-characterized regulatory element appearing 31 bp upstream of the transcription start site in 30% of the promoter sequences in Arabidopsis [23]. The core promoters also contain another interesting motif, (CGACGTCG), which is involved in stress response in Arabidopsis thaliana [22]. An extensive functional characterization is described later in this section.

Word Location Distribution

The locations of a particular word within a segment can provide insight into functional properties of the word. For example, functional TATA motifs are located at a specific distance upstream of the TSS [23, 46]. We identified the segment-specific locations of the seed words of the clusters shown in Additional file 12. Being selected for their high complexity, these words are expected to exhibit a distribution bias, manifesting as peaks in the scatterplots of the distribution across sequences, as shown in Figures 1, 2, 3 and 4.

Figure 1

Word location distribution across introns. Word location distributions for interesting words within the introns. The occurrences are shown on a log-scale in order to allow a comparison between the different segments as well as the words visualized for the entire genome.

Figure 2

Word location distribution across core promoters. Word location distribution for interesting words within the core promoters. The occurrences are shown on a log-scale in order to allow a comparison between the different segments as well as the words visualized for the entire genome.

Figure 3

Word location distribution across proximal promoters. Word location distributions for interesting words within proximal promoters. The occurrences are shown on a log-scale in order to allow a comparison between the different segments as well as the words visualized for the entire genome.

Figure 4

Word location distribution across the entire genome. Word location distributions for interesting words within the genome. The occurrences are shown on a log-scale in order to allow a comparison between the different segments as well as the words visualized for the entire genome.

The Figures contain histograms showing the numbers of occurrences of specific words at each point along the sequences. For uniformity, sequence lengths are normalized to the range [1;100]. Strong peaks can indeed be found for the words selected in the intron, core promoter, and proximal promoter regions. The peaks detected for the intron segment are at both the 5' and 3' ends of the introns, which means that the words occur in close proximity to flanking exons. The close proximity to the intron-exon boundaries is expected for splicing regulatory sequences [2, 816]. The peaks exhibited in core and proximal promoters are not surprising. The distributions of words locations in these segments are expected to show clustering, due to positional conservation of locations of cis-regulatory elements [23]. Of particular interest is the location of the peak for the first word chosen for the core promoter distribution (TATAATA), the TATA-box. A location of around 31 bp upstream from the TSS corresponds to the findings in [23].

Interestingly, we also detect strong peaks for the example words chosen for the genome wide word landscape, possibly indicating an important chromosomal feature that is not yet understood.

Word Co-occurrences

Genes are usually controlled by a combination of multiple transcription factors, or by transcription factor complexes binding to different sites embedded in the genes' regulatory non-coding regions. In order to detect the interacting transcription factor binding sites of a complex, we examined the positional relationships of words. The top 25 overrepresented words were paired, and the overrepresentation of each pair was determined using a Markovian background model of order 6. The top 25 overrepresented word pairs for each segment are displayed in Table 13, 14, 15, 16, 17 and 18 (see also Additional files 13, 14, 15, 16, 17, &18). The limited overlap between the word pairs of different segments indicates additional unique word-based signatures for genomic segments.

Table 13 Co-occurrence in 3'UTRs
Table 14 Co-occurrence in 5'UTRs
Table 15 Co-occurrence in Introns
Table 16 Co-occurrence in Core Promoters
Table 17 Co-occurrence in Proximal Promoters
Table 18 Co-occurrence in Distal Promoters

AGRIS Lookup

The AGRIS database [25] contains a large collection of known regulatory elements for Arabidopsis thaliana. The words discovered in this study were compared to the regulatory elements of equal or lesser length in AGRIS. Table 19 provides the overview of the motifs and their locations within the results.

Table 19 AGRIS Lookup

Functional Categorizations of Words

In order to reveal biological meanings of overrepresented words, we established associations between the overrepresented words and biological functions of the genes that harbour a particular word in their corresponding segment (Table 1). For a single word, all the genes that contain that word in their selected segment were found and the corresponding overrepresented Gene Ontology (GO) terms were identified. Overrepresentation of a GO term is determined by using the Arabidopsis gene GO term distributions as a background model. The developmental and experimental parameters that perturb the expression of genes harbouring a particular word was determined by comparing the number of induced, suppressed or neutral genes, to that expected by chance in a collection of 1305 tissue and stress microarrays from the public domain. Significant enrichment or depletion of induced or suppressed genes has been shown to correlate strongly with factors affecting regulation of a cis-regulatory element [39].

As shown in Figures 5, 6, 7, 8, 9 and 10, we identified overrepresented functional categories (y-axis) of genes that carry a particular word (x-axis, top panel) in either their 3'UTR (Figure 5), 5'UTR (Figure 6), intron (Figure 7), or promoter regions (Core, Proximal and Distal Promoters, Figures 8, 9 and 10, respectively). The red squares depict overrepresented categories with lowest p-value, calculated for each segment separately, smaller than 10E-16. For example, the word GTTTTTGA was significantly enriched in the 3'UTRs of genes that participate in the GO category "Protein Synthesis" (including the sub-categories ribosome biogenesis, ribosomal proteins, translation), and is correlated with genes suppressed in flowers and early stage siliques (p-value 4E-14). Based on microarray expression of micro-dissected tissues (see methods), the word TGTTTTTT is present in the 3' UTR of genes induced in roots (p-value 1E-8), in the atrichoblast (hairless) cell files of the root (p-value 7E-25), the root cortex (p-value 2E-23), endodermis (p-value 2E-51), and lateral root cap (p-value 4E-20). The word CTCTCTTT, enriched in introns, was correlated with differential induction in cotyledons (p-value 8E-20), suppressed in young flowers, especially carpals (p-value 1E-14) and heart stage embryos (p-value 3E-20). Surprisingly, the presence of these words in the UTRs and introns were strongly correlated with tissue specific profiles, but were only weakly enriched or strongly depleted for responses by hormones, biotic and abiotic stresses. There was no significant correlation to any of the 1305 surveyed conditions if the words were located in the 1000 bp upstream or downstream regions. This is strikingly different to the well characterized abscisic acid responsive element (ABRE) (CACGTGTC) [22], which when found in the 1000 bp 5'upstream region, was strongly correlated to induction by 10 μM abscisic acid (ABA) (p-value 4E-49), cold, salt and drought stresses (p-values < 1E-40), in flowers (p-value 1E-31), and suppressed in roots (p-value 4E-7) but no significant correlations were observed when ABRE was present in the 3'UTRs, 5'UTRs or introns. We also analyzed primary promoter regions where most of the basal promoter elements are expected to be located. The frequency of words is calculated as described above, and genes that contain the high scoring word in their primary promoter region were queried for enriched biological function. For example, GCCCATTA is found in core promoter regions of genes preferably involved in ribosome biogenesis and translation. Genes with this word in the upstream promoter are significantly depleted for response to all hormones, biotic and abiotic stresses (typically p-value 1E-8 or better). In other words, genes harbouring this word in their upstream promoter region tend to be less responsive to stresses than randomly chosen genes. However, the word CTATAAAT was found in core promoter regions of genes preferably functioning as storage facilitating proteins (Figure 8). Genes with this word in the upstream promoter are rapidly induced by 10 nM brassinolide (p-value 1E-9) and by salt stress in roots (p-value 4E-9). These genes are also induced in roots, flowers, pollen, and during seed development, and strongly suppressed in young leaves and cotyledons.

Figure 5

Cellular functions in 3'UTRs. Enriched functional categories within the set of genes associated with each word in the top 25 words of the 3'UTRs. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 6

Cellular functions in 5'UTRs. Enriched functional categories within the set of genes associated with each word in the top 25 words of the 5'UTRs. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 7

Cellular functions in introns. Enriched functional categories within the set of genes associated with each word in the top 25 words of the introns. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 8

Cellular functions in core promoters. Enriched functional categories within the set of genes associated with each word in the top 25 words of the core promoters. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 9

Cellular functions in proximal promoters. Enriched functional categories within the set of genes associated with each word in the top 25 words of the proximal promoters. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

Figure 10

Cellular functions in distal promoters. Enriched functional categories within the set of genes associated with each word in the top 25 words of the distal promoters. The lookup was conducted against the MIPS Functional Catalogue Database (FunCatDB) [54].

A set of 10 frequently enriched cis-elements was recently provided for the ATH95 gene coexpression neighbourhood (AAACCCTA, CTTATCCN, GGCCCANN, GCCACGTN, GCGGGAAN, GACCGTTN, AANGTCAA, CNGATCNA, NCGTGTCN, CATGCANN) [47]. Our results show a direct overlap with two of those words (AAACCCTA, NCGTGTCN), which are detected and marked as 'interesting' in the 5'UTRs, and the proximal promoters, respectively. Several words were hit partially as members of the 'interesting' word clusters (CTTATCCN, GCCACGTN, AANGTCAA, CNGATCNA), while others were not represented in the selected word clusters and the top 25 words. While no overlap for GACCGTTN could be found, it is possible to validate the significance of GGCCCANN and GCGGGAAN through the detection of these two words as unwords in the introns, marking them interesting regulatory elements associated with the expression, but not necessarily with the regulation of the associated alternative splicing process.


The analyses described here provide a first view of the word landscape within the non-coding regions of the Arabidopsis thaliana genome. An analysis centred on the statistically interesting words furnishes important insights into the unique elements of each segment. The correlations of particular words with cellular functions or expression patterns provide valuable hypotheses for further experimentation. Correlation between word position and expression also seems strong, with one class of words only present in the 5'/3'UTRs and introns, and another class of words putatively functioning only in the region upstream of the TSS. Words in the first class seem more directed at regulation of tissue and cellular identity, while words which function upstream appear more likely to be involved in environmental responses.


Word-based genomic signatures are the union of results generated by applying the software pipeline shown in Figure 11. Statistically relevant words are extracted from a set of genomic sequences, and are analyzed to determine similarity, location distribution, groupings, and predicted cellular function.

Figure 11

Process Flowchart. Methodology flow applied for the discovery of word-based genomic signatures in non-coding Arabidopsis thaliana.

Sequence Data

This manuscript reports the results of analyzing DNA sequences of Arabidopsis thaliana. The non-coding genomic segments (specifically, the 3'UTRs, 5'UTRs, promoters and introns) and the entire genomic sequence (as complete chromosomes) were obtained from TAIR (release 8) [19]. Both masked and unmasked versions of the genome were analyzed. Ambiguous nucleotides, depicted in the sequences by the letters [R, Y, W, S, K, M], were removed because they represent sequencing anomalies; this resulted in the removal of 0.15% (or 188,820) of the nucleotides.

In this study, only protein-coding genes were considered as genes, and transposable-like, or pseudo-genes, were omitted. Thus, the total number of genes in this study is ~27,000. Due to different lengths and locations of the promoter elements it is possible that, while core promoters can occur for a specific gene, no distal promoter for that gene exists due to the fact that its location would fall into another gene or even outside of a chromosome. The difference in number of genes in 3'UTRs and 5'UTRs sets compared to other sets is due to genes that lack annotated UTR (it is yet to be discovered).

Whenever multiple spliced transcripts were available for a gene, a major transcript was chosen (Atngnnnnn.1) to prevent bias towards genes that contain multiple transcripts. Likewise, only introns of major transcripts were selected.

Word Enumeration and Scoring

The first pipeline stage employs a radix trie data structure [48] to enumerate all subsequences (words) of a specified length in the given DNA input sequences. For each word w, with o total occurrences in s sequences, a word score is computed as s*ln(s/Es(w)). The expected number of sequences containing word w, Es(w), is computed as the product of (1) the probability for each observed word to occur anywhere in the input sequences and (2) the total length of the sequences. This model implicitly assumes a binomial model for the word distribution, i.e., that the word probabilities are independent of the positions of the words within the sequences [49, 50]. The probability is computed by using a maximum-order homogeneous Markov chain model [49] where the transition probabilities are determined using the Maximum Likelihood Method [50]. (Note that under this model, the (G+C)% biasing is achieved for any order of Markov model greater than or equal to zero, since the frequencies of individual nucleotides are taken into consideration for all orders.) The order of the Markov model was chosen by using a standard chi-square test to assess the appropriateness of Markov chains of orders 0 to 6. To provide the highest precision for computation of expected values, the highest order model that passed the chi-square test was selected. Thus, an order 6 model was selected.

A p-value for each word (representing the probability of obtaining a score at least as high as the one observed [51]) is calculated by using a binomial word distribution to determine the probability of obtaining at least o repeats in the s input sequences that contain w.

Word Clustering

The Word Clustering stage computes a cluster for each of the top scoring words (seed words) identified in the Word Scoring phase. A cluster is computed from a seed word by determining the set of words whose Hamming distance is within a user-specified threshold. A Position/Weight Matrix (PWM) is constructed for each cluster [52], and a sequence logo is created from each PWM using the TFBS module by Lenhart and Wasserman [53]. For example, the PWM for the seed word ATTTTGTA in the 3'UTRs is as follows:

The columns of the PWM correspond to nucleotide positions and the rows correspond to the nucleotides A, C, G, and T, respectively.

Word Location Distribution

For selected words from the different segments it was determined if they were clustered at specific locations along the corresponding sequences in which they occur. In order to detect a location bias, representative of such clusters, histograms were created to show the numbers of occurrences of a specific word at each point corresponding to a positional offset from the transcription start site (TSS). For uniformity, sequence lengths were normalized to the range [1;100], to represent the number of nucleotides between the position and the TSS.

Co-Occurrence Analysis

The Co-Occurrence Analysis considers all non-overlapping pairs of the top ranked words and computes the expected number of sequences that contain both words. Subsequently, the observed number of sequences that contain both words is determined, and an observed-to-expected ratio is computed (using a binomial word distribution) for each word pair.

AGRIS Lookup

Previously published and curated binding site motifs which are equal to or shorter than eight base pairs were extracted from the AGRIS AtcisDB database [25], and were compared with the word lists generated for the different segments. For each motif the corresponding entries in word list were determined and the highest scoring word was identified.

Determine Cellular Function

The MIPS Functional Catalogue Database (FunCatDB) [54], was used for determining over-represented cellular functions in each gene list containing a particular word. The workflow of the cellular function analysis, labelled as "Cellular Function" in the larger process flow (Figure 11) is as follows. For each word in the 'top 25' lists (Table 2, 3, 4, 5, 6, 7, &8) we determined the list of genes that contained the word being analyzed in the corresponding region. Then we determined the functional category of each gene by using the functional category scheme (version 2.1) retrieved from FuncatDB. The p-values for enrichment of categories were calculated by statistical tests with the hypergeometric distribution. After filtering out p-values greater than 1E-5, results were visualized by the matrix2png software package [55].

Analysis of the correlation between word location and gene expression was done as described in [39] with the following exceptions. A larger database was constructed from 1305 available raw microarray datasets (Additional file 19) present in NASC affyarrays and the gene expression omnibus The p-value was calculated using a chi-squared test comparing genes 2-fold induced, 2-fold suppressed, or neutral between observed (all genes harbouring the word) and expected values (based on genomic average). The Bonferroni correction was used to adjust for multiple hypothesis testing. Microarray sources included a large tissue macro-dissection [56], and the follow-up studies on stress, hormones, and pathogens [57]. We included the laser capture microdissected tissue microarray datasets [58], the gene expression profile of the Arabidopsis root [59], analysis of brassinosteroids [60], and the numerous other experiments found in the collected dataset in the above mentioned repositories. Data were normalized using global scaling of the middle 96% data points, and then noise filtered using a t-test of signal vs. background, and a t-test of signal vs. control.


  1. 1.

    Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y: A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics. 2001, 17 (12): 1113-1122. 10.1093/bioinformatics/17.12.1113.

  2. 2.

    Brudno M, Gelfand MS, Spengler S, Zorn M, Dubchak I, Conboy JG: Computational analysis of candidate intron regulatory elements for tissue-specific alternative pre-mRNA splicing. Nucl Acids Res. 2001, 29 (11): 2338-2348. 10.1093/nar/29.11.2338.

  3. 3.

    Ham B-K, Brandom JL, Xoconostle-Cazares B, Ringgold V, Lough TJ, Lucas WJ: A Polypyrimidine Tract Binding Protein, Pumpkin RBP50, Forms the Basis of a Phloem-Mobile Ribonucleoprotein Complex. Plant Cell. 2009, 21 (1): 197-215. 10.1105/tpc.108.061317.

  4. 4.

    Huang N-C, Yu T-S: The sequences of Arabidopsis GA-INSENSITIVE RNA constitute the motifs that are necessary and sufficient for RNA long-distance trafficking. The Plant Journal. 2009, 59 (6): 921-929. 10.1111/j.1365-313X.2009.03918.x.

  5. 5.

    Sunyaev SR, Lathe WC, Ramensky VE, Bork P: SNP frequencies in human genes an excess of rare alleles and differing modes of selection. Trends Genet. 2000, 16 (8): 335-337. 10.1016/S0168-9525(00)02058-8.

  6. 6.

    Symonds VV, Lloyd AM: An Analysis of Microsatellite Loci in Arabidopsis thaliana: Mutational Dynamics and Application. Genetics. 2003, 165 (3): 1475-1488.

  7. 7.

    Biemont C, Vieira C: What transposable elements tell us about genome organization and evolution: the case of Drosophila. Cytogenet Genome Res. 2005, 110: 25-34. 10.1159/000084935.

  8. 8.

    Neznanov N, Umezawa A, Oshima RG: A Regulatory Element within a Coding Exon Modulates Keratin 18Gene Expression in Transgenic Mice. Journal of Biological Chemistry. 1997, 272 (44): 27549-27557. 10.1074/jbc.272.44.27549.

  9. 9.

    Bilodeau PS, Domsic JK, Stoltzfus CM: Splicing Regulatory Elements within tat Exon 2 of Human Immunodeficiency Virus Type 1 (HIV-1) Are Characteristic of Group M but Not Group O HIV-1 Strains. Journal of Virology. 1999, 73 (12): 9764-9772.

  10. 10.

    Bornstein P, McKay J, Morishima JK, Devarayalu S, Gelinas RE: Regulatory elements in the first intron contribute to transcriptional control of the human alpha 1(I) collagen gene. Proceedings of the National Academy of Sciences of the United States of America. 1987, 84 (24): 8869-8873. 10.1073/pnas.84.24.8869.

  11. 11.

    Fiume E, Christou P, Gianì S, Breviario D: Introns are key regulatory elements of rice tubulin expression. Planta. 2004, 218 (5): 693-703. 10.1007/s00425-003-1150-0.

  12. 12.

    Smith AN, Barth ML, McDowell TL, Moulin DS, Nuthall HN, Hollingsworth MA, Harris A: A Regulatory Element in Intron 1 of the Cystic Fibrosis Transmembrane Conductance Regulator Gene. J Biol Chem. 1996, 271 (17): 9947-9954. 10.1074/jbc.271.17.9947.

  13. 13.

    Rohrer J, Ellen Conley M: Transcriptional Regulatory Elements Within the First Intron of Bruton's Tyrosine Kinase. Blood. 1998, 91 (1): 214-221.

  14. 14.

    Wardrop SL, Brown MA: Identification of two evolutionarily conserved and functional regulatory elements in intron 2 of the human BRCA1 gene. Genomics. 2005, 86 (3): 316-328. 10.1016/j.ygeno.2005.05.006.

  15. 15.

    Reid LH, Gregg RG, Smithies O, Koller BH: Regulatory Elements in the Introns of the Human HPRT Gene are Necessary for Its Expression in Embryonic Stem Cells. Proceedings of the National Academy of Sciences of the United States of America. 1990, 87 (11): 4299-4303. 10.1073/pnas.87.11.4299.

  16. 16.

    Rippe RA, Lorenzen SI, Brenner DA, Breindl M: Regulatory elements in the 5'-flanking region and the first intron contribute to transcriptional control of the mouse alpha 1 type I collagen gene. Mol Cell Biol. 1989, 9 (5): 2224-2227.

  17. 17.

    Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408 (6814): 796-815. 10.1038/35048692.

  18. 18.

    Polakowska RR, Graf BA, Falciano V, LaCelle P: Transcription regulatory elements of the first intron control human transglutaminase type I gene expression in epidermal keratinocytes. Journal of Cellular Biochemistry. 1999, 73 (3): 355-369. 10.1002/(SICI)1097-4644(19990601)73:3<355::AID-JCB7>3.0.CO;2-D.

  19. 19.

    Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al: The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucl Acids Res. 2003, 31 (1): 224-228. 10.1093/nar/gkg076.

  20. 20.

    Hudson ME, Quail PH: Identification of promoter motifs involved in the network of phytochrome A-regulated gene expression by combined analysis of genomic sequence and microarray data. Plant Physiology. 2003, 133 (4): 1605-1616. 10.1104/pp.103.030437.

  21. 21.

    Rapp P, Carlson J, Michael T, McClung C, Gross R: Examination of Arabidopsis thaliana upstream regions for possible promoter motifs. 5th Annual Conference on Computational Genomics: 2001; Baltimore, Maryland. 2001

  22. 22.

    Zhang W, Ruan J, Ho T-hD, You Y, Yu T, Quatrano RS: Cis-regulatory element based targeted gene finding: genome-wide identification of abscisic acid- and abiotic stress-responsive genes in Arabidopsis thaliana. Bioinformatics. 2005, 21 (14): 3074-3081. 10.1093/bioinformatics/bti490.

  23. 23.

    Molina C, Grotewold E: Genome wide analysis of Arabidopsis core promoters. BMC Genomics. 2005, 6 (1): 25-10.1186/1471-2164-6-25.

  24. 24.

    Steffens NO, Galuschka C, Schindler M, Bulow L, Hehl R: AthaMap web tools for database-assisted identification of combinatorial cis-regulatory elements and the display of highly conserved transcription facote binding sites in Arabidopsis thaliana. Nucleic Acids Res. 2005, W397-402. 10.1093/nar/gki395. 33 Web Server

  25. 25.

    Davuluri R, Sun H, Palaniswamy SK, Matthews N, Molina C, Kurtz M, Grotewold E: AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics. 2003, 4 (1): 25-10.1186/1471-2105-4-25.

  26. 26.

    Lichtenberg J, Jacox E, Welch J, Kurz K, Liang X, Yang M, Drews F, Ecker K, Lee S, Elnitski L, et al: Word-based characterization of promoters involved in human DNA repair pathways. BMC Genomics. 2009, 10 (Suppl 1): S18-10.1186/1471-2164-10-S1-S18.

  27. 27.

    Lichtenberg J, Morris P, Ecker K, Welch L: Discovery of regulatory elements in oomycete orthologs. The 2008 International Conference on Bioinformatics and Computational Biology. Las Vegas. 2008

  28. 28.

    Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics. 2001, 17 (Suppl 1): S207-214.

  29. 29.

    Pavesi G, Mauri G, Pesole G: In silico representation and discovery of transcription factor binding sites. Briefings in Bioinformatics. 2004, 5 (3): 217-236. 10.1093/bib/5.3.217.

  30. 30.

    Pavesi G, Mereghetti P, Mauri G, Pesole G: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 2004, 32: W199-203. 10.1093/nar/gkh465.

  31. 31.

    Apostolico A, Bock ME, Lonardi S, Xu X: Efficient Detection of Unusual Words. Journal of Computational Biology. 2000, 7 (1-2): 71-94. 10.1089/10665270050081397.

  32. 32.

    van Helden J, Andre B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol. 1998, 281 (5): 827-842. 10.1006/jmbi.1998.1947.

  33. 33.

    Bussemaker HJ, Li H, Siggia ED: Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc Natl Acad Sci. 2000, 97 (18): 10096-10100. 10.1073/pnas.180265397.

  34. 34.

    Marsan L, Sagot MF: Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. Proceedings of the fourth annual international conference on Computational molecular biology: 2000. 2000, 345-362.

  35. 35.

    Sinha S, Tompa M: Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2002, 30 (24): 5549-5560. 10.1093/nar/gkf669.

  36. 36.

    Vilo J, Brazma A, Jonassen I, Robinson A, Ukkonen E: Mining for putative regulatory elements in the yeast genome using gene expression data. Proc Int Conf Intell Syst Mol Biol. 2000, 8: 384-394.

  37. 37.

    Sinha S, Tompa M: YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2003, 31 (13): 3586-3588. 10.1093/nar/gkg618.

  38. 38.

    Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology. 2005, 23: 137-144. 10.1038/nbt1053.

  39. 39.

    Geisler M, Kleczkowski LA, Karpinski S: A universal algorithm for genome-wide in silicio identification of biologically significant gene promoter putative cis-regulatory-elements; identification of new elements for reactive oxygen species and sucrose signaling in Arabidopsis. Plant Journal. 2006, 45: 384-398. 10.1111/j.1365-313X.2005.02634.x.

  40. 40.

    Palaniswamy SK, James S, Sun H, Lamb RS, Davuluri RV, Grotewold E: AGRIS and AtRegNet. A Platform to Link cis-Regulatory Elements and Transcription Factors into Regulatory Networks. Plant Physiol. 2006, 140 (3): 818-829. 10.1104/pp.105.072280.

  41. 41.

    Grotewold E, Springer N: The Plant Genome: Decoding the Transcriptional Hardwiring. Annual Plant Reviews. 2009, 35: 196-227.

  42. 42.

    Smale ST, Kadonaga JT: The RNA Polymerase II Core Promoter. Annu Rev Biochem. 2003, 72 (1): 449-479. 10.1146/annurev.biochem.72.121801.161520.

  43. 43.

    Herold J, Kurtz S, Giegerich R: Efficient computation of absent words in genomic sequences. BMC Bioinformatics. 2008, 9 (1): 167-10.1186/1471-2105-9-167.

  44. 44.

    Acquisti C, Poste G, Curtiss D, Kumar S: Nullomers: really a matter of natural selection. PLoS ONE. 2007, 2 (10): 10.1371/journal.pone.0001022.

  45. 45.

    Hampikian G, Andersen T: Absent sequences: nullomers and primes. Pacific Sym on Biocomputing. 2007, 12: 355-366. full_text.

  46. 46.

    Lifton RP, Goldberg ML, Karp RW, Hogness DS: The organization of the histone genes in Drosophila melanogaster: functional and evolutionary implications. Cold Spring Harbor symposium on quantitative biology. 1978, 42 (Pt 2): 1047-

  47. 47.

    Vandepoele K, Quimbaya M, Casneuf T, De Veylder L, Peer Van de Y: Unraveling Transcriptional Control in Arabidopsis Using cis-Regulatory Elements and Coexpression Networks. Plant Physiol. 2009, 150 (2): 535-546. 10.1104/pp.109.136028.

  48. 48.

    Morrison DR: PATRICIA: Practical Algorithm to Retrieve Information Coded in Alphanumeric. Journal of the ACM. 1968, 15 (4): 514-534. 10.1145/321479.321481.

  49. 49.

    Robin S, Rodolphe F, Schbath S: DNA, Words and Models. 2005, Cambridge: Cambridge University Press

  50. 50.

    Ewens WJ, Grant GR: Statistical Methods in Bioinformatics. 2001, New York: Springer

  51. 51.

    Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Mueller T: Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics. 2008, 24 (13): 223-231. 10.1093/bioinformatics/btn161.

  52. 52.

    Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I: Identification of Transcription Factor Binding Sites with Variable-order Bayesian Networks. Bioinformatics. 2005, 21 (11): 2657-2666. 10.1093/bioinformatics/bti410.

  53. 53.

    Lenhard B, Wasserman WW: TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics. 2002, 18: 1135-1136. 10.1093/bioinformatics/18.8.1135.

  54. 54.

    Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Guldener U, Mannhaupt G, Munsterkotter M, et al: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004, 32 (18): 5539-5545. 10.1093/nar/gkh894.

  55. 55.

    Pavlidis P, Noble WS: Matrix2png: A Utility for Visualizing Matrix Data. Bioinformatics. 2003, 19: 295-296. 10.1093/bioinformatics/19.2.295.

  56. 56.

    Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Scholkopf B, Weigel D, Lohmann JU: A gene expression map of Arabidopsis thaliana development. Nature Genetics. 2005, 37: 501-506. 10.1038/ng1543.

  57. 57.

    Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, D'Angelo C, Bornberg-Bauer E, Kudla J, Harter K: The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. The Plant Journal. 2007, 50 (2): 347-363. 10.1111/j.1365-313X.2007.03052.x.

  58. 58.

    Spencer MWB, Casson SA, Lindsey K: Transcriptional profiling of the Arabidopsis embryo. Plant Physiology. 2007, 143: 924-940. 10.1104/pp.106.087668.

  59. 59.

    Brady SM, Orlando DA, Lee JY, Wang JY, Koch J, Dinneny JR, Mace D, Ohler U, Benfey PN: A high-resolution root spatiotemporal map reveals dominant expression patterns. Science. 2007, 318: 801-806. 10.1126/science.1146265.

  60. 60.

    Goda H, Shimada Y, Asami T, Fujioka S, Yoshida S: Microarray analysis of brassinosteroid-regulated genes in Arabidopsis. Plant Physiology. 2002, 130: 1319-1334. 10.1104/pp.011254.

Download references


The Ohio University team acknowledges the support of the Stocker Endowment, Ohio University's Graduate Research and Education Board (GERB), the Ohio Supercomputer Center, the Choose Ohio First Initiative of the University System of Ohio. We also wish to thank Sarah Wyatt for proofreading the manuscript. The Ohio University team further acknowledges that salaries and research support are provided by state funds appropriated to the Ohio Plant Biotechnology Consortium through The Ohio State University, Ohio Agricultural Research and Development Center. MG acknowledges the support of the plant biology department at Southern Illinois University Carbondale, and would like to thank Patrick Brown and Elisabeth Fitzek for help in assembling and analysis of the microarray database and tools. Funding for EG was provided by National Science Foundation grants MCB-0418891, MCB-0705415 and by state funds appropriated to the Ohio Plant Biotechnology Consortium through The Ohio State University, Ohio Agricultural Research and Development Center. Finally we would like to acknowledge the anonymous reviewers who helped in enhancing the quality of the manuscript presented here.

Author information

Correspondence to Jens Lichtenberg.

Additional information

Authors' contributions

JL contributed in the development of algorithms and models, the implementation of algorithms, generation of the results and drafting of the document. JDW contributed in the development of the models and algorithms and the implementation of the approaches. KK contributed in the development, implementation and testing of models and algorithms. XL contributed in the development of the models and algorithms for co-occurrence analysis and generated the respective data. FD contributed in the development of models and algorithms, and in the implementation of the methods. MG conducted correlation analysis between word presence/location and gene expression pattern. KE contributed the idea of Hamming-distance-based clustering. SSL contributed to the statistical foundations of the scoring model. AY's contributions include extraction of data sets, functional analysis of words, and writing the manuscript. EG contributed to writing the manuscript and integrating the identified words with existing knowledge on control of gene expression. In addition to architecting the software pipeline employed in this research, LRW contributed to the design, implementation and validation of models and algorithms (especially in the areas of word searching and word scoring) and to the writing of this manuscript.

All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Words discovered in 3'UTRs. Entire set of words discovered in the 3'UTRs with occurrences, expected occurrences, scores, reverse complement information and p-value. (CSV 5 MB)

Additional file 2: Words discovered in 5'UTRs. Entire set of words discovered in the 5'UTRs with occurrences, expected occurrences, scores, reverse complement information and p-value. (CSV 5 MB)

Additional file 3: Words discovered in introns. Entire set of words discovered in the introns with occurrences, expected occurrences, scores, reverse complement information and p-value. (CSV 6 MB)

Additional file 4: Words discovered in core promoters. Entire set of words discovered in the core promoters [-100;+1] with occurrences, expected occurrences, scores, reverse complement information and p-value. (CSV 5 MB)

Additional file 5: Words discovered in proximal promoters. Entire set of words discovered in the proximal promoters [-1,000;-101] with occurrences, expected occurrences, scores, reverse complement information and p-value. (CSV 6 MB)

Additional file 6: Words discovered in distal promoters. Entire set of words discovered in the distal promoters [-3,000;-1,001] with occurrences, expected occurrences, scores, reverse complement information and p-value. (CSV 6 MB)

Additional file 7: Words discovered in entire genome. Entire set of words discovered in the complete genome with occurrences, expected occurrences, scores, reverse complement information and p-value. (CSV 4 MB)

Additional file 8: Words missed in 3'UTRs. Entire set of words expected to occur but not discovered in the 3'UTRs with expected occurrences. (CSV 10 KB)

Additional file 9: Words missed in 5'UTRs. Entire set of words expected to occur but not discovered in the 5'UTRs with expected occurrences. (CSV 13 KB)

Additional file 10: Words missed in introns. Entire set of words expected to occur but not discovered in the introns with expected occurrences. (CSV 2 KB)

Additional file 11: Words missed in core promoters. Entire set of words expected to occur but not discovered in the core promoters with expected occurrences. (CSV 10 KB)

Additional file 12: Word based clusters. Word-based clusters built around 2 overrepresented words of each non-coding segment of Arabidopsis thaliana represented by the word cluster and the sequence logo associated with said cluster. A word in a word cluster is presented through the nucleotide sequence associated with the word, the sequence count, the overall count and the SlnSES score. (DOC 376 KB)

Additional file 13: Word co-occurrences in 3'UTRs. Entire set of co-occurring words (taken from the top 25 words) discovered in the 3'UTRs with occurrence, expected occurrences and scores. (CSV 38 KB)

Additional file 14: Word co-occurrences in 5'UTRs. Entire set of co-occurring words (taken from the top 25 words) discovered in the 5'UTRs with occurrence, expected occurrences and scores. (CSV 42 KB)

Additional file 15: Word co-occurrences in introns. Entire set of co-occurring words (taken from the top 25 words) discovered in the introns with occurrence, expected occurrences and scores. (CSV 31 KB)

Additional file 16: Word co-occurrences in core promoters. Entire set of co-occurring words (taken from the top 25 words) discovered in the core promoters with occurrence, expected occurrences and scores. (CSV 43 KB)

Additional file 17: Word co-occurrences in proximal promoters. Entire set of co-occurring words (taken from the top 25 words) discovered in the proximal promoters with occurrence, expected occurrences and scores. (CSV 64 KB)

Additional file 18: Word co-occurrences in distal promoters. Entire set of co-occurring words (taken from the top 25 words) discovered in the distal promoters with occurrence, expected occurrences and scores. (CSV 67 KB)

Additional file 19: NASC Microarrays. Entire set of microarray experiments available in NASC that were used for the cellular functional analysis. (XLS 560 KB)

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Lichtenberg, J., Yilmaz, A., Welch, J.D. et al. The word landscape of the non-coding segments of the Arabidopsis thaliana genome. BMC Genomics 10, 463 (2009).

Download citation


  • Transcription Start Site
  • Word Pair
  • Arabidopsis Genome
  • Core Promoter
  • Proximal Promoter