Patterns and evolution of ACGT repeat cis-element landscape across four plant genomes
© Mehrotra et al.; licensee BioMed Central Ltd. 2013
Received: 30 July 2012
Accepted: 18 March 2013
Published: 25 March 2013
Transcription factor binding is regulated by several interactions, primarily involving cis-element binding. These binding sites maintain specificity through their sequence and through additional factors such as inter-motif distance and spacer specificity. The ACGT core sequence has been established as a functionally important cis-element which frequently regulates gene expression in synergy with other cis-elements. In this study, we used two monocotyledonous species (Oryza sativa and Sorghum bicolor) and two dicotyledonous species (Arabidopsis thaliana and Glycine max) to analyze the conservation of co-occurring ACGT core elements in plant promoters with respect to the spacer distance between them. Using data generated from Arabidopsis thaliana and Oryza sativa, we also identified conserved regions across all spacers and possible conditions regulating gene promoters with multiple ACGT cis-elements.
Our data indicated specific predominant spacer lengths between co-occurring ACGT elements, but these lengths were not universally conserved across all species under analysis. However, the frequency distribution indicated local regions of high correlation among monocots and dicots. Sequence specificity data clearly revealed a preference for G at the first and C at the terminal position of a spacer sequence, suggesting that the G-box motif is the most prevalent for the ACGT class of promoters. Using gene expression databases, we also observed trends suggesting that co-occurring ACGT elements are responsible for gene regulation in response to exogenous stress. Conservation in patterns of ACGT (N) ACGT among orthologous genes also indicated the possibility that emergence of functional significance across species was a result of parallel evolution of these cis-elements.
Although the importance of ACGT elements has been acknowledged for several plant species, ours is the first study that attempts to compare their occurrence across four species and analyze conservation among them. The apparent preference for particular spacer distances suggests that these motifs might be implicated in important physiological functions which are yet to be identified. Variations in correlation patterns among monocots and dicots might arise from differences in transcriptional regulation in the two classes. In accordance with the literature, we established the involvement of co-occurring ACGT elements in stress responses and showed how this regulation differs with variation in the ACGT (N) ACGT motif. We believe that our study will be an essential resource in determining the optimum spacer length and spacer sequence between ACGT elements for promoter design in the future.
Eukaryotic gene expression and its regulation by means of transcription is one of the most significant areas of current research. Regulation of gene transcription relies on interactions among transcription factors (TFs) which bind to specific DNA cis sites to form a conglomeration of proteins which guide polymerase binding. These cis sites include enhancers, core promoters, matrix or scaffold attachment regions, insulators and silencers. Of these, enhancer elements are an important class of cis-regulatory sequences that are usually present upstream of the transcription initiation site and contain multiple short binding site sequences for targeting several activators and repressors of transcription. These short binding site sequences are often referred to as sequence motifs and occur in recurring patterns across DNA. It is generally accepted that sequence motifs co-evolve with their core promoter in both sequence-specific and location-specific ways to achieve a directed target function [3, 4]. Comparative studies show that conserved regions of promoters occur widely across the genomes of various species, indicating directed evolution and correlating with the notion that functionally less important regions of DNA evolve (in terms of mutant substitutions) faster than more important ones. For example, in a comparison of the 200 base pair early enhancer of Hoxc8 in 29 species of mammals, the complete nucleotide sequences of this region were 90% similar across all taxa, confirming that this enhancer sequence has been specifically conserved [7, 8]. Apart from the sequence of these motifs, the positional and inter-motif distances within the enhancers also play a critical role in interactions between transcription factors, as the spacing and intermediary sequences control the size and strength of interactions of TF binding, subsequently affecting gene expression.
In fact, unless helical phasing is conserved to provide allowance for protein binding, even a change in spacer length by one base pair can drastically alter gene expression levels. Taken together, this implies that conserved sequences (conforming to the various spatial and positional constraints) that occur more often than expected by chance may be of specific functional significance.
In plant genomes, one such sequence motif, the ACGT core sequence, is functionally important in a variety of promoters that respond to different stimuli like light, anaerobiosis, jasmonic acid and hormones such as salicylic acid, abscisic acid [16–18] and auxin. This core element is present at different relative positions in multiple copies upstream of the transcription start site, and any alterations in this core sequence reduce the overall promoter activity significantly, for it contributes synergistically to gene expression by stabilizing the transcription complex formed on the minimal promoter. Co-occurring ACGT elements are over-represented in Arabidopsis and rice genomes, emphasizing their functional relevance when compared to single ACGT core elements. As discussed, the inter-motif distance between these co-occurring ACGT sequences is of particular importance, as promoter activation by ACGT is differentially regulated by the spacing between two copies of the motif. Additionally, the copy number of ACGT elements in a promoter and their distance from the transcription start site also drastically alter gene expression. While most reports on the ACGT core sequence are based on Arabidopsis thaliana, the ACGT family of promoters (ACEs) have also been identified in wheat, rice [26–28] and barley, suggesting that ACEs are conserved across plant species.
Given that the ACGT core sequence is dispersed across promoters of various plant species, and that it occurs in multiple copies with a variable number of base pairs separating them, we have attempted to analyze patterns in the occurrence of ACGT core element repeats in plant genomes. We performed an in-silico search for ACGT elements separated by spacers of varying lengths in all identified promoters for two monocots (Sorghum and Rice) and two dicots (Arabidopsis and Soybean). Our data indicated similarities in the frequency patterns across the four plant species, with correlations for particular spacer lengths between ACGT core elements. To analyze whether a specific sequence motif is preferred as a spacer between multiple ACGT elements across all promoters, we developed consensus sequences from all spacers observed. Additionally, we studied the evolution of co-occurring ACGT elements by analyzing their prevalence among orthologous genes in Arabidopsis, Rice and Sorghum. Further, to understand the functional significance of these elements, we used microarray data to analyze which conditions might be responsible for regulation of genes containing multiple ACGT cis-elements.
Focusing our analysis on promoters, we extracted 1 kb sequences upstream of all identified chromosomal genes from the following genomes: Arabidopsis thaliana (The Arabidopsis Genome Initiative v. 10, 2011), Oryza sativa (Rice) (International Rice Genome Sequencing Project, Build 4.0, 2009), Glycine max (Soybean) (US DOE Joint Genome Institute (JGI-PGF), v. 1.0, 2010) and Sorghum bicolor (Sorghum) (Sorghum Consortium, v. 1.0, 2009), using the NCBI Reference Sequence database [30–33]. Using custom code (Additional file 1), we extracted gene annotation information (Gene ID/Arabidopsis TAIR ID and ATG site) from GenBank files and the corresponding 1 kb upstream region from the FASTA sequence. We searched for co-occurring ACGT elements of the form ACGT (N) ACGT, where 0 ≤ N ≤ 30, in all extracted 1 kb regions. As it has been previously seen that cooperatively binding transcription factors are usually spaced within 25 bp, we limited our analysis to a spacer distance of 30 bp. The sequence of each spacer (the region between two ACGT core elements) was extracted and the total number of occurrences for each spacer length was determined for each species. To test the significance of these frequencies, we used four non-palindromic (TAGC, CGTA, GCTA, ATGC) and four palindromic (AGCT, TGCA, CTAG, GATC) sequences as controls. Using the PLACE database, we ensured that none of these 4 bp sequences is itself a conserved cis-element. By performing a similar analysis on each control sequence, we compared the frequency of the ACGT (N) ACGT motif with the corresponding frequencies of the control sequences for the same N.
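The spacer search described above can be sketched as follows. This is a minimal illustration and not the code from Additional file 1: the function name `spacer_counts` and its parameters are hypothetical, and it assumes that every pair of core elements within 30 bp is counted, including pairs whose spacer itself contains another copy of the core.

```python
import re
from collections import Counter


def spacer_counts(promoter, core="ACGT", max_spacer=30):
    """Count core(N)core occurrences by spacer length N (0 <= N <= max_spacer).

    Returns (counts, spacers): a Counter keyed by N and the list of spacer
    sequences found. Hypothetical helper, not the authors' implementation.
    """
    seq = promoter.upper()
    # Zero-width lookahead finds every start position of the core,
    # so no occurrence is skipped when cores sit back to back.
    starts = [m.start() for m in re.finditer(f"(?={re.escape(core)})", seq)]
    counts = Counter()
    spacers = []
    for i, first in enumerate(starts):
        for second in starts[i + 1:]:
            n = second - (first + len(core))
            if n < 0:
                continue  # the two cores overlap (possible for other motifs)
            if n > max_spacer:
                break  # later starts are even further away
            counts[n] += 1
            spacers.append(seq[first + len(core):second])
    return counts, spacers


# Toy sequence for illustration only (not real promoter data).
counts, spacers = spacer_counts("ttACGTGGCACGTaa")
print(dict(counts), spacers)
```

Passing one of the control sequences as `core` (for example `core="TAGC"`) gives the corresponding control frequencies for the same N.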
Spacer sequences occurring in all promoters of Arabidopsis thaliana were analyzed to identify preferences for a particular nucleotide at each position within the spacer sequence. The percentage of A, G, C and T at each position for all spacer lengths (N = 0–30) was calculated to identify preferences at particular positions within the spacers. Since the genome-wide GC content for Arabidopsis thaliana is known to be around 36%, we chose a threshold occurrence percentage of 25% for C/G and 40% for A/T for a particular position in all spacers of the same length (N). Single-letter IUPAC DNA codes were assigned to each position to generate consensus sequences for all spacer lengths.
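One way to turn these per-position percentages into a consensus is sketched below. The thresholds (25% for C/G, 40% for A/T) come from the text; the rule used here, keeping every nucleotide that reaches its threshold and writing the IUPAC code for that set (or N when none qualifies), is an assumption, as is the function name `consensus`.

```python
# Standard IUPAC nucleotide codes for sets of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

# Threshold percentages stated in the text.
THRESHOLDS = {"A": 40.0, "T": 40.0, "C": 25.0, "G": 25.0}


def consensus(spacers):
    """Consensus for spacers of one length (assumed rule, see lead-in)."""
    length = len(spacers[0])
    if any(len(s) != length for s in spacers):
        raise ValueError("all spacers must have the same length")
    out = []
    for pos in range(length):
        column = [s[pos] for s in spacers]
        kept = {
            base for base in "ACGT"
            if 100.0 * column.count(base) / len(column) >= THRESHOLDS[base]
        }
        # No base reaching its threshold is reported as N (any base).
        out.append(IUPAC.get(frozenset(kept), "N"))
    return "".join(out)


# Toy spacers for illustration only.
print(consensus(["GAC", "GTC", "GAC", "GCC"]))
```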
In order to understand the mechanism of evolution of the ACGT (N) ACGT cis-element in the aforementioned plant species, we analyzed its predominance for N = 0–30 in all identified in-paralogs/orthologs among Arabidopsis, Rice and Sorghum. Gene names for Rice were converted from MSU annotation to RAP annotation prior to analysis. Frequencies for co-occurring ACGT elements were analyzed from extracted genes.
where X = P(A ∩ B)/P(B) and Y = P(A);
A = event that a given gene is regulated (up/down) by a particular condition; B = event that a given gene contains multiple ACGT elements separated by N base pairs. Further, we calculated the overall likelihood of occurrence for each condition (for N = 0–30). All conditions with likelihood of occurrence > 1.30 were subjected to further statistical analysis using the 8 control sequences described earlier.
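The quantities X and Y can be computed from gene sets as in the sketch below. The defining equation is not reproduced in the text here, so the sketch assumes that the likelihood of occurrence is the ratio X/Y, that is, P(A | B)/P(A); the function name and arguments are hypothetical.

```python
def likelihood_of_occurrence(regulated, with_motif, all_genes):
    """Ratio X/Y with X = P(A and B)/P(B) and Y = P(A) (assumed definition).

    regulated  -- genes regulated by the condition (event A)
    with_motif -- genes whose promoter has ACGT (N) ACGT (event B)
    all_genes  -- every gene considered in the analysis
    """
    universe = set(all_genes)
    a = set(regulated) & universe
    b = set(with_motif) & universe
    if not a or not b:
        raise ValueError("both gene sets must be non-empty")
    x = len(a & b) / len(b)      # P(A | B)
    y = len(a) / len(universe)   # P(A)
    return x / y


# Toy gene identifiers for illustration only.
genes = [f"g{i}" for i in range(100)]
regulated = genes[:20]                   # 20% of all genes respond
with_motif = genes[16:20] + genes[50:56] # 4 of these 10 respond
print(likelihood_of_occurrence(regulated, with_motif, genes))
```

A value above 1 means that genes carrying the motif respond to the condition more often than genes in general; under this reading, the 1.30 cutoff corresponds to a 30% enrichment.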
Wherever possible, statistical analysis was performed to determine the significance of results. The frequency of ACGT (N) ACGT was assessed for significant peaks by box-and-whisker plots, with 10% and 90% whiskers. The outliers were considered as potential peaks, especially if they were present across all species. The degree of correlation between the two monocot species and between the two dicot species was calculated by taking the frequencies for N consecutive spacer lengths at a time (N ≥ 6, where N here denotes the number of consecutive spacer lengths compared), over spacer lengths 0 to 30 for each species. Assuming a Gaussian distribution, Pearson's correlation coefficient was calculated for each case to determine significance. Runs of consecutive spacer lengths with the highest degrees of correlation were interpreted as the most conserved spacer distances.
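The windowed correlation described above can be sketched as follows. The function names are hypothetical, the input lists are assumed to hold one frequency per spacer length from 0 to 30, and the t statistic uses the standard formula for testing a Pearson coefficient, t = r·sqrt((n − 2)/(1 − r²)), which is an assumption about how significance was assessed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant series
    return sxy / sqrt(sxx * syy)


def windowed_correlations(freq_a, freq_b, min_window=6):
    """Correlate two species over every run of consecutive spacer lengths.

    Returns {(first_spacer, last_spacer): r} for all windows that are at
    least min_window spacer lengths wide.
    """
    if len(freq_a) != len(freq_b):
        raise ValueError("frequency lists must have the same length")
    total = len(freq_a)
    results = {}
    for width in range(min_window, total + 1):
        for start in range(total - width + 1):
            window = slice(start, start + width)
            results[(start, start + width - 1)] = pearson(
                freq_a[window], freq_b[window]
            )
    return results


def t_statistic(r, n):
    """t value for a Pearson r computed from n paired points."""
    return r * sqrt((n - 2) / (1 - r * r))
```

Sorting the returned dictionary by r picks out the most strongly correlated runs of spacer lengths between two species.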
To identify conditions regulating ACGT (N) ACGT containing promoters over the 8 control elements, a Grubbs' outlier test was performed on the likelihood of occurrence for each condition, assuming the data set was normally distributed. If the ACGT (N) ACGT likelihood emerged as significantly higher than the controls by this test for a certain condition, the motif was interpreted to be specifically regulated by that condition.
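A minimal sketch of this comparison is given below, assuming the nine likelihood values for one condition (ACGT plus the eight controls) are tested together. It computes the usual Grubbs statistic G = |x − mean| / s; the critical value depends on the sample size and significance level and is left as an argument to be taken from a Grubbs table. The function names and the example numbers are hypothetical.

```python
from statistics import mean, stdev


def grubbs_statistic(values, index):
    """Grubbs statistic G for values[index], using the sample std. dev."""
    s = stdev(values)
    if s == 0:
        raise ValueError("all values are identical")
    return abs(values[index] - mean(values)) / s


def is_upper_outlier(values, index, critical_value):
    """True if values[index] is the maximum and G exceeds critical_value.

    critical_value must be looked up for the chosen significance level
    and sample size; it is not computed here.
    """
    return (
        values[index] == max(values)
        and grubbs_statistic(values, index) > critical_value
    )


# Made-up likelihood values for illustration only:
# index 0 stands for ACGT, the remaining eight for the control sequences.
likelihoods = [3.0, 1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.0]
print(round(grubbs_statistic(likelihoods, 0), 3))
```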
Based on trends in variation of frequencies in the four species, we observed patterns of high correlation specific to monocots and dicots (Figure 1A, B). The two dicotyledonous plants in our study, Arabidopsis and Soybean, showed high correlation in frequencies for spacer distance (N) = 6–11 (r = 0.974; t = 8.599; p = 0.0010; N = 6) (Figure 1A). Similarly, Rice and Sorghum, both monocotyledonous plants, showed regions of high correlation for spacer distance (N) = 0–6 (r = 0.984; t = 12.42; p = 0.0001; N = 7) and (N) = 18–26 (r = 0.934; t = 6.921; p = 0.0002; N = 9). Comparisons across monocots and dicots were, for the most part, not as strongly correlated (r < 0.8).
The ACGT core motif forms an important class of cis-elements implicated in a variety of functions. Multiple ACGT motifs have been shown to form enhancer elements which bind synergistically to transcription factors for gene regulation. Our data indicate that certain spacer lengths are preferred over others in plant promoters. It is possible that these spacer lengths are abundant because of extra stability conferred by helical phasing at these particular lengths. A major finding was the peak at spacer distance (N) = 7 for rice, which is noteworthy as it is almost twice the next highest frequency observed in the distribution. To the best of our knowledge, there are no previous reports describing an interaction between cis-elements in rice with such spatial constraints. Further investigation of this observation could therefore be interesting. Unfortunately, the peaks observed in frequency distributions were not consistent throughout the four plant species, or even within individual classes. However, the consistency in the frequency dips observed among monocots and dicots suggests a class-specific mode of gene regulation. A possible explanation for the consistent dips could be that certain spacer lengths cause sterically incompatible binding of transcription factors, which would explain why they are less preferred than spacer lengths that allow compatible binding. The theory of a class-specific mode of regulation is also supported by differences in the regions of high correlation in monocots and dicots.
The correlations in the patterns of ACGT repeat frequencies led us to speculate about the precise mechanism of conservation of these repeats. The similar trends and subsequent functional significance of ACGT elements suggest two possible mechanisms of progression: either parallel or convergent evolution. We expect that a significant correlation in the spacer patterns of orthologous/paralogous gene groups would indicate that ACGT elements arose by parallel evolution, whereas no correlation would suggest convergent mechanisms. Our results strongly suggest parallel evolution, as patterns in ACGT elements appear to have evolved from a common ancestral gene and subsequently persisted in the descendant genes across species.
The subsequent comparison of co-occurring ACGT elements with random 4 bp nucleotides indicated that ACGT elements are not as predominant in plant promoters as compared to other random elements. This observation indicates that although a higher frequency results from a preferred occurrence, it might not be a result of conservation in the genome. Nevertheless, on analyzing the same random elements for functional preference using microarray data, ACGT (N) ACGT elements were found to be predominant in spite of lower frequencies than random elements in all promoters. This fact underscores the functional relevance of ACGT (N) ACGT cis-elements.
Based on our functional analysis, we deduced that co-occurring ACGT elements are involved in gene regulation in response to stress conditions in both Arabidopsis and Rice, suggesting functional significance across species. Genes which were up-regulated by salt and drought stress were much more likely to contain cis-elements of the form ACGT (N) ACGT in their promoters. This observation is supported by previous reports that multiple basic leucine zipper (bZIP) transcription factors, which recognize the ACGT core site, have been implicated in the response to drought and high salinity conditions in Arabidopsis [16, 43]. Similarly, bZIP transcription factors have been shown to be involved in regulation under drought stress in Rice and Soybean [44, 45]. Further, although regulation by jasmonic acid did not satisfy the criteria for the Grubbs' outlier test, the likelihood of occurrence of genes regulated by jasmonic acid was higher than the cutoff of 1.30. Given jasmonic acid's established involvement in mediating stress responses in plants, this observation is extremely interesting in light of our findings.
Spacer sequence results showed a clear preference for G at the first and C at the terminal position for almost all spacer lengths. This validates previous reports which state that the sequence requirement of the ACGT-containing ABRE is ACGT-G G/T C. The CACGTG motif, or the G-box, is recognized in rice (Kumar A, 2009), and our results clearly show a predominance of G at +2 and C at -2 positions. This result also corresponds to reports which state that the bZIP class of TFs show enhanced binding to ACGT elements in the presence of the G-box. It can therefore be inferred that a majority of TFs which bind to ACGT elements have stronger interactions if the flanking nucleotides are C and G. While the spacers overall did not show any clear consensus sequence, spacer distance 24 showed a large amount of conservation. The spacer sequence between two ACGT motifs in a cis-element can be crucial for gene regulation. From our frequency analysis, we determined that spacer length 24 also appears to be preferred in Arabidopsis. In light of these observations, it is possible that a spacer of 24 bases might be functionally relevant in Arabidopsis.
Determining the optimum spacer length and preferred spacer sequence could dramatically enhance promoter design techniques. If a particular spacer length is confirmed to be implicated in the regulation of a particular function, e.g. stress response, a cis-element containing the repeated ACGT motif can be incorporated within promoters to give rise to sturdier and more resistant genetically modified crops. Therefore, identifying the mechanism and implications of the conservation of specific spacer lengths and sequences is of prime importance for various genetic engineering techniques. Because we have identified spacer lengths between ACGT elements which can up-regulate gene expression under drought and salt stress, these results suggest improved methods for promoter design and for creating hardy plant varieties.
This is the first study which has attempted to analyze patterns of ACGT repeat cis-elements in four plant genomes. We established that each species exhibits preferences for particular spacer lengths and demonstrated the existence of spacer lengths preferentially avoided in monocots and dicots. This suggests that a class specific mechanism of gene regulation might be present for ACGT (N) ACGT elements. We further identified parallel evolution to be the underlying mechanism for ACGT co-occurrences across species. Moreover, by indicating that genes up-regulated by salt and drought stress are more likely to contain ACGT repeat cis-elements in their promoters, our in-silico results suggest a significant role of these elements in these pathways.
We thank the Biological Sciences Department at Birla Institute of Technology and Science, Pilani for their cooperation. We also thank BITS Pilani, Pilani campus for their support in terms of infrastructure and facilities which were critical for completion of this project. We thank the editor and the reviewers for their valuable inputs.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.