Conservation of noncoding microsatellites in plants: implication for gene regulation

Zhang, Lida; Zuo, Kaijing; Zhang, Fei; Cao, Youfang; Wang, Jiang; Zhang, Yidong; Sun, Xiaofen; Tang, Kexuan

doi:10.1186/1471-2164-7-323

Research article
Open access
Published: 25 December 2006

Lida Zhang¹,
Kaijing Zuo¹,
Fei Zhang¹,
Youfang Cao¹,
Jiang Wang¹,
Yidong Zhang¹,
Xiaofen Sun² &
Kexuan Tang^1,2

BMC Genomics volume 7, Article number: 323 (2006)

Abstract

Background

Microsatellites are extremely common in plant genomes, and in particular, they are significantly enriched in the 5' noncoding regions. Although some 5' noncoding microsatellites involved in gene regulation have been described, the general properties of microsatellites as regulatory elements are still unknown. To address the question of microsatellites associated with regulatory elements, we have analyzed the conserved noncoding microsatellite sequences (CNMSs) in the 5' noncoding regions by inter- and intragenomic phylogenetic footprinting in the Arabidopsis and Brassica genomes.

Results

We identified 247 Arabidopsis-Brassica orthologous and 122 Arabidopsis paralogous CNMSs, representing 491 CT/GA and CTT/GAA repeats, which accounted for 10.6% of these types located in the 500-bp regions upstream of coding sequences in the Arabidopsis genome. Among these identified CNMSs, 18 microsatellites show high conservation in the regulatory regions of both orthologous and paralogous genes, and some of them also appear in the corresponding positions of more distant homologs in Arabidopsis, as well as in other plants. A computational scan of CNMSs for known cis-regulatory elements showed that light responsive elements were clustered in the region of CT/GA repeats, as well as salicylic acid responsive elements in the (CTT)_n/(GAA)_n sequences. Patterns of gene expression revealed that 70–80% of CNMS (CTT)_n/(GAA)_n associated genes were regulated by salicylic acid, which was consistent with the prediction of regulatory elements in silico.

Conclusion

Our analyses showed that some noncoding microsatellites were conserved in plants and appeared to be ancient. These CNMSs served as regulatory elements involved in light and salicylic acid responses. Our findings might have implications in the common features of the over-represented microsatellites for gene regulation in plant-specific pathways.

Background

Microsatellites, one of the major repeat classes, are extremely common in eukaryotic genomes [1]. They are generally thought to result from replication slippage [2]. Unlike animal microsatellites, which originate from repetitive DNA [3], plant microsatellites show a significant association with nonrepetitive DNA [4]. They are abundant within or near genes in plant genomes, and in particular, some types are significantly enriched within the 5' noncoding regions of plant genes [5–7]. For example, in Arabidopsis thaliana, this feature is mostly attributable to the fact that CT/GA and CTT/GAA repeats are more frequently found in 5'-flanks than in other genomic regions, suggesting that they can potentially function as factors in regulating gene expression [7].

For a long time, microsatellites were regarded only as genetic markers for DNA fingerprinting and diversity studies because of their extensive length polymorphisms. However, recent findings show that some of them act as cis-regulatory elements that can be recognized by transcription factors [8, 9]. So-called GAGA elements, which comprise the dinucleotide repeat sequence (GA)_n, are well known to be present in promoters that regulate numerous developmental genes in animals [10, 11]. Similarly, the (GA)_n sequences in regulatory regions of some plant genes can also be recognized by GAGA-binding factors [12–14], and more generally, the GA-rich element, a more complex 9 base pair (bp) based (GA)_n repeat, has been shown to have protein-binding affinity [15]. Another major microsatellite in plants, the trinucleotide repeat sequence (GAA)_n present within the 5'UTR of ntp303, was found to be important in modulating transcription and translation efficiency [16]. Furthermore, some unusual phenotypic variations were found to be associated with the length of 5' noncoding microsatellites. In a typical example, Bao and colleagues reported that variation in the number of CT/GA repeats in the 5'UTR of the waxy gene was correlated with amylose content in rice [17]. Although the mechanism is still unclear, the microsatellite length polymorphism is thought to affect the expression of genes related to amylose synthesis.

Regions of DNA involved in gene regulation are expected to exhibit sequence conservation between related species over evolutionary time due to functional constraints. It has been recognized that comparative analyses of noncoding DNA sequences in multiple species, known as phylogenetic footprinting, can help identify conserved putative regulatory elements [18]. Successful identification of conserved noncoding sequences in comparisons among different grass genomes and cruciferous species, as well as between closely related genomic sequences from Arabidopsis and Brassica species has provided some good references for discovery of Conserved Noncoding Microsatellite Sequences (CNMSs) by phylogenetic footprinting in plants [19–22].

If microsatellites are important for regulating gene expression, they should be conserved in the homologous promoters through gene duplication or speciation during plant evolution. To address the question of microsatellites associated with gene regulatory elements, we used inter- and intragenomic phylogenetic footprinting to analyze the dominant microsatellites in the 5' noncoding regions of Arabidopsis and Brassica oleracea genes for CNMSs. About 10% of 5' noncoding CT/GA and CTT/GAA repeats are conserved in the Arabidopsis genome, and they are preferentially involved in gene regulation in plant-specific pathways.

Results

Distribution of microsatellites in different genomic regions

The occurrence of microsatellites was surveyed across the different genomic regions of the Arabidopsis genome. Microsatellites were highly abundant in the regulatory regions, and the over-representation of CT/GA and CTT/GAA repeats contributed most to the increase of microsatellites in these regions (Figure 1A, 1B). This preferential occurrence of CT/GA and CTT/GAA repeats indicated that they might play a role in regulating genes.

Conservation of microsatellites in Arabidopsis

Regulatory sequence elements within promoter DNA are often short, orientation independent and contain frequent gaps of variable size. Thus, we defined conserved noncoding microsatellite sequences as candidate regulatory elements using the following criterion: the corresponding microsatellites had to overlap by at least 6 bp between the aligned sequences. According to this criterion, we identified 247 Arabidopsis-Brassica orthologous CNMSs and 122 Arabidopsis paralogous CNMSs [see Additional file 1], involving 491 CT/GA and CTT/GAA repeats (Table 1), which accounted for 10.6% of these types located in the 500-bp regions upstream of coding sequences in the Arabidopsis genome. These CNMSs are not randomly distributed across the noncoding regions; they tend to be found more frequently near the initiation codon (Figure 2A, 2B).

To validate the above study and to ensure that the observation of CNMSs was not simply due to the over-representation of microsatellites in plant genomes, a similar analysis was carried out on three different datasets: 1000 homologous pairs of 5' noncoding sequences in Arabidopsis (dataset 1), 1000 randomly shuffled pairs of 5' noncoding sequences (dataset 2) and 1000 random pairs of genomic DNA sequences (dataset 3), as well as the three corresponding datasets of Arabidopsis and Brassica sequence pairs of the same size. Figure 3 shows the frequencies of CNMS (CT/GA)_n and (CTT/GAA)_n in dataset 1, dataset 2 and dataset 3. CNMSs were very unlikely to be found in 500-bp genomic DNA sequence pairs by chance. In contrast with the random pairs of noncoding sequences, the homologous noncoding sequences showed a significantly higher frequency of CNMS occurrences. Taken together, these tests indicated that some microsatellites in regulatory regions were conserved from common ancestors during plant evolution.

Table 1 Summary of Arabidopsis-Brassica and Arabidopsis-Arabidopsis CNMSs

Evolution of conserved microsatellites in Arabidopsis

To gain insight into the evolutionary relationship of Arabidopsis-Brassica and Arabidopsis-Arabidopsis CNMSs, the synonymous substitution rate (Ks) was calculated for the corresponding gene pairs. For Arabidopsis-Brassica orthologous CNMS gene pairs, the frequency distribution showed a clear peak for Ks values of 0.4 to 0.5 (Figure 4A), suggesting that these CNMSs were conserved from a common ancestor over a period of 15 million years (Myr), which was consistent with the divergence time frame estimated at 14.5 to 20.4 Myr based on mitochondrial DNA data [23]. On the other hand, we noticed two peaks in the Ks distribution of Arabidopsis paralogous CNMS gene pairs, at Ks values of 0.8 to 0.9 and 1.2 to 1.3, respectively (Figure 4B). The former group contained most of the paralogous CNMSs, which originated from large-scale gene duplication over 28 Myr ago, consistent with the recent polyploidization event during evolution of the Arabidopsis genome [24]. The latter group was duplicated from the common ancestor over 42 Myr ago, which probably occurred at the time of the divergence of the Brassicaceae family [25].

The evolutionary relationships of Arabidopsis-Brassica and Arabidopsis-Arabidopsis CNMSs suggested that most paralogous CNMSs pre-dated the divergence of the two species; hence, many paralogous CNMSs in Arabidopsis were likely to find their counterparts in Brassica. Further comparisons of paralogous and orthologous genes from Arabidopsis and Brassica were made for common CNMSs (Figure 5A, 5B). With the same criteria, we identified 18 CNMSs found in Arabidopsis paralogous pairs that also coincided with CNMSs from at least one ortholog in Brassica (Table 2). We called these conserved elements, shared among paralogous and orthologous genes, Ultra-CNMSs. An example of such Ultra-CNMSs is shown in Figure 5C; the three homologous CT repeats were highly conserved from a common ancestor over 48 Myr.

Table 2 Ultra-CNMSs in Arabidopsis-Brassica orthologs and Arabidopsis paralogs

Conservation of microsatellites in plants

As expected, analysis of regulatory regions of related gene families revealed that many Ultra-CNMSs were conserved across a number of more distantly homologous genes in Brassicaceae species and other plants. Figure 6A shows that CNMSs (CT)_n were conserved among orthologous genes from Arabidopsis, Brassica, Medicago and rice, as well as among more distantly paralogous genes in Arabidopsis. These genes are representatives of a larger family of transmembrane receptor kinases and related non-transmembrane kinases in plant genomes. Many of them arose from a common ancestor of dicots and monocots. Another striking CNMS was found in the regulatory regions of GATA transcription factor genes from Brassica, Arabidopsis and rice. Of the 14 members of subfamily I in the Arabidopsis genome [26], five have the same CNMSs in their regulatory regions (Figure 6B). These CNMS associated transcription factor genes fell into two subgroups, indicating that they diverged before the split of dicotyledonous and monocotyledonous plants [26]. These Ultra-CNMSs therefore appear to have been passed down from a common ancestor of dicots and monocots under extreme purifying selection for more than 170 Myr [27].

Annotation enrichment and depletion of CNMS associated genes

We tested whether CNMSs in genes were influenced by the function of the proteins they encode. There were 206 Arabidopsis-Brassica and 194 Arabidopsis-Arabidopsis CNMS associated genes with known function in the Arabidopsis genome. We looked for categories of biological process and molecular function defined in the Gene Ontology (GO) database that were significantly enriched or depleted in these genes [28, 29]. These CNMS associated genes showed significant functional enrichment for transcription factor activity (P < 1.8 × 10^-7 for orthologous CNMS genes, and P < 7.7 × 10^-7 for paralogous CNMS genes, against all GO annotated Arabidopsis genes) and transcription (P < 4.5 × 10^-4 for orthologs and P < 6.6 × 10^-5 for paralogs) (Figure 7), and genes that performed the functions of the two types accounted for about 23% of all known genes. However, they were obviously depleted for DNA metabolism (P < 2.2 × 10^-4 for orthologous CNMS genes, and P < 1.8 × 10^-2 for paralogous CNMS genes). These findings suggested that CNMSs might be specifically associated with regulation of transcription at the DNA level, but not involved in DNA metabolism.

CNMSs as regulatory elements in plants

To further investigate the regulatory nature of these CNMSs, we employed a computational method to discover cis-elements that were similar to elements with assigned functions in the PlantCARE [30, 31] and PLACE databases [32, 33]. The identification of cis-elements showed that some binding sites were clustered in the conserved microsatellite regions, and these regulatory elements were involved in plant-specific functions in response to some environmental stimuli (Table 3). The CNMS (CT)_n include the TCTCtCT sequences similar to the TCCC motif known as part of the conserved DNA module array AtpCD-CMA involved in light responsiveness [34, 35]. Another function of CNMS (CT)_n may be as an enhancer, because the same motif (TCTCTCTCT) is found in a 60-nt region downstream of the transcription start site of the CaMV 35S RNA, which can enhance gene translation in plant protoplasts [36]. As sequences complementary to (CT)_n, (GA)_n repeats serve as regulatory elements with similar functions; they contain a series of overlapping GAG motifs (AGAGAGa) involved in light regulation [35, 37]. In soybean, the 18-bp GAGA element sequence within the Gsa1 promoter can be recognized by GBP, which is encoded by a light-regulated gene [12]. The CNMS (CTT)_n contain sequences similar to the TCA-element (TCATCTTCTT), which is a binding site for salicylic acid-inducible proteins [38]. Similarly, the CNMS (GAA)_n contain AGAA sequences that have the characteristics of the core recognition sequence (tcAGAAgagg) for salicylic acid-responsive genes [39].

Table 3 In silico prediction of CNMSs serving as regulatory elements

Although CNMSs (CT)_n/(GA)_n and (CTT)_n/(GAA)_n are similar to known regulatory elements, most of them have no experimental verification of their functions. Therefore, all CNMS (CTT)_n/(GAA)_n associated genes were selected to investigate changes in their expression levels after salicylic acid treatment. The abundance of gene transcripts evaluated by MPSS showed that these CNMS (CTT)_n/(GAA)_n associated genes had distinct expression characteristics under salicylic acid treatment (Figure 8A, 8B). About 70–80% of CNMS (CTT)_n/(GAA)_n associated genes in Arabidopsis leaves were regulated by salicylic acid, while the others were undetectable with or without salicylic acid treatment. Among these salicylic acid-responsive genes, only about 15–23% were up-regulated by salicylic acid, and most were inhibited after the treatment. Seven CNMS (CTT)_n/(GAA)_n associated genes were additionally analyzed for expression patterns after salicylic acid treatment. RT-PCR showed that the investigated genes, except At2g05920 and At5g67360, were clearly down-regulated by salicylic acid (Figure 9), which was consistent with the gene expression patterns from the Arabidopsis MPSS database. From the RT-PCR expression patterns, we found a preliminary correlation between the repeat number of the CTT/GAA motif and the response of the gene to salicylic acid. The (CTT)₄/(GAA)₄ sequences were associated with gene down-regulation upon salicylic acid stimulus, but the (CTT)₅ and (CTT)₇ associated genes were not clearly regulated by salicylic acid. These findings implied that regulation of CNMS associated gene expression by salicylic acid might depend on the number of CTT/GAA repeats.

Discussion

Microsatellites (CT)_n/(GA)_n and (CTT)_n/(GAA)_n are well represented in the Arabidopsis genome, and in particular, they are preferentially located within the 5' noncoding regions. In this study, we identified 491 conserved CT/GA and CTT/GAA repeats as candidate regulatory elements by inter- and intragenomic phylogenetic footprinting. These CNMSs tend to occur near the initiation codon, with a preference for CT and CTT motifs, which is consistent with the characteristic distribution of pyrimidine-rich repeats in these regions [5, 7]. Another striking feature of CNMS distribution is that they are rarely found in the peri-centromeric regions; in contrast, their related genes are always clustered in chromosome arms (data not shown). The reasons for the absence of CNMSs in peri-centromeric regions are still unclear, but the clustering of CNMS associated genes on chromosome arms is probably attributable to co-expression.

Microsatellites generally evolve rapidly, but about 10% of 5' noncoding CT/GA and CTT/GAA repeats show high conservation and appear to be ancient. In particular, the Ultra-CNMSs have been under purifying selection for more than 42 Myr, and some of them for at least 170 Myr. This conservation may be explained by functional constraint, so that many homologous genes have the corresponding microsatellite sequences in their regulatory regions. Most microsatellites of the CT/GA and CTT/GAA types seem to have originated from recent mutations under positive selection [4, 7], which leads to the significant over-representation of microsatellites in the 5' noncoding regions compared with other genomic fractions. The reasons for positive selection of some repeat occurrences are still unknown. However, they may at least provide opportunities for rapid adaptive changes in these regulatory regions or play specific roles in gene regulation.

It is well known that intergenomic phylogenetic footprinting is an effective method for the discovery of regulatory elements in a set of orthologous noncoding regions from multiple species [18–22]. In plant genomes, intragenomic phylogenetic footprinting represents another powerful strategy to detect regulatory elements, because most plant genomes are rich in duplicated genes and large fractions of these gene pairs share transcriptional characteristics [40]. Although detection of the full complement of cis-elements is not feasible by this approach due to potential acquisition and loss of individual regulatory elements between duplicated promoters, we can readily identify several specific regulatory elements which show high conservation in duplicated genes. Using this approach, we have successfully identified 122 Arabidopsis CNMSs as candidate regulatory elements of plant-specific function. Most paralogous CNMSs originated from the recent polyploidization event before the divergence between Arabidopsis and Brassica [23], implying that they might be conserved with their counterparts in Brassica. We compared the data generated by inter- and intragenomic phylogenetic footprinting and found 18 CNMSs highly conserved in both orthologous and paralogous sequences. The number of identified Ultra-CNMSs may be underestimated because of the incomplete reference genome sequences of Brassica or false orthologous relationships. The occurrence of these conserved microsatellites among three or more homologous genes provides stronger evidence that these CNMSs are likely to be significant in gene regulation.

Functional annotation showed that CNMS associated genes were clearly depleted for DNA metabolism, such as DNA replication, DNA recombination and DNA repair. It is possible that genes that are essential for survival lack CNMSs within their 5' noncoding regions because these genes do not need some specific regulatory elements. In contrast, these CNMS genes are preferentially associated with regulation of transcription in plants. CNMSs serve as regulatory elements and their related genes can be responsive to one or more forms of environmental stimuli (Table 3). The functional biases imply that CNMS associated genes (e.g. transmembrane receptor kinase genes and transcription factor genes) encode proteins involved in upstream pathways of defense responses in plants.

Although GAGA elements are known to be involved in the regulation of numerous developmental genes in animals [10, 11], we believe that CNMSs (CT)_n/(GA)_n are likely to be associated with transcriptional regulation in light signaling pathways in plants [35]. These CNMSs are often found in a number of different light-regulated genes [12, 41]. Although expression of most CNMS (CT)_n/(GA)_n associated genes was not significantly changed with light/dark transitions, three Ultra-CNMS related genes (At5g52430, At1g21920 and At3g62650) were clearly induced with longer periods of darkness according to microarray gene expression data for 7800 unique Arabidopsis genes [42]. This was consistent with the fact that about 9% of these CNMS genes were significantly down-regulated, while only 2% of them were up-regulated, by light in a whole-genome expression analysis of Arabidopsis seedlings [43]. It is possible that the expression level changes of most CNMS (CT)_n/(GA)_n associated genes are not obvious under light since they always act upstream in the related pathways. However, at least some CNMSs (CT)_n/(GA)_n may be binding sites for trans-acting regulators involved in light signaling pathways, and their associated genes can be induced under darkness.

Salicylic acid is well known as an important signaling molecule involved in both locally and systemically induced disease resistance responses [44]. Many salicylic acid responsive genes have been found in plant defense pathways. The CNMS (CTT)_n/(GAA)_n associated genes exhibit distinct expression characteristics under salicylic acid treatment, implying that they may be associated with a range of different stresses [38]. The regulation of gene expression by CNMSs (CTT)_n/(GAA)_n in salicylic acid signaling pathways is associated with the repeat number. They may not act as isolated transcription factor binding sites to regulate gene expression. Instead, they are likely to co-operate with other elements to perform complex regulatory functions in transcription. Some of them may perhaps play roles in RNA interference by forming RNA duplexes with complementary antisense microsatellite sequences, which could explain why the transcripts of quite a few CNMS genes are undetectable in Arabidopsis leaves.

Conclusion

Microsatellites (CT)_n/(GA)_n and (CTT)_n/(GAA)_n are preferentially associated with the 5' noncoding regions in the Arabidopsis genome. Some of them are conserved among homologous genes and appear to be ancient. The computational prediction and gene expression analysis indicated that CNMSs (CT)_n/(GA)_n and (CTT)_n/(GAA)_n acted as regulatory elements involved in light and salicylic acid responses. From our analysis, the presence of CT/GA and CTT/GAA repeats in regulatory regions may be particularly useful as a guide for further experiments on plant regulatory networks in response to environmental stimuli.

Methods

Plant materials

The Arabidopsis plants were grown in soil in a growth chamber at 20°C with 8 hours of light for 40 days. Plants were sprayed to run-off with 1 mM salicylic acid in 0.5% dimethyl sulfoxide (DMSO) for different lengths of time. At one, four, twelve and forty-eight hours post-treatment, leaves were cut and harvested, quick-frozen in liquid nitrogen, then stored at -80°C. Total RNA was later extracted using Plant RNA Mini Kit (Watson Biotechnologies INC., China).

Sequence data sources

The annotated sequences of the five chromosomes of Arabidopsis (accession numbers: NC_003070, NC_003071, NC_003074, NC_003075, and NC_003076, updated 25-JAN-2005) were downloaded from the Genomes Division of GenBank [45, 46]. Intergenic regions were defined as being a part of DNA from the end of the last exon of one gene to the beginning of the first exon of the following gene. A set of 16223 full-length cDNA sequences containing both 5' and 3'UTRs for Arabidopsis was extracted from the TAIR database [47]. The preliminary sequences of Brassica genome were obtained from The Institute for Genomic Research website [48].

Identification of orthologous and paralogous gene pairs

To identify putative Arabidopsis-Brassica orthologous gene sets, each preliminary sequence from Brassica was searched against 1-kb sequences (fragments from position -500 to +500 relative to the translation initiation site) of all genes from Arabidopsis using BLASTN [49], and the fragments from Brassica were then clustered according to their best-matching gene in the Arabidopsis genome. Conversely, each 1-kb gene sequence from Arabidopsis was searched against the contigs from Brassica. Two sequences were defined as orthologs if each of them was the best hit of the other in the aligned regions and if the expect value (E) was <1e-10. A list of the identified Arabidopsis-Brassica orthologs in the study is provided as supplementary data [see Additional file 2].

To identify paralogous gene pairs derived from a recent common ancestor in the Arabidopsis genome, each annotated coding sequence was searched against all other coding sequences using BLASTN. A pair was considered significant if each member was the best hit of the other and the expect value was <1e-10. A list of the paralogous gene pairs is included as supplementary data [see Additional file 3]. To avoid the negative conservation of microsatellites caused by the effects of insufficient randomizing mutations, the tandemly repeated gene pairs separated by fewer than 25 intermediate genes were ignored in further analysis.
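
The reciprocal-best-hit rule described above for both orthologs and paralogs can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the tuple layout of a hit and the use of bit score to rank hits are assumptions made for illustration; only the reciprocity requirement and the E < 1e-10 cutoff come from the text.

```python
from typing import Dict, Iterable, List, Tuple

# One BLAST hit as (query, subject, e_value, bit_score); this layout is hypothetical.
Hit = Tuple[str, str, float, float]


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-10) -> Dict[str, str]:
    """Map each query to its highest-scoring subject among hits with E < max_evalue."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue, score in hits:
        if evalue >= max_evalue:
            continue  # the text requires E < 1e-10
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: Iterable[Hit], reverse: Iterable[Hit],
                         max_evalue: float = 1e-10) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    fwd = best_hits(forward, max_evalue)
    rev = best_hits(reverse, max_evalue)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)
```

For an all-against-all search within one genome, as in the paralog case, self hits would have to be removed before calling `best_hits`, and the same hit list would be passed as both `forward` and `reverse`.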

Microsatellite detection

Microsatellites were detected using a modified Sputnik repeat-finder [50]. Di- and trinucleotide repeats were identified when they had a total size of at least 12 bp, allowing up to about 10% deviation from a perfect repeat. Repeat motifs in different frames (e.g. GAA, AGA and AAG) were regarded as the same type of repeat.
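
To make the detection rule concrete, the following Python sketch finds di- and trinucleotide repeats of at least 12 bp and gives all frames of a motif one label. It is a simplified stand-in, not Sputnik: it only reports perfect repeats (the roughly 10% deviation allowed in the text is not modelled), and the function names and the choice of the alphabetically smallest rotation as the label are assumptions for illustration.

```python
from typing import List, Tuple


def canonical_motif(motif: str) -> str:
    """Return the alphabetically smallest rotation, so GAA, AGA and AAG share one label."""
    motif = motif.upper()
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def find_perfect_repeats(seq: str, min_len: int = 12) -> List[Tuple[int, int, str]]:
    """Return (start, end, motif) for perfect di-/trinucleotide tracts of >= min_len bp.

    Coordinates are 0-based and half-open. Homopolymers are skipped.
    """
    seq = seq.upper()
    hits = []
    for k in (2, 3):  # repeat unit length
        i = 0
        while i + k <= len(seq):
            unit = seq[i:i + k]
            j = i + k
            # Extend base by base while the sequence keeps period k.
            while j < len(seq) and seq[j] == seq[j - k]:
                j += 1
            is_dna = set(unit) <= set("ACGT")
            if j - i >= min_len and len(set(unit)) > 1 and is_dna:
                hits.append((i, j, canonical_motif(unit)))
                i = j  # continue after the tract
            else:
                i += 1
    return sorted(hits)


if __name__ == "__main__":
    example = "AAGG" + "CT" * 6 + "GGAT" + "AGA" * 4 + "CC"
    for start, end, motif in find_perfect_repeats(example):
        print(start, end, motif)
```

The sketch labels CT and GA tracts separately; collapsing complementary strands into one class (CT/GA, CTT/GAA) would need an extra reverse-complement step.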

Identification of CNMSs

Because gene fragments of Brassica were derived from preliminary contigs with no annotated open reading frames, each pair of Arabidopsis-Brassica sequences was aligned using DiAlign2 with the translation option to identify the 5' noncoding sequences and coding regions in the Brassica orthologs [51]. The 5' noncoding sequence pairs were then aligned using DiAlign2 to find conserved microsatellites. To exclude nonspecific alignments, a stringent threshold parameter of 3 was used. CNMSs were identified when the corresponding loci had at least 6 bp of overlapping sequence between the aligned microsatellite sequences.
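
One way to read the 6-bp overlap criterion is sketched below in Python: microsatellite positions in each sequence are mapped to alignment columns, and the columns shared by the two microsatellites are counted. This is an interpretation for illustration only. The representation of the alignment as two gapped strings, the function names and the column-counting rule are assumptions, not the DiAlign2 output format or the authors' code, and checking that the two microsatellites are of the same motif type is left to the caller.

```python
from typing import List, Tuple

# Half-open interval [start, end) in ungapped sequence coordinates.
Interval = Tuple[int, int]


def column_map(gapped: str) -> List[int]:
    """Alignment column of each residue; the list index is the ungapped position."""
    return [col for col, ch in enumerate(gapped) if ch != "-"]


def conserved_overlap(aln_a: str, aln_b: str, ssr_a: Interval, ssr_b: Interval) -> int:
    """Count alignment columns occupied by a microsatellite residue in both sequences."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    cols_a = set(column_map(aln_a)[ssr_a[0]:ssr_a[1]])
    cols_b = set(column_map(aln_b)[ssr_b[0]:ssr_b[1]])
    return len(cols_a & cols_b)


def is_cnms(aln_a: str, aln_b: str, ssr_a: Interval, ssr_b: Interval,
            min_overlap: int = 6) -> bool:
    """Apply the 'at least 6 bp overlapping' criterion from the text."""
    return conserved_overlap(aln_a, aln_b, ssr_a, ssr_b) >= min_overlap


if __name__ == "__main__":
    # Toy pairwise alignment; both sequences carry a 12-bp CT/TC tract.
    aln_a = "AA" + "CT" * 6 + "--" + "GG"
    aln_b = "A" + "---" + "TC" * 6 + "GG"
    print(conserved_overlap(aln_a, aln_b, (2, 14), (1, 13)))
```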

Selection of random data sets

To ensure that CNMSs did not occur by chance, we used two different datasets of random pairs as negative controls to validate the results. One control dataset contained 1000 random pairs of 500-bp upstream noncoding sequences in the Arabidopsis genome, and the other control dataset of 1000 pairs was randomly generated from 500-bp Arabidopsis genomic DNA fragments. The reference dataset of equal size was randomly selected from the 500-bp paralogous noncoding sequence pairs in Arabidopsis. The 1000 paralogous pairs, the 1000 shuffled pairs of noncoding sequences and the 1000 random pairs of genomic sequences are referred to as dataset 1, dataset 2 and dataset 3, respectively, in further analysis. Similarly, three corresponding datasets of Arabidopsis and Brassica sequence pairs were generated with the same data size: dataset 1 consisted of 1000 Arabidopsis-Brassica orthologous pairs of 5' noncoding sequences, dataset 2 contained 1000 random pairs of Arabidopsis and Brassica upstream noncoding sequences, and dataset 3 of 1000 pairs was randomly generated from the Arabidopsis and Brassica genomic DNA sequences. The same criteria for CNMS detection were applied in the test. Occurrences of CNMSs were analyzed in an analogous manner for 10 different random sets of equal size.

Estimation of duplication and speciation time

We used the level of synonymous substitution of CNMS associated coding sequences to estimate the Ks of CNMSs. For each pair of CNMS associated genes, the two protein sequences were aligned by ClustalW, and the resulting alignment was then used as a guide to align the nucleotide sequences [52]. After removing gaps, the level of synonymous substitution was estimated using the yn00 program in PAML [23]. The time of divergence (T) between two sequences was calculated from this as T = Ks/(2λ), where Ks is the number of synonymous substitutions per synonymous site and λ is the mean rate of synonymous substitution. The estimated value of λ in dicots is 1.5 synonymous substitutions per synonymous site per 10⁸ years [53].
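
As a worked check of this formula, the short Python sketch below converts Ks values into divergence times with λ = 1.5 × 10⁻⁸ substitutions per synonymous site per year. The function name is hypothetical, and the three Ks values are simply the midpoints of the peak ranges reported in the Results (0.4 to 0.5, 0.8 to 0.9 and 1.2 to 1.3), chosen here for illustration.

```python
def divergence_time_myr(ks: float, rate_per_site_per_year: float = 1.5e-8) -> float:
    """Divergence time in millions of years from T = Ks / (2 * lambda)."""
    if ks < 0:
        raise ValueError("Ks must be non-negative")
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    years = ks / (2.0 * rate_per_site_per_year)
    return years / 1e6


if __name__ == "__main__":
    # Midpoints of the Ks peaks (an assumption for illustration).
    for ks in (0.45, 0.85, 1.25):
        print(f"Ks = {ks:.2f} -> {divergence_time_myr(ks):.1f} Myr")
```

Under these assumptions the midpoints give roughly 15, 28 and 42 Myr, in line with the ages quoted in the Results.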

Estimation of gene expression level

Gene expression level was estimated using the data from the Massively Parallel Signature Sequencing (MPSS) database of Arabidopsis [54, 55]. The MPSS data of three different libraries was generated from untreated leaves and treated leaves 4 and 52 hours after salicylic acid treatment, respectively. For the three libraries, a total of 9,081,200 17-bp signatures were obtained in multiple sequencing runs and in two sequencing frames. The abundance for each signature was normalized to transcripts per million (TPM) to facilitate comparisons across libraries.
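
The normalization to transcripts per million can be sketched in a few lines of Python. This is only an illustration of the scaling step: the function name and the signature labels are hypothetical, and the MPSS database may apply additional filtering that is not described here.

```python
from typing import Dict


def to_tpm(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw signature counts from one library so that they sum to one million."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("library has no signatures")
    return {signature: n * 1_000_000 / total for signature, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical raw counts for three signatures in one library.
    library = {"sigA": 30, "sigB": 70, "sigC": 900}
    for signature, tpm in to_tpm(library).items():
        print(signature, tpm)
```

Because every library is rescaled to the same total, the abundance of one signature can be compared between untreated and salicylic acid-treated libraries even when the sequencing depths differ.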

RT-PCR of Arabidopsis CNMS associated genes was conducted using the one-step RNA PCR kit (TaKaRa) with gene specific primers [see Additional file 4]. Total RNA (0.5 μg) was used as the template and amplified with the following program: an initial 50°C for 30 min and 94°C for 2 min, followed by 25 cycles of 94°C for 30 s, 54°C for 30 s and 72°C for 1 min. The house-keeping gene actin 2 (At3g18780) was used as an internal control in the RT-PCR reactions.

References

Toth G, Gaspari Z, Jurka J: Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 2000, 10: 967-981. 10.1101/gr.10.7.967.
Levinson G, Gutman GA: Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol. 1987, 4: 203-221.
Nadir E, Margalit H, Gallily T, Ben-Sasson SA: Microsatellite spreading in the human genome: evolutionary mechanisms and structural implications. Proc Natl Acad Sci USA. 1996, 93: 6470-6475. 10.1073/pnas.93.13.6470.
Morgante M, Hanafey M, Powell W: Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat Genet. 2002, 30: 194-200. 10.1038/ng822.
Fujimori S, Washio T, Higo K, Ohtomo Y, Murakami K, Matsubara K, Kawai J, Carninci P, Hayashizaki Y, Kikuchi S, Tomita M: A novel feature of microsatellites in plants: a distribution gradient along the direction of transcription. FEBS Lett. 2003, 554: 17-22. 10.1016/S0014-5793(03)01041-X.
Lawson MJ, Zhang L: Distinct patterns of SSR distribution in the Arabidopsis thaliana and rice genomes. Genome Biol. 2006, 7: R14. 10.1186/gb-2006-7-2-r14.
Zhang LD, Yuan DJ, Yu SW, Li ZG, Cao YF, Miao ZQ, Qian HM, Tang KX: Preference of simple sequence repeats in coding and non-coding regions of Arabidopsis thaliana. Bioinformatics. 2004, 20: 1081-1086. 10.1093/bioinformatics/bth043.
Iglesias AR, Kindlund E, Tammi M, Wadelius C: Some microsatellites may act as novel polymorphic cis-regulatory elements through transcription factor binding. Gene. 2004, 341: 149-165. 10.1016/j.gene.2004.06.035.
Martin P, Makepeace K, Hill SA, Hood DW, Moxon ER: Microsatellite instability regulates transcription factor binding and gene expression. Proc Natl Acad Sci USA. 2005, 102: 3800-3804. 10.1073/pnas.0406805102.
Bevilacqua A, Fiorenza MT, Mangia F: A developmentally regulated GAGA box-binding factor and Sp1 are required for transcription of the hsp70.1 gene at the onset of mouse zygotic genome activation. Development. 2000, 127: 1541-1551.
Busturia A, Lloyd A, Bejarano F, Zavortink M, Xin H, Sakonju S: The MCP silencer of the Drosophila Abd-B gene requires both pleiohomeotic and GAGA factor for the maintenance of repression. Development. 2001, 128: 2163-2173.
Sangwan I, O'Brian MR: Identification of a soybean protein that interacts with GAGA element dinucleotide repeat DNA. Plant Physiol. 2002, 129: 1788-1794. 10.1104/pp.002618.
Santi L, Wang Y, Stile MR, Berendzen K, Wanke D, Roig C, Pozzi C, Muller K, Muller J, Rohde W, Salamini F: The GA octodinucleotide repeat binding factor BBR participates in the transcriptional regulation of the homeobox gene Bkn3. Plant J. 2003, 34: 813-826. 10.1046/j.1365-313X.2003.01767.x.
Meister RJ, Williams LA, Monfared MM, Gallagher TL, Kraft EA, Nelson CG, Gasser CS: Definition and interactions of a positive regulatory element of the Arabidopsis INNER NO OUTER promoter. Plant J. 2004, 37: 426-438. 10.1046/j.1365-313X.2003.01971.x.
Kooiker M, Airoldi CA, Losa A, Manzotti PS, Finzi L, Kater MM, Colombo L: BASIC PENTACYSTEINE1, a GA binding protein that induces conformational changes in the regulatory region of the homeotic Arabidopsis gene SEEDSTICK. Plant Cell. 2005, 17: 722-729. 10.1105/tpc.104.030130.
Hulzink RJ, de Groot PF, Croes AF, Quaedvlieg W, Twell D, Wullems GJ, Van Herpen MM: The 5'-untranslated region of the ntp303 gene strongly enhances translation during pollen tube growth, but not during pollen maturation. Plant Physiol. 2002, 129: 342-353. 10.1104/pp.001701.
Bao S, Corke H, Sun M: Microsatellites in starch-synthesizing genes in relation to starch physicochemical properties in waxy rice (Oryza sativa L.). Theor Appl Genet. 2002, 105: 898-905. 10.1007/s00122-002-1049-3.
Hardison RC: Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 2000, 16: 369-372. 10.1016/S0168-9525(00)02081-3.
Guo H, Moose SP: Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution. Plant Cell. 2003, 15: 1143-1158. 10.1105/tpc.010181.
Inada DC, Bashir A, Lee C, Thomas BC, Ko C, Goff SA, Freeling M: Conserved noncoding sequences in the grasses. Genome Res. 2003, 13: 2030-2041. 10.1101/gr.1280703.
Hong RL, Hamaguchi L, Busch MA, Weigel D: Regulatory elements of the floral homeotic gene AGAMOUS identified by phylogenetic footprinting and shadowing. Plant Cell. 2003, 15: 1296-1309. 10.1105/tpc.009548.
Colinas J, Birnbaum K, Benfey PN: Using cauliflower to find conserved non-coding regions in Arabidopsis. Plant Physiol. 2002, 129: 451-454. 10.1104/pp.002501.
Yang YW, Lai KN, Tai PY, Li WH: Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J Mol Evol. 1999, 48: 597-604. 10.1007/PL00006502.
Blanc G, Hokamp K, Wolfe KH: A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res. 2003, 13: 137-144. 10.1101/gr.751803.
Koch MA, Haubold B, Mitchell-Olds T: Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Mol Biol Evol. 2000, 17: 1483-1498.
Reyes JC, Muro-Pastor MI, Florencio FJ: The GATA family of transcription factors in Arabidopsis and rice. Plant Physiol. 2004, 134: 1718-1732. 10.1104/pp.103.037788.
Soltis PS, Soltis DE, Chase MW: Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature. 1999, 402: 402-404. 10.1038/46528.
The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional investigation of gene datasets based on Gene Ontology. Genome Biol. 2004, 5: R101. 10.1186/gb-2004-5-12-r101.
Lescot M, Dehais P, Thijs G, Marchal K, Moreau Y, Van de Peer Y, Rouze P, Rombauts S: PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids Res. 2002, 30: 325-327. 10.1093/nar/30.1.325.
PlantCARE database. [http://bioinformatics.psb.ugent.be/webtools/plantcare/html/]
Higo K, Ugawa Y, Iwamoto M, Korenaga T: Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Res. 1999, 27: 297-300. 10.1093/nar/27.1.297.
PLACE database. [http://www.dna.affrc.go.jp/PLACE/signalscan.html]
Bolle C, Kusnetsov VV, Herrmann RG, Oelmuller R: The spinach AtpC and AtpD genes contain elements for light-regulated, plastid-dependent and organ-specific expression in the vicinity of the transcription start sites. Plant J. 1996, 9: 21-30. 10.1046/j.1365-313X.1996.09010021.x.
Arguello-Astorga GR, Herrera-Estrella LR: Ancestral multipartite units in light-responsive plant promoters have structural features correlating with specific phototransduction pathways. Plant Physiol. 1996, 112: 1151-1166. 10.1104/pp.112.3.1151.
Pauli S, Rothnie HM, Chen G, He X, Hohn T: The cauliflower mosaic virus 35S promoter extends into the transcribed region. J Virol. 2004, 78: 12120-12128. 10.1128/JVI.78.22.12120-12128.2004.
Orozco BM, Ogren WL: Localization of light-inducible and tissue-specific regions of the spinach ribulose bisphosphate carboxylase/oxygenase (rubisco) activase promoter in transgenic tobacco plants. Plant Mol Biol. 1993, 23: 1129-1138. 10.1007/BF00042347.
Goldsbrough AP, Albrecht H, Stratford R: Salicylic acid-inducible binding of a tobacco nuclear protein to a 10 bp sequence which is highly conserved amongst stress-inducible genes. Plant J. 1993, 3: 563-571. 10.1046/j.1365-313X.1993.03040563.x.
Pastuglia M, Roby D, Dumas C, Cock JM: Rapid induction by wounding and bacterial infection of an S gene family receptor-like kinase gene in Brassica oleracea. Plant Cell. 1997, 9: 49-60. 10.1105/tpc.9.1.49.
Haberer G, Hindemitt T, Meyers BC, Mayer KF: Transcriptional similarities, dissimilarities, and conservation of cis-elements in duplicated genes of Arabidopsis. Plant Physiol. 2004, 136: 3009-3022. 10.1104/pp.104.046466.
Teakle GR, Manfield IW, Graham JF, Gilmartin PM: Arabidopsis thaliana GATA factors: organisation, expression and DNA-binding characteristics. Plant Mol Biol. 2002, 50: 43-57. 10.1023/A:1016062325584.
Schaffer R, Landgraf J, Accerbi M, Simon V, Larson M, Wisman E: Microarray analysis of diurnal and circadian-regulated genes in Arabidopsis. Plant Cell. 2001, 13: 113-123. 10.1105/tpc.13.1.113.
Ma L, Sun N, Liu X, Jiao Y, Zhao H, Deng XW: Organ-specific expression of Arabidopsis genome during development. Plant Physiol. 2005, 138: 80-91. 10.1104/pp.104.054783.
Ryals JA, Neuenschwander UH, Willits MG, Molina A, Steiner HY, Hunt MD: Systemic Acquired Resistance. Plant Cell. 1996, 8: 1809-1819. 10.1105/tpc.8.10.1809.
The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.
Arabidopsis genome in GenBank. [ftp://ftp.ncbi.nih.gov/genomes/Arabidopsis_thaliana]
TAIR database. [ftp://ftp.arabidopsis.org/home/tair/home/tair/Sequences/]
TIGR website. [ftp://ftp.tigr.org/pub/data/b_oleracea/wgs_seq/]
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
The modified sputnik repeat-finder. [http://capb.dbi.udel.edu/main/tools.htm]
Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics. 1999, 15: 211-218. 10.1093/bioinformatics/15.3.211.
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.
Koch M, Haubold B, Mitchell-Olds T: Molecular systematics of the Brassicaceae: Evidence from coding plastidic matK and nuclear Chs sequences. Am J Bot. 2001, 88: 534-544.
Meyers BC, Lee DK, Vu TH, Tej SS, Edberg SB, Matvienko M, Tindell LD: Arabidopsis MPSS: an online resource for quantitative expression analysis. Plant Physiol. 2004, 135: 801-813. 10.1104/pp.104.039495.
Arabidopsis MPSS database. [http://mpss.udel.edu/at]

Acknowledgements

This research was supported by the National Sciences Foundation of China (No. 30600348) and the China '973' project. Preliminary sequence data were obtained from The Institute for Genomic Research website. Sequencing of Brassica oleracea was funded by the National Science Foundation.

Author information

Authors and Affiliations

Plant Biotechnology Research Center, Fudan-SJTU-Nottingham Plant Biotechnology R&D Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 200030, China
Lida Zhang, Kaijing Zuo, Fei Zhang, Youfang Cao, Jiang Wang, Yidong Zhang & Kexuan Tang
State Key Laboratory of Genetic Engineering, Fudan-SJTU-Nottingham Plant Biotechnology R&D Center, School of Life Sciences, Morgan-Tan International Center for Life Sciences, Fudan University, Shanghai, 200433, China
Xiaofen Sun & Kexuan Tang

Corresponding author

Correspondence to Kexuan Tang.

Additional information

Authors' contributions

ZL designed and conducted the study on microsatellite detection, identification of CNMSs, comparative genome analysis and evolution, and drafted the manuscript. ZK and ZF provided the plant materials and participated in gene expression analysis. CY performed database searches. WJ and ZY participated in data analysis and manuscript revision. SX and TK participated in research design and in the drafting of the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: The full list of pairs of conserved noncoding microsatellites identified in this study. (XLS 91 KB)

Additional file 2: List of the orthologs between Arabidopsis and Brassica. (TXT 727 KB)

Additional file 3: List of the paralogous gene pairs in the Arabidopsis genome. (TXT 115 KB)

Additional file 4: The primer sequences used for RT-PCR. (XLS 16 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Zhang, L., Zuo, K., Zhang, F. et al. Conservation of noncoding microsatellites in plants: implication for gene regulation. BMC Genomics 7, 323 (2006). https://doi.org/10.1186/1471-2164-7-323

Received: 21 August 2006
Accepted: 25 December 2006
Published: 25 December 2006
DOI: https://doi.org/10.1186/1471-2164-7-323