Profiling ascidian promoters as the primordial type of vertebrate promoter
© Okamura et al; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
CpG islands are observed in mammals and other vertebrates, generally escape DNA methylation, and tend to occur in the promoters of widely expressed genes. Another class of promoter has lower G+C and CpG contents, and is thought to be involved in the spatiotemporal regulation of gene expression. Non-vertebrate deuterostomes are reported to have a single class of promoter with high-frequency CpG dinucleotides, suggesting that this is the original type of promoter. However, the limited annotation of these genes has impeded the large-scale analysis of their promoters.
To determine the origins of the two classes of vertebrate promoters, we chose Ciona intestinalis, an invertebrate that is evolutionarily close to the vertebrates, and identified its transcription start sites genome-wide using a next-generation sequencer. We indeed observed a high CpG content around the transcription start sites, but their levels in the promoters and background sequences differed much less than in mammals. The CpG-rich stretches were also fairly restricted, so they appeared more similar to mammalian CpG-poor promoters.
From these data, we infer that CpG islands are not sufficiently ancient to be found in invertebrates. They probably appeared early in vertebrate evolution via some active mechanism and have since been maintained as part of vertebrate promoters.
Among the 16 DNA dinucleotides, the CpG dinucleotide is unique in terms of its frequency in genomic sequences. This most probably results from the DNA methylation system, because the DNMT1 and DNMT3 families of the deuterostomes, such as echinoderms and chordates, predominantly target the 5 position of cytosine residues only in the CpG dinucleotide. Because the deamination of 5-methylcytosine yields thymine, a normal base that is not efficiently recognized by the DNA repair mechanisms, CpG is rapidly mutated to TpG or to its complementary dinucleotide CpA. Therefore, deuterostome organisms, except for Oikopleura dioica, display a globally reduced frequency of the CpG dinucleotide compared with the frequency expected from the actual numbers of guanine and cytosine residues [4, 5]. Interestingly, they also display skewed distributions of the CpG dinucleotide across their genomes, so that their genomes contain CpG-poor and CpG-rich domains [6, 7]. In amphibians, avians, and mammals, the CpG-rich domains are much shorter than the CpG-poor domains and are generally known as CpG islands.
CpG islands are good markers of some classes of genes because they are often linked to the promoters of those genes. In most cases, CpG islands escape DNA methylation (which generally suppresses gene expression) in almost every tissue and function as part of the gene promoter. Hence, CpG islands tend to be related to ubiquitously or broadly expressed genes, whereas promoters that lack a CpG island are involved in the spatiotemporal regulation of the genes. It is important to note that mammalian promoters can thus be divided into two distinct classes, not only structurally but also functionally. In the human genome, CpG-rich promoters, or CpG island promoters, are dominant, occurring more than twice as often as CpG-poor promoters [13, 14].
As anticipated for a vertebrate taxon, CpG island promoters were indeed experimentally identified in fish by an analysis of transcription start sites (TSSs). The presence of two classes of promoters in fish, amphibians, reptiles, avians, and mammals has since been confirmed in silico. In that study, the authors analysed the distributions of the normalized CpG contents (the ratio of the observed CpG number to the expected CpG number, called the "CpG score" hereafter) of the promoter sequences in six vertebrate genomes and showed bimodal distributions for all of them. Furthermore, the structural bimodality was shown to correspond to functionally distinct classes of genes. The authors also analysed the promoters of three invertebrates, one sea urchin and two ascidian (sea squirt) species, and found unimodal distributions of high CpG scores, unlike the distributions observed in the vertebrate promoters. This led them to propose that the vertebrate promoter classes differentiated at an early stage of vertebrate evolution, with global DNA methylation and subsequent deamination. This is basically consistent with the formerly accepted evolutionary hypothesis of CpG islands [17, 18].
If this hypothesis is true, do the non-vertebrate deuterostomes (e.g. echinoderms, lancelets, and ascidians) have CpG islands in their genomes? Currently, the presence of CpG islands in invertebrate animals is unclear. It is possible to apply any criteria that define a CpG island to their genomic sequences and identify some islands. Nevertheless, we were interested in determining whether there are CpG island-like sequences in invertebrate genomes that are associated with transcription initiation, and how and when these sequences appeared during evolution.
To address this issue, we identified the TSSs of Ciona intestinalis by a combination of the oligo-capping method and massive-scale cDNA sequencing (RNA-seq, specifically TSS-seq). The widely used model organism C. intestinalis is an ascidian tunicate, which, although an invertebrate, is most closely related to the vertebrates. Although the ascidian evolved from the last common ancestor of the ascidians and vertebrates, it can be presumed to retain many more features of the ancestral organism than do extant vertebrates. It is well known that the enrichment of CpG dinucleotides in CpG island promoters is maximal at TSSs [12, 13], so TSSs constitute candidate regions in which CpG island promoters or CpG island-like sequences might occur in the invertebrate genome. Incidentally, this approach of targeting TSSs also circumvents the confusion arising from CpG-rich sequences that are unrelated to transcription initiation. In the computational study mentioned above, promoter regions were defined using the RefSeq database, which is a curated collection of publicly available nucleotide sequences. It is likely that many of the cDNA entries are truncated or incomplete at the 5’ end, which makes the definition of their promoter regions unreliable. More importantly, the TSSs of approximately half of all ascidian genes can hardly be determined because of mRNA 5’-leader trans-splicing [22–24]. The 5’ ends of those primary transcripts, termed outrons, are discarded via the trans-splicing reaction. This is easily exemplified by downstream operonic genes, which are resolved from their primary transcripts by trans-splicing. Although it is almost impossible to determine the TSSs of such genes, it is essential to distinguish them from non-trans-spliced genes and to determine the 5’-most position of their processed transcripts.
Analyzing these data, we determined the structural features of the ascidian promoters and compared them with human promoters to identify and characterize their similarities and differences. To extend our understanding of gene regulation in higher eukaryotes, we undertook to clarify the origin of CpG islands and the two classes of vertebrate promoters.
In this study, we chose C. intestinalis embryos at the mid-tailbud stage (Additional file 1: Figure S1) for the genome-wide identification of TSSs. Because whole embryos still retaining the notochord contain a wide range of cell types, we expected to cover a large proportion of ascidian promoters. Total RNA was extracted from embryos and was subjected to oligo capping, in which the 5’ cap of the mRNA was replaced with a synthetic RNA oligonucleotide (see Methods). After cDNA synthesis and subsequent PCR, we undertook massively parallel sequencing using the Illumina Genome Analyzer. We obtained two data sets containing reads of different lengths, 36 nt and 48 nt. Because we read the sequences from the 3’ end of the RNA oligonucleotide, all the sequences obtained should start with GG at their 5’ ends (see Methods). We recovered only the reads that started with GG and then trimmed the GG from them. Although this shortened each read by two nucleotides, the protocol eliminated dubious sequences that did not start with the dinucleotide. We also eliminated sequences containing undetermined nucleotides (anything other than T, C, A, and G), yielding 4,247,902 reads of 34 nt and 4,770,608 reads of 46 nt. To detect the spliced leader (SL) of C. intestinalis, we considered, in addition to the canonical 16-nt sequence, all similar sequences, allowing a 1-nt mismatch or indel, and some previously reported variants. The 34-nt data set consisted of 1,849,849 non-trans-spliced and 2,398,053 trans-spliced reads, and the 46-nt data set consisted of 2,052,230 non-trans-spliced and 2,718,378 trans-spliced reads. Even if some SL-related 5’ mRNA sequences escaped detection in this process, it is unlikely that such reads would map to the genome in the following step. Mapping or alignment to the KH assembly was performed as described in the Methods. Sequences that mapped to more than one locus (multiple hits) were not considered further.
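The read-filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Perl implementation: the function names are hypothetical, the 16-nt leader sequence used in the example is an assumption that is not given in this text, and the previously reported SL variants are not handled.

```python
def clean_read(read):
    """Return the read with its leading GG removed, or None if it is rejected."""
    read = read.upper()
    if not read.startswith("GG"):
        return None                      # read does not begin at the oligo junction
    if set(read) - set("ACGT"):
        return None                      # contains an undetermined nucleotide
    return read[2:]


def edit_distance(a, b):
    """Plain Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def starts_with_leader(read, leader, max_edits=1):
    """True if the read begins with the leader, allowing one mismatch or indel."""
    for n in (len(leader) - 1, len(leader), len(leader) + 1):
        if len(read) >= n and edit_distance(read[:n], leader) <= max_edits:
            return True
    return False


# Assumed leader sequence, for illustration only (not stated in this text).
LEADER = "ATTCTATTTGAATAAG"

trimmed = clean_read("GG" + LEADER + "CCCC")
print(trimmed is not None and starts_with_leader(trimmed, LEADER))
```

Reads for which `clean_read` returns a sequence would then be split into trans-spliced and non-trans-spliced sets according to `starts_with_leader`.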
For the 34-nt data set, 1,017,283 non-trans-spliced and 1,932,570 trans-spliced reads were mapped; for the 46-nt data set, 939,092 non-trans-spliced and 1,237,720 trans-spliced reads were mapped. Because the original 5’-segment of a pre-mRNA is discarded during the trans-splicing reaction, mature trans-spliced mRNAs do not contain the initial segment of the primary transcript and therefore lack the information required to precisely identify the TSS. We therefore decided to examine mainly the non-trans-spliced reads to provide valid data for the promoter analyses presented here. The genomic positions to which the 5’ ends of the reads were aligned were defined as TSSs. The read counts were converted to values in parts per million (ppm) for transcript abundance estimation and normalization, and the short-read and long-read data sets were merged. The TSSs, which are generally scattered around a promoter region, were then clustered into 100-bp bins to define each promoter. In other words, two reads located more than 100 bp apart without any other reads between them were considered to be regulated by two separate promoters. In this clustering process, TSSs represented by reads occurring at less than 0.5 ppm were not considered. However, once promoters were defined, all the TSSs in the bins were counted to estimate the abundance of transcripts from each cluster. Because we can assume that every cell contains approximately one million mRNA molecules, we can consider the values in ppm as copy numbers of the transcripts in a cell. We set a threshold of 1.0 ppm to exclude transcriptional noise. As a result, we obtained 6312 and 8753 promoters for non-trans-spliced and trans-spliced genes, respectively, that could be considered active in the tailbud embryos. The most frequent TSS in each promoter (and if there were several, the most upstream one) was selected as its representative TSS.
If the corresponding genes were found in the KH gene model, the gene names were also tabulated (Additional file 2: Tables S1 and S2). Note that one gene can have several alternative promoters.
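The promoter-definition procedure described above (ppm normalization, clustering of TSSs separated by no more than 100 bp, the 0.5 ppm and 1.0 ppm thresholds, and the choice of a representative TSS) could be sketched as follows. This is a hedged illustration rather than the authors' script: the function names and data layout are hypothetical, coordinates are assumed to lie on the plus strand of a single chromosome (so "most upstream" means the smallest coordinate), and the clustering is read as simple single-linkage grouping.

```python
def to_ppm(counts):
    """Convert raw read counts per TSS position to parts per million."""
    total = sum(counts.values())
    return {pos: n * 1_000_000 / total for pos, n in counts.items()}


def cluster_tss(tss_ppm, max_gap=100, seed_min=0.5, promoter_min=1.0):
    """Group TSSs into promoters; tss_ppm maps position -> ppm (plus strand)."""
    # Only TSSs at or above seed_min take part in defining cluster boundaries.
    seeds = sorted(p for p, v in tss_ppm.items() if v >= seed_min)
    clusters = []
    for pos in seeds:
        if clusters and pos - clusters[-1][1] <= max_gap:
            clusters[-1][1] = pos            # extend the current cluster
        else:
            clusters.append([pos, pos])      # start a new cluster
    promoters = []
    for start, end in clusters:
        # Once boundaries are fixed, every TSS inside them is counted.
        members = {p: v for p, v in tss_ppm.items() if start <= p <= end}
        total = sum(members.values())
        if total < promoter_min:
            continue                         # treated as transcriptional noise
        top = max(members.values())
        rep = min(p for p, v in members.items() if v == top)  # most upstream tie
        promoters.append({"start": start, "end": end, "ppm": total, "tss": rep})
    return promoters


example = {100: 0.6, 150: 2.0, 160: 0.2, 180: 2.0, 400: 0.7, 900: 3.0}
for promoter in cluster_tss(example):
    print(promoter)
```

In this toy input, the isolated TSS at position 400 seeds a cluster but is dropped because its total abundance stays below 1.0 ppm, whereas the weak TSS at position 160 is still counted towards the first promoter.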
The CpG island promoters seen in vertebrates are believed to have emerged from the deamination of other regions. Therefore, it is plausible that the appearance of the two classes of vertebrate promoters is also a consequence of deamination, following the global DNA methylation that occurred early in vertebrate evolution [16, 30]. Specific sequence motifs that function as transcription factor binding sites might have protected some CpG-rich sequences from methylation and mutation to form CpG island promoters [31–33]. To test this hypothesis, we used a large-scale experimental approach to identify the TSSs of C. intestinalis. On the basis of our TSS information, we then examined the ascidian promoter sequences. The fact that the CpG scores, i.e. the ratios of the observed CpG number to the expected CpG number, tended to be quite high in the vicinity of the ascidian TSSs led us to speculate that CpG island promoters might be present. However, it must be noted that the G+C and CpG contents are low. When we applied the most conventional and conservative CpG island definition to the promoters, only 3.5% (223 out of the 6312 promoters) met the criteria. This is attributable to the fact that the ascidian G+C content, approximately 0.36, is much lower than the G+C criterion of 0.5 (Figure 3B). Even at TSSs, the average ascidian G+C content is approximately 0.4 at most. In contrast, the ascidian CpG score is much higher than the criterion of 0.6 (Figure 3A). If we try to define new criteria for the ascidian genome, the difference in the values for the TSSs and background sequences is much smaller than that observed for the human genome. The unique feature of the non-vertebrate deuterostome genomes, i.e. the presence of comparable amounts of CpG-poor and CpG-rich domains, also hinders us in defining CpG islands in these animals.
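To make the argument above concrete, the following minimal Python sketch applies threshold-based CpG island criteria to a sequence. The G+C threshold of 0.5 and the CpG score threshold of 0.6 are taken from the text; the minimum length of 200 bp, the use of "at least" comparisons, and all function names are assumptions made for illustration.

```python
def gc_content(seq):
    """Fraction of G and C residues in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_score(seq):
    """Observed/expected CpG ratio: CpG * N / (C * G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def meets_island_criteria(seq, min_length=200, min_gc=0.5, min_score=0.6):
    """Check length (assumed 200 bp), G+C content, and CpG score together."""
    return (len(seq) >= min_length
            and gc_content(seq) >= min_gc
            and cpg_score(seq) >= min_score)


# Toy sequences: an AT-rich stretch with a high CpG score, and a GC-rich one.
at_rich = "ATATACGATATATACGATAT" * 15
gc_rich = "CG" * 150

for name, seq in (("AT-rich", at_rich), ("GC-rich", gc_rich)):
    print(name, round(gc_content(seq), 2), round(cpg_score(seq), 2),
          meets_island_criteria(seq))
```

The AT-rich toy sequence has a CpG score well above 0.6 yet fails the G+C criterion, which mirrors the situation described for the ascidian promoters.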
Contrary to our initial expectation, we failed to identify CpG island-like promoters in the invertebrate genome. Instead, we found that the general features of ascidian promoters are similar to those of CpG-poor vertebrate promoters rather than to CpG island promoters. It is reasonable to consider CpG-poor promoters more ancient because they are found in a wide variety of eukaryotes. Conversely, CpG island promoters must have appeared at an early stage of vertebrate evolution, derived by some mechanism, and have been adopted as important cis-regulatory elements in descendant species. Because the CpG score is just the ratio of the observed to the expected numbers of dinucleotides, a high score does not necessarily mean a high frequency. We defined and used the "CpG content", which showed a substantially different profile from the CpG score in the ascidian genome (Figure 3C). Note that the CpG score and CpG content profiles are dissimilar in the ascidian genome but similar in the human genome. The CpG content will also be important for scrutinizing genomes, especially those of animals other than mammals. It is unlikely that the conventional CpG island definitions, which use only CpG score, G+C content, and length, are applicable to invertebrate genomes. Because the deamination of methylated CpG sites cannot explain the substantial increase in the CpG and G+C contents in the vicinity of vertebrate TSSs, we must search for and examine active mechanisms that may have given rise to CpG islands. Biased gene conversion [18, 34], the condensation of CpG-rich protein-coding sequences by retrotransposition, and the expansion of elements containing the CpG dinucleotide are potential molecular mechanisms. The fact that CpG islands are not well conserved among species may indicate that CpG island loss and gain are active phenomena, occurring up to the present time, even in extant vertebrates.
The number of C. intestinalis genes is reported to be 15,254 in the KH gene model. Whereas a series of operonic genes has a single promoter, alternative promoters have been reported for a large number of genes. The number of all RNA polymerase II promoters, including those of non-coding transcripts, may exceed 20,000. This study targeted the promoters that are active in the embryos. Although we believe that the 6312 promoters analysed here may well represent most of them, we eagerly await techniques with which to identify the TSSs of trans-spliced genes. Utilizing our data, the TSS of the TnI gene was recently identified as the first case for Ciona trans-spliced genes. At least for this gene, no CpG island promoter is seen.
We have experimentally identified and characterized ascidian promoter sequences as the primordial type of vertebrate promoter. As far as we know, this is the first such characterization for non-vertebrate deuterostomes. The sequences near TSSs tend to exhibit a high CpG score and a high G+C content, but their level and extent are actually restricted. Furthermore, the promoter sequences seem to be at least partially methylated. It is unlikely that they were the original type of vertebrate CpG island promoters. Rather than global methylation and subsequent deamination, some active mechanisms and maintenance mechanisms have presumably been required to form such long, CpG-condensed regions in vertebrate animals.
The genomes of more than 50 vertebrate species have been sequenced and even more genomes will be sequenced in the future. Now that an ascidian genome has been shown to lack CpG islands that function in promoter sequences, our curiosity is directed to primitive vertebrates, such as agnathans. It would be premature to draw a strong conclusion at this point. Searching for primitive organisms with CpG island promoters in order to determine the origin of CpG islands will certainly extend our understanding of the sophisticated roles of DNA methylation in higher eukaryotes [39–41].
More than 200 μg of total RNA was isolated from whole mid-tailbud-stage ascidian embryos (12-hour-old embryos), using ISOGEN (Nippon Gene) according to the manufacturer’s protocol. The RNA was subjected to the oligo-capping method. In short, after successive treatments with bacterial alkaline phosphatase (TaKaRa) and tobacco acid pyrophosphatase (Ambion), the treated RNA was ligated to an RNA oligonucleotide with the sequence 5’- AAU GAU ACG GCG ACC ACC GAG AUC UAC ACU CUU UCC CUA CAC GAC GCU CUU CCG AUC UGG -3’ using T4 RNA ligase (TaKaRa). After treatment with DNase I, the poly(A)+ RNA was selected and used as the template for first-strand cDNA synthesis with the primer 5’- CAA GCA GAA GAC GGC ATA CGA NNN NNN C -3’. The cDNA was then used as the template for PCR with the primers 5’- AAT GAT ACG GCG ACC ACC GAG -3’ and 5’- CAA GCA GAA GAC GGC ATA CGA -3’. The products were size fractionated by polyacrylamide gel electrophoresis. Approximately 1 ng of the 150-200-bp fraction was used for the sequencing reactions on the Illumina Genome Analyzer (Solexa). Both 36-cycle and 48-cycle sequencing reactions were performed on the same samples. The DNA sequences have been deposited in [DDBJ Sequence Read Archive: DRA000156].
Illumina Pipeline (GAPipeline 1.0) was used to extract the sequenced reads from the image data. The spliced leaders (SLs) in the trans-spliced sequences were replaced with the splice acceptor sequence “ag” for the subsequent mapping. The sequences were aligned to the KH assembly using SeqMap for the 36-cycle reads, or using BLAT for the 48-cycle reads because of the high rate of cis-splicing. Because of the highly polymorphic genic features of this organism, we used a 90% match criterion, including insertions and deletions. If the 5’ end of a read was not aligned to the genome, the read was eliminated from the analysis. Multiple hits were removed, and only single best hits were considered for the subsequent analyses. Sequence logos were drawn with WebLogo 2.8.2 (http://weblogo.berkeley.edu/). The CpG score was defined as CpG × N / (C × G), where C, G, and CpG are the observed numbers of C, G, and CpG, respectively, and N is the fixed window size. The CpG content defined in the present study was CpG / (N - 1). The assembly used for the human genome was UCSC hg18. To select human CpG-poor (CpG score < 0.5) and CpG-rich (CpG score > 0.6) promoters, we used DBTSS 6.0 (http://dbtss.hgc.jp/) and calculated CpG scores in 200-bp regions around representative TSSs. The analysis was limited to protein-coding genes, but all the alternative promoters deposited in the database were included (out of all 101,436 promoters, 32,122 were for protein-coding genes). The numbers of CpG-poor and CpG-rich promoters were 18,034 and 12,493, respectively. Dinucleotides other than pyrimidine-purine (YR) were not considered in the analysis of the usage of the YR motif. The total numbers of YR motifs at TSSs were 3,610, 8,162, and 8,610 for ascidian, human CpG-poor, and human CpG island promoters, respectively. All the sequence analyses were performed with Perl scripts, which are available upon request.
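The two window statistics defined above can be computed directly from a sequence window. The sketch below is a Python illustration of the stated formulas only (the original analyses used Perl scripts); the function name and the toy windows are hypothetical.

```python
def cpg_stats(window):
    """Return (CpG score, CpG content) for one fixed-size window.

    CpG score   = CpG * N / (C * G)
    CpG content = CpG / (N - 1)
    where C, G, and CpG are observed counts and N is the window size.
    """
    seq = window.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    score = cpg * n / (c * g) if c and g else 0.0
    content = cpg / (n - 1) if n > 1 else 0.0
    return score, content


# Two toy 20-nt windows: one AT-rich with two CpGs, one made only of CpGs.
window_a = "ATATACGATATATACGATAT"
window_b = "CG" * 10

for name, window in (("AT-rich", window_a), ("CpG-only", window_b)):
    score, content = cpg_stats(window)
    print(name, round(score, 3), round(content, 3))
```

The AT-rich window has the higher CpG score but the lower CpG content, which illustrates why a high score does not necessarily indicate a high CpG frequency.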
TSS: transcription start site
The authors are grateful to Dr. Kenneth E. M. Hastings (McGill University) for discussions. We thank Ms. Ritsuko Kato and Ms. Mayu Fushimi (Ochanomizu University) for technical assistance. Computation time was provided by the supercomputer system at the Human Genome Centre, Institute of Medical Science, University of Tokyo. This work was supported by the Japan Society for the Promotion of Science (JSPS) through its "Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program)", the Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST), KAKENHI (22310120 and 23770273), and the Global COE Program (Center of Education and Research for Advanced Genome-Based Medicine), MEXT, Japan.
This article has been published as part of BMC Genomics Volume 12 Supplement 3, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S3.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.