Intergenic, gene terminal, and intragenic CpG islands in the human genome
© Medvedeva et al. 2010
Received: 26 February 2009
Accepted: 19 January 2010
Published: 19 January 2010
Skip to main content
© Medvedeva et al. 2010
Received: 26 February 2009
Accepted: 19 January 2010
Published: 19 January 2010
Recently, it has been discovered that the human genome contains many transcription start sites for non-coding RNA. Regulatory regions related to transcription of this non-coding RNAs are poorly studied. Some of these regulatory regions may be associated with CpG islands located far from transcription start-sites of any protein coding gene. The human genome contains many such CpG islands; however, until now their properties were not systematically studied.
We studied CpG islands located in different regions of the human genome using methods of bioinformatics and comparative genomics. We have observed that CpG islands have a preference to overlap with exons, including exons located far from transcription start site, but usually extend well into introns. Synonymous substitution rate of CpG-containing codons becomes substantially reduced in regions where CpG islands overlap with protein-coding exons, even if they are located far downstream from transcription start site. CAGE tag analysis displayed frequent transcription start sites in all CpG islands, including those found far from transcription start sites of protein coding genes. Computational prediction and analysis of published ChIP-chip data revealed that CpG islands contain an increased number of sites recognized by Sp1 protein. CpG islands containing more CAGE tags usually also contain more Sp1 binding sites. This is especially relevant for CpG islands located in 3' gene regions. Various examples of transcription, confirmed by mRNAs or ESTs, but with no evidence of protein coding genes, were found in CAGE-enriched CpG islands located far from transcription start site of any known protein coding gene.
CpG islands located far from transcription start sites of protein coding genes have transcription initiation activity and display Sp1 binding properties. In exons, overlapping with these islands, the synonymous substitution rate of CpG containing codons is decreased. This suggests that these CpG islands are involved in transcription initiation, possibly of some non-coding RNAs.
Most mammalian DNA is depleted with CpG dinucleotides  whose fraction in a mammalian genome is close to 0.2-0.25 of the value expected from presupposition of random distribution . The shortage of genomic CpG dinucleotides is believed to be the consequence of frequent mutation ofmCpG to TpG dinucleotides [1, 3–5]. Nevertheless, some mammalian genomic segments called CpG islands (CGIs)  possess a high G+C content, with a frequency of CpG close to the expected value. In bioinformatics, CGIs are usually defined as DNA segments that are longer than 200 bp, have above 50% G+C content, and have a CpG frequency of at least 0.6 of that expected assuming letters at each sequence position occurring independently at random with the given composition . The number of CGIs varies substantially in different vertebrate species . There are about 50,200 such CGIs in the human genome, of which approximately 29,000 are in repeat-masked sequences .
The increased number of CpG sites in CGIs is often correlated with low methylation of cytosine in CpG dinucleotides [6–9]. This effect is usually explained by postulating protection of these sites from DNA methyltransferase by abundant and commonly utilized DNA binding proteins including Sp1 , E2F , CTCF  and others. The Sp1 protein is particularly strongly implicated in CGI functioning. Gardiner-Garden and Frommer observed  that CGIs contain many "G/C boxes", composed of the sequence GGGCGG, demonstrated to act as binding sites for the Sp1 transcription factor . Later, it was found that Sp1 can bind to both methylated and non-methylated variants of this binding site , and can protect non-methylated sites from methylation .
In his recent study Rozenberg et al.  demonstrated that binding sites of several regulatory proteins, including Sp1, contain a CpG pair and play an important role in the formation of sequences of mouse promoters which regulate the expression of housekeeping genes. This suggests that CGIs overlapping with promoters of housekeeping genes are related to their transcription initiation. According to  60% of widely expressed human genes and up to 40% of tissue-specific genes are associated with CpG islands. It has been shown lately that 72% of all promoters have high CpG content, and only 28% are in the class with low CpG content .
CGIs located near 5' region of known genes account for only a fraction of all CGIs in the genome (about 25% for CGIs longer than 500 bp in the HOVERGEN compilation , and about 50% according to our estimations, see below). Although many non-5' associated CGIs overlap with repeats [18, 19], many do not [18, 20], but instead are frequently positioned 3' to known genes, overlapping with final transcribed exons [3, 20]. Amazingly, CGIs located in these 3' regions have attracted almost no interest, even though these CGIs were mentioned in the publication that initially coined the term "CpG island" . More recently, computational approaches have also identified intragenic CGIs that overlap neither TSS nor final exons , although function of these CGIs have not yet been assessed.
CGIs not associated with 5' region of any gene can perform important biological functions. For instance, a C-to-T substitution in CGI encompassing parts of exon 15 and intron 15 of UBA1 affects expression of this gene . A CGI located within intron 10 of KCNQ1 and associated with an oppositely-oriented RNA transcript is involved in imprinting (paternal repression) of its locus . Imprinting of MAP3K12 gene is associated with differential methylation of a CGI located in its last exon . Many CpG islands are located near the 3' ends of genes associated with cancer development .
The main objective of this work was to study properties of CGIs located far from TSS of protein coding genes. We demonstrated that substantial selection pressure is applied to CpG pairs in CGIs independently from CGI location in the reference to gene starts locations, which implies functional importance of CpG pairs. We assumed with  that most of CGIs are involved in transcription initiation, thus one of our objectives was to study transcriptional activity of CGIs, particularly of CGIs located far from 5' regions of any protein coding gene. To do this we used C ap A nalysis G ene E xpression (CAGE) tags identified in the FANTOM project [25, 26]. We also assessed the representation of binding motifs recognized by regulatory factor Sp1 in CGIs located in 5', 3' and internal gene regions, as well as out of any known genes. In addition, we re-analyzed the published ChIP-chip data on Sp1 binding in chromosomes 21-22 and compared Sp1 binding preferences in DNA not overlapping with CGIs as well as in CGIs located in different gene segments and out of any genes. Fraction of non-5' CGI strongly enriched with CAGE tags was studied with special care; we observed substantial overrepresentation of probably strong Sp1 binding sites in such CGIs and collected known reports of transcription starts sites of long non-coding RNAs associated with such CGIs.
Tendency of CGIs to overlap with exons has been observed many times at limited data sets [16, 27, 28]. As the first step of our study we decided to give a quantitative estimation of this tendency separately for exons and introns located in different gene regions. Exons and introns were categorized according to their location within the gene (see Methods). For exons and introns from each category the total length of overlap with CGIs was calculated. We used Monte Carlo simulations to assess the statistical significance of the observed total overlap length. A round of simulation was performed as follows. Exons (introns) were located as in the human genome and "CGIs" were sampled. Intervals between CGIs were sampled from the interval distribution evaluated from the genome, whereas the lengths of "CGIs" were shuffled length of genuine CGIs. The total overlap between exons (introns) and "CGIs" was calculated. Then the whole procedure was repeated with switched CGI and exon sets, i.e. the annotated CGIs and "simulated exons (introns)" were taken. For each category of gene elements such simulations were repeated 10,000 times and the observed values of exon (intron) overlapping with CGIs were normalized for the average simulated values.
It is known that exons are usually more GC-rich than introns . At the same time, the algorithm for CGI computational identification uses the increased C+G content of a test DNA segment as one of the CGI conditions. On the other hand, a CG-rich exon can have an increased number of CpG dinucleotides owing to its specific amino acid composition, e.g. many arginine codons. Therefore, this exon would be misidentified as CGI, and many such events would explain an increased overlapping between CGIs and exons.
A more interesting alternative explanation of frequent overlapping of CGIs with exons is that it is caused by the common preferences of both segments to be located in some particular DNA regions. In this case the terminal intron segments that are close to exons would also overlap with CGIs more frequently than internal segments of long introns. To test this, we selected 200 bp intron fragments adjacent to donor and acceptor splice sites. As in the previous section, we used Monte Carlo simulations to assess expectation of the observed overall overlap lengths.
Ratio of bona fide CGIs-exons (introns) overlap and "simulated" overlap average
Bona fide overlap length/Simulated overlap average
3000 bp flank region
exon in 5' UTR
intron in 5' UTR (5' 200 bp)
intron in 5' UTR
intron in 5' UTR (3' 200 bp)
initial coding exon
initial intron in coding area (5' 200 bp)
initial intron in coding area
initial intron in coding area (3' 200 bp)
internal intron (5' 200 bp)
internal intron (3' 200 bp)
terminal coding exon
terminal intron in coding area (5' 200 bp)
terminal intron in coding area
terminal intron in coding area (3' 200 bp)
exon in 3' UTR
intron in 3' UTR (5' 200 bp)
intron in 3' UTR
intron in 3' UTR (3' 200 bp)
dN, dS and dN/dS
d N /d S
d N /d S
d N /d S
CG containing codons in CGI
CG containing codons out of CGI
AG containing codons in CGI
AG containing codons out of CGI
GC containing codons in CGI
GC containing codons out of CGI
GA containing codons in CGI
GA containing codons out of CGI
All codons in CGIs
All codons out of CGIs
The nonsynonymous substitution rate for codons containing CpG dinucleotides was very similar to that for other codons and depended only weakly on the overlapping with CGIs (Figure 2). The main factor affecting rates of nonsynonymous substitutions is the codon location near one of the gene termini. Figure 2 shows "V"-shaped d N plots for all the codons outside of CGIs, which indicates that internal exons are less variable than both terminal exons. This effect may be related to the increased protein variability at the N and C termini. At the same time, codons overlapping with CGIs show almost equal d N for the internal and the final exons. Thus, proteins coded by genes having a CGI at their 3' end are generally more conserved at their C end.
In contrast, synonymous substitution rates calculated for codons containing CpG dinucleotides were different from those for other codons and dramatically depended on their overlapping with CGIs (Table 2 and Figure 3). Generally, for codons with CpG dinucleotides overlapping with CGI resulted in d S decrease approximately two-fold (Table 2 and Figure 3). For codons that did not contain CpG the effect of CGI on d Swas much smaller. This effect did not depend on the gene region: a CGI overlapping with a 5', intragenic or 3' exon had a similar effect on d S, reducing the synonymous substitution rates of CpG containing codons by 49%, 40% and 37%, respectively.
Figure 4 shows the d N /d S ratio which reflects the selection pressure at the protein level . For codons that do not contain CpG the d N /d S ratios are almost identical for codons that do overlap and don't overlap with CGIs. Thus, it appears that selection at the protein level for non CpG-containing codons inside or outside of CGIs is practically the same. For CpG-containing codons one can see that the d N /d S ratios calculated for codons overlapping and not overlapping with CGIs are substantially different, and both ratios are much lower (red and light green curve, Figure 4), which indicates a comparatively greater stabilizing selection at such codons at the protein level.
The observation that CpG containing codons have lower d S when they overlap with CGIs gives additional evidence that amino acid composition (e.g. abundance of arginine) cannot explain the abundance of CpG dinucleotides and the frequent overlap of CGIs and exons. Function of CGI indeed seems to be more related to DNA or RNA.
In the previous sections we have demonstrated that in exons located far downstream from TSSs and overlapping with CGIs the synonymous substitution rate of CpG-containing codons is reduced. In addition, CGIs found far downstream from TSS often overlap with exons, but such CGIs are unlikely to be the misrecognized exons. Assuming that 5' related CGI are involved into transcription initiation [15, 17] we have investigated if CGIs located in other genome regions also participate in transcription initiation. To test this suggestion we have studied association of computationally identified CGIs with transcription start sites as identified by CAGE tagging [25, 26]. For our analysis we categorized CGIs into 4 non-overlapping classes: (1) 5' CGIs; (2) intragenic CGIs; (3) 3' CGIs; and (4) intergenic CGIs (see Methods, CGI classes). The number of CGI classes is smaller than the number of gene elements because the same CGI can often overlap with several gene elements, e.g. 5' UTR, the initial coding exon, the first intron, and sometimes other exons as well as downstream located introns.
CAGE tags in different CGIs classes
Fraction of CGIs, %
Total length of CGIs
Average GC-content, %
#CGIs with CAGE-tags
Total length of CGIs with CAGE-tags
Fraction of total length of CGIs with CAGE-tags in class, %
Total CAGE-tags in CGI class
Fraction of CAGE-tags in CGI class, %
CGIs with at least one CAGE-tag, %
Average CAGE-tags per CGI
Average CAGE-tags per CAGE-containing CGI
Density of CAGE-tags in CGIs, bp-1
Density of CAGE-tags in CAGE-containing CGIs, bp-1
One CAGE-tag per #bp
As it was already reported in  CAGE tags tend to form dense clusters in 5' CGIs. CGIs located elsewhere contain much less CAGE tags than 5' CGIs, but, interestingly, some intragenic, 3' or intergenic CGIs contain clusters of CAGE tags with the number and the density of CAGE tags comparable with those found in CAGE clusters in 5' CGIs. Additional file 1 contains intragenic and 3' CGIs that have greater than 40 CAGE tags per a CGI (which approximately corresponds to the average number of CAGE tags per 5' CGI). 3' CGIs usually contain more CAGE tags than intragenic CGIs. In some sense this agrees with the tendency of CGIs to overlap with the final coding exon rather than with internal exons.
Authors of  reported that CGIs overlapping mouse promoters of housekeeping genes contained an increased number of binding sites for different transcription regulatory factors, in particular Sp1, ETS, and NRF-1. Since binding of Sp1 is well studied with experimental methods, we decided to assess Sp1 binding in CGIs of different localization relative to known genes. We used both bioinformatics methods of identification of Sp1 recognition motifs in DNA sequence as well as re-assessment of the published experimental data.
CGIs were scanned for presence of Sp1 factor binding sites using a positional weight matrix (PWM) constructed from experimental data from the TRANSFAC database. We selected a threshold that identified 90% of Sp1 binding sequences from our experimentally confirmed training set (see Methods). To evaluate the representation of Sp1 binding sites in CGIs, we calculated the P -value (see Methods) for each CGI, i.e. the probability of a random sequence of the same length and the same dinucleotide content to contain at least this number of Sp1 occurrences. This P-value was calculated with the help of the AhoPro program . We compared results obtained for different CGI classes.
It is noteworthy that CGIs containing more than 40 CAGE tags contain a much more high scoring Sp1 recognition motifs than CGIs without evidence of high transcription activity (Figure 5), independently from their localization in relation to genes. Surprisingly, the greatest overrepresentation of high-scoring Sp1 recognition motifs sites is characteristic for 3' CGIs with more than 40 CAGE tags, but not for 5' CGI enriched with CAGE.
For further validation of Sp1 protein binding within CGIs, data on Sp1 transcription factor binding sites, experimentally assessed with ChIP-chip technology and published in  were analyzed. Cawley et al. detected frequent Sp1 binding sites far from 5' regions of any gene. We used their data to justify that Sp1 protein binds preferably within CpG islands regardless of their location in relation to genes.
Sp1 binding regions published in  are usually longer than 1 kB, which is significantly longer than many CGIs, especially those located far from TSS of genes. The authors of  used an extensive filtration procedure, which can lead to a high false negative rate, to limit their results to binding sites frequently occupied with Sp1. Therefore, the raw data were re-analyzed to allow comparison between ChIP signals within CGIs and those in other DNA segments. Additionally ChIP signals within CGIs located in different gene segments were examined.
Statistical significance of Sp1 signal
Type of the region
Statistical significance of Sp1/input ratios
Type of the region
Transcript starts in non 5' CGI
Number of CGIs with starts of long transcripts
Type of CGIs
Unverified RefSeq genes
Gene Bank mRNA and mRNA-like RNA
Spliced and unspliced ESTs
No transcripts found
CGIs with more that 40 CAGE-tags per CGI
CGIs with 20-40 CAGE-tags per CGI
The total number of 3' and intragenic CGI with more then 40 CAGE tags is rather small: 22 and 30 respectively. Decreasing the threshold for CAGE tags per CGI to 20-40 leads to 41 3' CGI and 44 intragenic CGIs. However, the number of highly CAGE-enriched non 5' CGIs is not large enough to render a convincing statistical significance value.
In this study we tried to systematically assess properties of CpG islands that are found far from transcription start sites of protein coding genes. About 43% of all CGIs belong to this class. Our study of CGIs which overlap with exons demonstrates that stabilizing selection protects CpG pairs located in CGIs from substitutions which do not affect the encoded amino acid sequence. We observed that many CGIs that are found far from TSS overlap with CAGE tags and thus participate in transcription; furthermore, highly CAGE-enriched CGIs are bound by transcription regulatory factor Sp1 with remarkably high significance. Although function of CGIs is still disputed, there is growing evidence that CGIs located near gene starts participate in transcription regulation [15, 17, 35]. Our finding allowed us to suggest that many CGIs that are found far from the start of any known protein coding gene are also participate in transcription.
As we have demonstrated, many such CGIs often overlap with exons, particularly the terminal gene exons. Many CGIs are located within a gene but far downstream from TSS (see Table 3). The aggregated number of genes with CGIs near their 3' end is estimated at 5 - 10%. Interestingly, it was observed recently  that some genes in human T-cells have an uncommon methylation pattern with a decreased methylation level observed near both gene termini.
We have detected many intergenic CGIs (see Table 3). It is known that UCSC browser table Knowngenes contains only highly verified genes, and excludes some genes with low justification scores. Based on our analysis, some CGIs considered in this study as intergenic may be related to these yet unverified genes.
The selection directed to maintain CGI sequence properties is not strong enough to overcome a strong selection at protein level applied to non-synonymous substitutions. The non-synonymous substitution rate differs only weakly for codons overlapping and non-overlapping with CGIs. However, the d N /d S ratio for CpG containing codons not overlapping with CGIs is much smaller than for codons overlapping with CGIs. This fact indicates that in this case selection at the protein level needs to be stronger to counterbalance the higher mutation rate.
The evolutionary distance between human and mouse is rather large, with the approximate sequence divergence for these species close to 0.5 substitutions for a neutrally evolving site . This value agrees very well with the d S values observed for codons that do not contain CpG. However, d S values for CpG containing codons are about twice as large as those for other codons. This is much less than the approximate ten-fold increase of mutation rate . The possible explanation may come from the effect suggested by Kondrashov et. al. . The idea is that hypermutable CpG dinucleotides  at neutral and pseudoneutral sites are likely to be destroyed by mutations and unlikely to be found in the alignment of human and mouse [41–43]. Those CpG dinucleotides that remain aligned in human and mouse genomes are likely to be stabilized by selection pressure of a different nature. Thus, even CpG dinucleotides that do not overlap with CpG islands at synonymous positions may be stabilized with some selection of yet unknown type. Interestingly, Bock et al  who specifically identified CGIs related to chromatin epigenetic state observed that about 90% of such islands overlapped with highly conserved DNA elements, including at least 20% of CGIs that did not overlap with TSS.
Although CGI overlap with disproportionally large number of CAGE tags (about a half of total CAGE tags are found within CGIs) many intragenic, intergenic, and gene terminal CGIs overlap with a small number of CAGE tags or with no CAGE tags at all. However, we believe, that FANTOM database can have some functional transcription starts missing.
Second, CAGE tags found in FANTOM database are obtained only in a number of tissues. CGIs located in 5' gene regions are usually found at starts of broadly expressed housekeeping genes . Transcripts from TSS tagged in 3' regions of known genes could be tissue specific. Since the number of tissues studied is limited, the number of tagged TSS should be less than that in 5' gene regions.
The suggested role of non-5' CGIs in transcription initiation agrees with the excessive Sp1 binding in CGIs. The observation that Sp1 binding sites are often present in CGIs is not new . It is noteworthy that CGIs enriched with CAGE tags contain more high-scoring Sp1 recognition motifs (Figure 5). Abundant Sp1 binding in gene 3' or intergenic regions was observed in genome-wide site location experiments , which reported that 36% of the clusters of Sp1, Myc and p53 binding sites lie within or immediately 3' to well-characterized genes. Authors of  assumed that in these cases, non-coding RNA transcription may be initiated.
Recent studies of the mouse genome  demonstrated that a large number of ncRNA are initiated in the 3' regions of the genes, with specific enrichment at the 3' terminus of the final exon. There are published reports which show that sometimes long ncRNA are synthesized to open large chromatin segments for subsequent transcription initiation [45, 46]. On the other hand a substantial number of these ncRNAs are complementary to known genes as anti-sense strands, which has led to the suggestion of an additional mechanism of gene silencing by natural antisense interference . The authors of  also observed that sense-antisense pairing was almost universally associated with candidate imprinted loci. As genetic imprinting is frequently associated with an altered methylation status of CpG islands [48, 49], CGIs located in 3' gene regions [23, 47] or intragenic CGIs [21, 22] may play an important role in this process by regulation of gene expression via inducing antisense-based gene interference.
Abundant non-coding transcripts discovered recently in all parts of a genome allow suggesting that there should be regulatory regions associated with transcription initiation of such RNA types. This agrees with a large number of CGIs not associated with transcription start site of any known protein coding gene. Here we demonstrate that many of such CGIs appear to be related to transcription initiation and at least some of them contain CpG pairs stabilized by natural selection. Expression of RNA controlled by promoters overlapping with these CGIs seems to be regulated by the same transcription factors as expression of protein coding genes, therefore these RNA molecules seem to be involved into the regulatory cascades and cellular processes possibly as non-coding RNA of some function. Additional studies, both experimental and in silico are needed to verify this hypothesis.
We used a MySQL database and a set of Perl and Ruby scripts as analysis tools. The source code of the scripts, test sequence sets, and lists of genes under putative regulation by non-5' CGIs can be found at http://line.imb.ac.ru/CGI/
The sequence of the human genome (hg17) and the Knowngene Table were downloaded from http://genome.ucsc.edu/. Redundant copies of genes and multiple copies with the same name but different locations were removed. Genes with less then 3 exons were also excluded. The resulting set amounted to 35,915 entries, derived from an initial 39,368 entries in the Knowngenes Table. Table Cpgislandext (UCSC) was used as the set of CpG islands. We have excluded CGIs with 'random' chromosome location, retaining 27,437 out of 27,801 computationally annotated CGIs. All Knowngenes genes were taken into account when testing if there were any protein coding gene TSS near intergenic or 3' located CGIs. Tables RefSeq genes, Human mRNA, Spliced EST and Human EST (USCS) were used to find starts of potentially protein coding and noncoding genes.
We compiled 11 sets of gene fragments, defined as follows: (1) 5'-flank regions began 3 kb upstream from TSS and extended till first found TSS. Overlaps with any transcribed sequence were excluded. (2) 5' UTR-exon regions contained non-translated 5' exons (or exon segments); overlaps with any translated sequences were excluded. (3) 5' UTR-intron regions contained introns separating non-translated exons or the last non-translated and the initial translated exons; overlaps with any translated sequences were excluded. (4) Initial coding exons contained entire initial coding exons for all gene or exon parts; overlapping exons of different isoforms were merged. (5) Initial introns contained introns separating the 1st and the 2nd coding exons, overlaps with regions included into groups (1-4) were excluded. (6) Internal exons contained all translated internal exons, excluding the initial and the final exons of any gene. (7) Internal introns contained all introns, but not the initial and the final introns and overlaps with any translated sequence. (8) Final exons contained the last translated exons or their parts. (9) Final introns contained the introns separating the final exons and the previous ones, excluding overlaps with any translated sequence. (10) 3' UTR exons contained 3' non-translated exons or their parts, excluding overlaps with any translated sequence. (11) 3' UTR introns contained introns separating the 3' non-translated exons or the 3' non-coding and the final coding exon, excluding overlaps with any translated sequence.
Since introns are usually much longer than exons, we also considered 200 bp intronic segments flanking the donor and the acceptor splice sites. The resulting regions, called " intron terminal regions " are comparable with exons in their length. We used 200 bp intron regions adjacent to exons as an additional control set to exclude the influence of the increased exon GC composition, which can be misinterpreted as CGI during computational identification. All genes elements are available in Additional file 2.
We considered 4 different classes of CpG islands: (1) 5' CGIs that overlapped with gene elements from groups 1-5 above; (2) intragenic CGIs that overlapped with gene elements from groups 6-7; (3) 3' CGIs that overlapped with gene elements from groups 8-11 or with a region 3 kb downstream of any gene; and (4) intergenic CGIs that were located at least 3 Kb from any known gene upstream or downstream. All genes, including single and double-exon genes were taken into account in this case. If a CGI contained at least one bp of a 5' region of any gene it was considered as a 5' CGI regardless of how many additional regions it overlapped. If a CGI contained at least one bp of a 3' region of any gene, but not overlapped with its 5' region, it was considered as a 3' CGI. If a CGI contained at least one bp of a known gene, but not overlapped with its 5' or 3' region, it was considered as an intragenic CGI. A CGI was considered as intergenic if it did not belong to any of these classes. Additionally, we used a control set of CG-rich random sequences with the length and dinucleotide distribution estimated from each of CGIs containing more than 40 CAGE tags. The overall number and length of CGIs of different classes are given in Table 3. Since we took a special care to remove putative 5' CGIs from the other classes, the majority of all CGIs fells into 5' CGI class (Table 3). CGIs of all classes are available in Additional file 3.
Given two sets of non-overlapping genome intervals (e.g. CGIs and exons) we used 10,000 Monte-Carlo simulations to compute expected distribution of aggregated overlap length. All length distributions were computed independently on each chromosome. During simulations intervals of one set corresponded to genome coordinates of elements (e.g. CGIs) and the other set contained intervals with lengths corresponding to those of the second set of genome elements but located at random positions in the chromosome. Each run of simulations was repeated twice with a different "fixed" element set (see Figure 1). Program source (Ruby 1.8) and additional details are available in Additional file 4.
To estimate d N and d S we used the EDAS database , which contains 28,530 alignments of human and mouse genes. For genes with several isoforms, the longest isoform was taken. Genes with less than three coding exons were excluded. We also excluded genes which had less than 70% identity within protein alignment for any coding exon. The resulting dataset contained exons from 8,775 genes. Six groups of protein coding exon segments were defined: 5', internal, and 3' exons, overlapping and non-overlapping with CGIs. In each of these groups, we selected codons containing a CG dinucleotide; if a CG dinucleotide was split between two adjacent codons, both codons were taken. A similar procedure was performed for codons containing AG, GA, GC dinucleotides. Sequences from each group as well as codons containing CG, AG, GA, GC were concatenated.
The transitional to transversional substitution rate ratio (R), as well as the numbers of synonymous substitutions per synonymous site (d S) and nonsynonymous substitutions per nonsynonymous site (d N ) were estimated by the Ina method . Unlike maximum likelihood methods, this was effective for very long alignments (~3*106 bp), and was fast enough to allow bootstrap resampling. We used our own implementation of this method (developed in Perl). The 95% confidence intervals for evolutionary parameters were calculated using bootstrap percentiles . 2000 bootstrap replications were used. Sequences of all groups and scripts used for estimation of substitution rates are available in Additional file 5.
The table of CAGE tags mapped on the RepeatMasked hg17 chromosome-build is available at http://gerg01.gsc.riken.jp/cage/download/hg17prmtr/cage.rep_tag.2005-01-16.chr_all_gff.tar.gz
This table contains 1,597,993 entries. We downloaded alignments of CAGE tag sequence with the genome region with a minimal identity of 0.88.
We used a positional weight matrix (PWM) [52, 53] as a model. A PWM for Sp1 was constructed by aligning experimental data contained in the TRANSFAC  database. Sequences containing binding sites for human Sp1 (mostly footprints), were obtained from the TRANSFAC database (July 2007 release), mapped on the human genome, extracted with genome flanking sequences, and realigned using the SeSiMCMC Gibbs sampler  (see Additional file 6 for details). The most frequent sequence length between different SeSiMCMC runs was equal to 9 bp and we accepted that all motifs had this length. A PWM was constructed from the alignment obtained with SeSiMCMC using the formula described in . The resulting alignment included 221 genome sequences (see motif logo in Figure 9).
To evaluate the P-value (i.e. to calculate the statistical significance of the observed number of Sp1 sites scoring higher than the fixed threshold in the test sequence) we used AhoPro . For a test sequence containing k possibly overlapping PWM hits scoring higher than threshold T, the P-value was defined as the probability of observing no less than k such (possibly overlapping) hits in the random (i.i.d) sequence with the same nucleotide distribution and length as in the tested CGI.
Experiments in  were conducted on the Affymetrix GeneChip® Human Tiling 1.0R Array Set. The results were downloaded from http://transcriptome.affymetrix.com/publication/tfbs/. Those chips contain unique 25 base-pair long sequence-tags for human chromosomes 21 and 22. The experiments for Sp1 were performed on two biological samples with three technical replicates for each chip. We used an modified version of TiMAT  to re-analyze the published cel-files. To allow comparison between different experiments the probes of each chip were median scaled to a signal value of 500. Additionally quantile-quantile  normalization was performed over all chips. The signal values from the two biological samples and the technical replicates were averaged to obtain one value for each probe. Signal values of biological probes of the Sp1 antibody treated and untreated control experiment were collected. As it is recommended in  mis-match probes (MM) were excluded and only perfect-match probes (PM) were considered for further investigation. Our aim was to compare the statistical distribution of PM values for tags located far from CGIs with PM values for tags overlapping different classes of CGIs as well as to compare signals for Sp1 antibody treated samples with those for untreated DNA.
transcription start site
cap analysis gene expression
Expressed sequence tag
Affimetrix perfect match probe
Affimetrix mis-match probe
The authors are pleased to thank Erica Golemis for valuable comments on the manuscript. We thank Biobase GmBH for providing free access to the Transfac database and personally Alexander Kel for his interest to this work and many discussions. The authors are also pleased to acknowledge Shamil Sunyaev, Andrei Mironov, Inna Pertsovskaya, Alexander Favorov, Sandy Der and Thomas Stranzl for fruitful discussions and Vadim Radionov for help with figures preparation. This study was partially supported with INTAS project 05-1000008-8028 (YAM, MVF, DBM, EOE, VJM), Russian Foundation of Basic Research projects 07-04-01584 (YAM, MVF, VJM), 07-04-01623 (DBM, IVK), 09-04-92745 (EOE), 07-04-12243 (NJO), International Science & Technology Center project 3737 (NJO), Russian Academy of Sciences Presidium program "Molecular and Cellular Biology" (EOE, NJO, VJM), Russian Academy of Sciences Presidium program "Basic Sciences for Medicine" (EOE), HHMI 55005610 (EOE), and Russian Federal Agency for Science and Innovation State Contracts 02.531.11.9003 (YAM, MVF, DBM, IVK, VJM) and 02.740.11.5008 (YAM, MVF, DBM, IVK, VJM).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.