Periodicity of SNP distribution around transcription start sites
© Higasa and Hayashi; licensee BioMed Central Ltd. 2006
Received: 13 October 2005
Accepted: 03 April 2006
Published: 03 April 2006
Several million single nucleotide polymorphisms (SNPs) have already been collected and deposited in public databases, and these are important resources not only as markers to identify disease-associated genes, but also for understanding the mechanisms that underlie genome diversification.
A spectrum analysis of SNP density distribution in the genomic regions around transcription start sites (TSSs) revealed a remarkable periodicity of 146 nucleotides. This periodicity was observed in the regions that were associated with CpG islands (CGIs), but not in the regions without CpG islands (nonCGIs). An analysis of the sequence divergence of the same genomic regions between humans and chimpanzees also revealed a similar periodical pattern in CGI. The occurrences of any mono- or di-nucleotide sequences in these regions did not reveal such a periodicity, thus indicating that an interpretation of this periodicity solely based on the sequence-dependent susceptibility to mutation is highly unlikely.
The periodical patterns of nucleotide variability suggest the location of nucleosomes that are phased at TSS, and can be viewed as the genetic footprint of the chromatin state that has been maintained throughout mammalian evolutionary history. The results suggest the possible involvement of the nucleosome structure in the promoter function, and also a fundamental functional/structural difference between the two promoter classes, i.e., those with and without CGIs.
Several million single nucleotide polymorphisms (SNPs) have already been collected and deposited in public databases, and these are important resources not only as markers to identify disease-associated genes, but also for understanding the mechanisms that underlie the diversification of the organism. The nucleotide diversity of the human genome sequence appears to fluctuate from region to region [3–5]. The majority of SNPs are believed to have no biological consequence, and their diversity is therefore primarily determined by the mutation rate within the germ cells, although it may be affected by the selective pressure that operates at the individual level. In this study, we used a spectral analysis approach to identify the pattern of nucleotide variability around the transcription start sites (TSSs), and to survey its biological implications.
In many genes, the position of the TSS fluctuates to some degree, and this fluctuation could cancel out the effects of phasing to the TSSs. We thus evaluated the periodicity in two classes of genes: those with small fluctuations and those with large fluctuations. The extent of fluctuation was defined as follows. We chose 4,660 genes for which more than 10 oligo-capped clones were mapped, and divided them into two halves according to the start site fluctuation, which was estimated as the standard deviation of the start positions. A spectrum analysis of the SNP density distribution of these two classes revealed a stronger signal (4.5 × 10⁻³) at the 146 bp periodicity for the TSSs with small fluctuations than for those with large fluctuations (1.5 × 10⁻³).
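The split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the input layout (a dictionary from gene name to clone start positions), the function name, and the use of the population standard deviation are all assumptions, since the text does not specify them.

```python
from statistics import pstdev


def split_by_tss_fluctuation(clone_starts, min_clones=10):
    """Split genes into low- and high-fluctuation halves.

    clone_starts: dict mapping a gene name to the list of start positions
    of its mapped clones (hypothetical input layout).
    """
    # Keep only genes with more than `min_clones` mapped clones.
    sd = {
        gene: pstdev(positions)  # population SD; the choice is an assumption
        for gene, positions in clone_starts.items()
        if len(positions) > min_clones
    }
    # Rank genes from the most stable to the most variable start site.
    ranked = sorted(sd, key=sd.get)
    half = len(ranked) // 2
    # With an odd number of genes, the extra gene falls in the second half.
    return ranked[:half], ranked[half:]
```

Each half would then be analysed separately, so that the strength of the periodic signal can be compared between genes with stable and variable start sites.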
This periodical distribution of SNP density raised the question of whether it might be caused by sequence features around the TSSs. We examined the distribution of each of the four nucleotides and of all 16 dinucleotides, including the CpG dinucleotide, which is known to have a higher mutation rate (see Additional file 1). We found various periodical distributions for some mono- and di-nucleotide sequences, but none of them showed the 146-nucleotide periodicity. The sequence-dependent susceptibility to mutation is thus unlikely to explain the periodicity of the diversity profile.
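A positional occurrence profile of the kind examined here could be computed as in the sketch below. It assumes that all TSS regions are given as equal-length strings aligned on the TSS; the function name and input layout are illustrative only.

```python
def dinucleotide_profile(sequences, dinuc):
    """Count occurrences of `dinuc` at each position across TSS-aligned windows.

    sequences: list of strings of (nominally) equal length, all aligned
    so that the same index corresponds to the same offset from the TSS.
    """
    length = min(len(s) for s in sequences)
    counts = [0] * (length - 1)
    target = dinuc.upper()
    for seq in sequences:
        s = seq.upper()
        for i in range(length - 1):
            # Position i is counted when the dinucleotide starts there.
            if s[i:i + 2] == target:
                counts[i] += 1
    return counts
```

The resulting per-position counts have the same shape as the SNP density profile, so the same spectrum analysis can be applied to them for comparison.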
CpG island and periodicity
In the mammalian genome, the CpG dinucleotide is suppressed (depleted), because most cytosines in CpG are methylated at the C5 position by CpG methylase activity, and the methylated cytosine in turn tends to mutate to T by spontaneous deamination. The CpG islands (CGIs) are exceptions, where the local G+C content is high and the dinucleotides are not depleted. The island is frequently located in the vicinity of the TSS.
Periodicity of human-chimpanzee divergence
The genome diversity observable within a single species (here, human) provides limited information for assessing the variability of the genome sequence, because of factors such as the short time (2 to 3 × 10⁴ years) since the establishment of Homo sapiens and the population bottleneck. The genome divergence between closely related species, i.e., human and chimpanzee, can yield more information about genome variability, because mutations have accumulated and become fixed within each population since the separation of the two species 5 million years ago, and yet the species are close enough that the genome sequences can be reliably aligned to determine precisely the locations of the changed nucleotides. We identified the TSS regions of the chimpanzee genome by BLAT searching, using the human TSSs as query sequences. Approximately 90% (9,087) of the 10,171 human TSS regions could be aligned with confidence, and 61% (5,529) of them were associated with a CGI. A total of 400,285 divergent sites could be mapped in these regions. The spectrum analysis of nucleotide divergence between humans and chimpanzees again showed the 146-nucleotide periodicity, which was derived solely from the CGI-TSS regions (Fig. 2B, 2D and 2F).
We have shown that both the SNP density distribution and the nucleotide divergence profile between species (human and chimpanzee) are periodical around the TSSs, with a wavelength of 146 nucleotides, which is identical to the length of DNA that wraps around the nucleosome. This periodicity comes solely from the TSS regions with a CGI, and the range over which the periodicity is observed is roughly the same as that occupied by the CGIs, i.e., 0.4 kb or 2 to 3 nucleosome units. We are thus tempted to propose that the CGIs are sites where nucleosomes are tightly packed and phased to the TSSs, and that the nucleotide variability is positionally biased within the nucleosomal structure. Several previous reports support this idea: nucleosome positioning in the promoter region has been experimentally explored for a few human genes [18–21]. Among these promoters, those with a CGI were shown to be organized into a phased array of nucleosomes [18, 19], while those without a CGI carried only a single nucleosome located at some distance from their TSSs [20, 21].
One possible explanation for the intra-nucleosomal bias of nucleotide variability is a local difference in the mutation rate in germ cells. The nucleosome structure can locally affect the mutation rate, because its determinants, e.g., mutagen accessibility, depurination rate, or the efficiency of damage recognition and repair, may be position-dependent. If the periodicity of nucleotide variability is ascribed to that of the mutation rate, it follows that the CGI-TSSs are in a nucleosome-packed and phased state in the germ cell lineage, whereas the nonCGI-TSSs are not.
The CGIs were originally recognized as the characteristic promoter regions of housekeeping genes, whose expression is necessary for the maintenance of cell physiology and which are therefore widely expressed regardless of cell type. However, the distinction between housekeeping genes and non-housekeeping (or accessory) genes has been somewhat arbitrary. The germ line cells are, by definition, in an undifferentiated state (as far as the gene expression pattern is concerned), and so the genes with CGI-TSSs are likely to be expressed. It is thus plausible that CGIs are involved in the expression of genes essential for the functioning and survival of germ cells, and that an ordered nucleosome location is required for this expression.
An alternative explanation for the bias of the nucleotide variability within the nucleosome structure is the sequence constraint that acts at the individual level. The recognition sequences of various transcription factors (and so, sites of conservative pressure) may thus be located at a particular site (or side) of the nucleosomal structure. According to this scenario, some additional assumptions are needed to explain the absence of periodicity of the nucleotide variability in the nonCGI-TSSs.
The sequence features that are preferred for winding around the nucleosome have been amply described. Using experimentally derived nucleosomal DNA sequences, some dinucleotides, e.g., AA or TT, have been shown to appear at distances that are multiples of 10.1 to 10.5 nucleotides, which is the helical pitch of double-stranded DNA wrapping around the core histones [22–25]. Using the dataset of the TSS regions described here, we examined the periodicity of the distances by an autocorrelation function for the two dinucleotides mentioned above. We found a significant ten-nucleotide periodicity of TT distances in CGI-TSSs but not in nonCGI-TSSs (see Additional file 2). These periodicities were located in the region where the periodical nucleotide variability was observed (Fig. 3). These results further support the idea that CGI-TSSs, but not nonCGI-TSSs, are likely to be organized into a phased array of nucleosomes. In addition, these nucleosome positions were consistent with those identified by others as the sites of micrococcal nuclease resistance for a gene with a CGI-TSS. The distribution of the nucleosomal DNA signal detected in this study may also suggest the general chromatin architecture around CGI-TSSs.
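One simple way to look for such a distance periodicity is to count, for each distance d, how many pairs of occurrences of a dinucleotide are separated by exactly d nucleotides. The sketch below illustrates this idea; it is an assumed, simplified stand-in for the autocorrelation function used in the analysis, whose exact form is not given in the text, and the function name and `max_dist` parameter are hypothetical.

```python
def dinucleotide_distance_counts(seq, dinuc="TT", max_dist=40):
    """Count pairs of `dinuc` occurrences separated by each distance 1..max_dist.

    Returns a list `counts` where counts[d] is the number of ordered pairs
    of start positions (i, i + d) that both carry the dinucleotide.
    """
    s = seq.upper()
    target = dinuc.upper()
    hits = [i for i in range(len(s) - 1) if s[i:i + 2] == target]
    hit_set = set(hits)
    counts = [0] * (max_dist + 1)  # index 0 is unused
    for i in hits:
        for d in range(1, max_dist + 1):
            if i + d in hit_set:
                counts[d] += 1
    return counts
```

Summing these counts over all regions in a category and inspecting them for peaks near multiples of 10 nucleotides gives a rough picture of the helical-pitch signal.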
Ioshikhes et al. have shown that certain transcription factor binding sites may also show the same distance periodicity (10.1 to 10.5 nucleotides). We could not detect a statistically significant 10-nucleotide periodicity of SNP density around the TSSs (see Additional file 3).
We have shown here that nucleotide variability can be viewed as the genetic footprint of the chromatin state around TSSs. Direct evidence bearing on the two possible explanations may be provided by showing the phased and packed localization of nucleosomes around the TSSs in germ cells. This requires the development of a method that enables the spatial relationship of histone molecules with specific genomic sites to be examined at nucleotide-level resolution, and with sufficient sensitivity to detect events occurring in a single cell or a few cells, since germ cells are few in number.
We herein reported a periodical pattern of SNP density distribution around the transcription start sites (TSSs) that are associated with CpG islands (CGIs). The wavelength of the periodicity matches the length of DNA in the nucleosome. The sequence divergence of the same genomic region between humans and chimpanzees also revealed a similar periodical pattern. These results indicated that nucleotide variability can be viewed as the genetic footprint of the chromatin state around the TSSs, which has remained throughout mammalian evolutionary history.
Definition of Transcription Start Sites (TSSs)
The human promoter sequences were downloaded from the DataBase of Transcriptional Start Sites (DBTSS version 4), which contained the exact genomic positions (on the Reference Human Genome Sequence, Build 34) of the TSSs and the adjacent regions for 12,763 human genes. We re-mapped the TSSs onto the latest version of the reference sequence (i.e., Build 35) and excluded the alternative TSSs, so that a total of 10,171 TSSs were subjected to the analysis. The 6,001-nucleotide sequences between -3,000 and +3,000 nucleotides relative to the TSSs were obtained from the reference sequence. The accession numbers of the reference sequences and sequences around TSSs are available in Additional files 4, 5 and 6, respectively.
The chromosomal positions of validated SNPs around the TSSs were obtained from dbSNP (Build 124), and were then converted to positions relative to the TSS. The occurrences of validated SNPs at each nucleotide position were summed to obtain the SNP density. These data are available in Additional file 7.
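The conversion and summation can be sketched as below. This is a hedged illustration under assumed inputs: TSSs are given as hypothetical `(chromosome, position, strand)` tuples, SNPs as sorted position lists per chromosome, and positions for minus-strand genes are mirrored so that positive offsets always point downstream. The strand handling is an assumption, as the text does not describe it.

```python
from bisect import bisect_left, bisect_right


def snp_density(tss_list, snps_by_chrom, flank=3000):
    """Sum SNP occurrences at each offset from -flank to +flank around TSSs.

    tss_list: iterable of (chrom, tss_position, strand) tuples, strand "+" or "-".
    snps_by_chrom: dict mapping chrom -> sorted list of SNP positions.
    Returns a list of length 2 * flank + 1; index `flank` is the TSS itself.
    """
    density = [0] * (2 * flank + 1)
    for chrom, tss, strand in tss_list:
        positions = snps_by_chrom.get(chrom, [])
        # Binary search restricts the scan to SNPs inside the window.
        lo = bisect_left(positions, tss - flank)
        hi = bisect_right(positions, tss + flank)
        for pos in positions[lo:hi]:
            # Mirror coordinates for minus-strand genes (assumed convention).
            rel = pos - tss if strand == "+" else tss - pos
            density[rel + flank] += 1
    return density
```

With `flank=3000` the output has 6,001 entries, matching the window length used for the TSS regions.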
The position-dependent frequency content and power spectra were calculated by the windowed Fast Fourier Transform using MATLAB® (The MathWorks, Inc., USA). Sliding windows (1,024 nucleotides) at a step of 100 nucleotides were adopted to provide an appropriate balance between the spatial scope and the frequency resolution of the analysis. The Hanning function was applied to each window to avoid an artifact of discontinuity at the ends. The MATLAB scripts and input files for each category of TSSs (i.e., all TSSs, CGI-TSSs, and nonCGI-TSSs) are available as Additional file 8.
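The windowed transform can be illustrated with the following sketch, written in Python rather than MATLAB and using only the standard library. It is not the authors' script: the small radix-2 FFT stands in for a library routine, and the mean subtraction and the power normalization are assumptions. Note that with a 1,024-nucleotide window, frequency bin k corresponds to a period of 1,024/k nucleotides, so bin 7 corresponds to about 146.3 nucleotides.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def windowed_power_spectra(signal, window=1024, step=100):
    """Return (start, power) pairs, one per sliding window.

    power[k] is the power at frequency bin k (period = window / k).
    """
    # Hanning taper to suppress the discontinuity at the window ends.
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (window - 1))
            for i in range(window)]
    spectra = []
    for start in range(0, len(signal) - window + 1, step):
        chunk = signal[start:start + window]
        mean = sum(chunk) / window  # remove the constant offset (assumption)
        tapered = [(v - mean) * w for v, w in zip(chunk, hann)]
        spectrum = fft(tapered)
        # Keep the non-redundant half of the spectrum.
        power = [abs(c) ** 2 / window for c in spectrum[:window // 2 + 1]]
        spectra.append((start, power))
    return spectra
```

For a 6,001-nucleotide profile this yields 50 windows, each giving the power as a function of period at a different position relative to the TSS.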
Statistical evaluation of periodicity
The power spectra were evaluated against the occurrence of the periodicity in 10³ random data sets. Each set was composed of N random sequences of 6,001 nucleotides picked from random positions in the reference sequence (Build 35), where N is the number of TSS regions in each category. The mean and standard deviation of the power were obtained for each periodicity. The 99% confidence interval was taken to be 2.58 times the standard deviation (Z-score = 2.58) from the mean of the distribution of the power values.
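The threshold calculation can be sketched as follows. The sketch assumes that each random data set has already been reduced to a power spectrum of the same length, uses the population standard deviation, and treats an observed power as significant when it exceeds the upper bound; these choices and the function names are assumptions, not details given in the text.

```python
from statistics import mean, pstdev


def significance_thresholds(random_powers, z=2.58):
    """Per-frequency upper bound: mean + z * SD over the random data sets.

    random_powers: list of power spectra (lists of equal length), one per
    random data set.
    """
    n_bins = len(random_powers[0])
    thresholds = []
    for k in range(n_bins):
        values = [spectrum[k] for spectrum in random_powers]
        thresholds.append(mean(values) + z * pstdev(values))
    return thresholds


def significant_bins(observed_power, thresholds):
    """Indices of frequency bins whose observed power exceeds the threshold."""
    return [k for k, (p, t) in enumerate(zip(observed_power, thresholds))
            if p > t]
```

Applying `significant_bins` to the spectrum of a TSS category then lists the periodicities that stand out from the random background.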
Sequence alignment of TSSs between human and chimpanzee
TSS regions of the chimpanzee genome were obtained by a BLAT search (version 32) of the genome using the 10,171 human TSS regions as queries. Alignments with a score below 400 were excluded. Most of the human TSSs (10,099/10,171) could be aligned with the chimpanzee genome. To eliminate alignment errors when determining mismatch positions in sequence pairs with insertions/deletions, the alignment was limited to the region between the TSS and the most proximal gaps of more than two nucleotides. Consequently, the number of aligned sequences varied depending on the position, from 9,087 at the TSS to 2,738 and 3,243 at -3,000 and +3,000, respectively. The mismatch frequency at each nucleotide position was taken as the nucleotide divergence between the human and chimpanzee TSS regions.
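The gap-limited mismatch scan can be sketched as below. The sketch assumes a pairwise alignment given as two equal-length strings with "-" for gaps and a known alignment column for the TSS; the function name, the input layout, and the treatment of short gaps (skipped rather than counted) are assumptions for illustration.

```python
def mismatches_near_tss(human_aln, chimp_aln, tss_col, max_gap=2):
    """Scan outward from the TSS until a gap run longer than `max_gap`.

    Returns (covered, mismatched): sets of human positions relative to the
    TSS that are aligned to a chimpanzee base, and the subset that differ.
    """
    n = len(human_aln)
    assert n == len(chimp_aln)
    # Mark columns that belong to gap runs longer than max_gap.
    blocked = [False] * n
    i = 0
    while i < n:
        if human_aln[i] == "-" or chimp_aln[i] == "-":
            j = i
            while j < n and (human_aln[j] == "-" or chimp_aln[j] == "-"):
                j += 1
            if j - i > max_gap:
                blocked[i:j] = [True] * (j - i)
            i = j
        else:
            i += 1
    covered, mismatched = set(), set()
    for direction in (1, -1):  # downstream, then upstream
        rel, col = 0, tss_col
        while 0 <= col < n and not blocked[col]:
            h, c = human_aln[col], chimp_aln[col]
            if h != "-":  # only human bases advance the coordinate
                if c != "-":
                    covered.add(rel)
                    if h.upper() != c.upper():
                        mismatched.add(rel)
                rel += direction
            col += direction
    return covered, mismatched
```

Because each alignment stops at its own nearest long gap, the number of sequences contributing to a given position falls with distance from the TSS, which is why the divergence at each position is expressed as a frequency over the sequences covering it.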
We thank Drs. S. Sugano, Y. Suzuki and R. Yamashita of Tokyo University for kindly providing the information on TSSs, Dr. H. Tachida of Kyushu University for advice and discussions about the nucleotide divergence between the human and chimpanzee TSS regions, and Dr. Tomoko Tahira for the critical reading of and comments on this manuscript. This work was supported by Grants-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.
- Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003, 33 (Suppl): 228-237. 10.1038/ng1090.
- Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 2004, 74 (1): 106-120. 10.1086/381000.
- Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001, 409 (6822): 928-933. 10.1038/35057149.
- Stephens JC, Schneider JA, Tanguay DA, Choi J, Acharya T, Stanley SE, Jiang R, Messer CJ, Chew A, Han JH: Haplotype variation and linkage disequilibrium in 313 human genes. Science. 2001, 293 (5529): 489-493. 10.1126/science.1059431.
- Nei M, Jin L: Variances of the average numbers of nucleotide substitutions within and between populations. Mol Biol Evol. 1989, 6 (3): 290-300.
- Tahira T, Baba S, Higasa K, Kukita Y, Suzuki Y, Sugano S, Hayashi K: dbQSNP: a database of SNPs in human promoter regions with allele frequency information determined by single-strand conformation polymorphism-based methods. Hum Mutat. 2005, 26: 69-77. 10.1002/humu.20196.
- dbQSNP Database. [http://qsnp.gen.kyushu-u.ac.jp]
- Suzuki Y, Yamashita R, Nakai K, Sugano S: DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 2002, 30 (1): 328-331. 10.1093/nar/30.1.328.
- Suzuki Y, Yamashita R, Sugano S, Nakai K: DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nucleic Acids Res. 2004, 32 (Database issue): D78-D81. 10.1093/nar/gkh076.
- Suzuki Y, Taira H, Tsunoda T, Mizushima-Sugano J, Sese J, Hata H, Ota T, Isogai T, Tanaka T, Morishita S, Okubo K, Sakaki Y, Nakamura Y, Suyama A, Sugano S: Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO reports. 2001, 2 (5): 388-393.
- Bird AP: CpG-rich islands and the function of DNA methylation. Nature. 1986, 321 (6067): 209-213. 10.1038/321209a0.
- Larsen F, Gundersen G, Lopez R, Prydz H: CpG islands as gene markers in the human genome. Genomics. 1992, 13 (4): 1095-1107. 10.1016/0888-7543(92)90024-M.
- Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.
- EMBOSS. [http://emboss.sourceforge.net/]
- Yamashita R, Suzuki Y, Sugano S, Nakai K: Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity. Gene. 2005, 350 (2): 129-136. 10.1016/j.gene.2005.01.012.
- Kent WJ: BLAT – the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664. 10.1101/gr.229202. Article published online before March 2002.
- Deroo BJ, Archer TK: Glucocorticoid receptor activation of the IκBα promoter within chromatin. Mol Biol Cell. 2001, 12 (11): 3365-3374.
- Levy-Wilson B, Fortier C, Blackhart BD, McCarthy BJ: DNase I- and micrococcal nuclease-hypersensitive sites in the human apolipoprotein B gene are tissue specific. Mol Cell Biol. 1988, 8 (1): 71-80.
- Agalioti T, Lomvardas S, Parekh B, Yie J, Maniatis T, Thanos D: Ordered recruitment of chromatin modifying and general transcription factors to the IFN-β promoter. Cell. 2000, 103 (4): 667-678. 10.1016/S0092-8674(00)00169-0.
- Sewack GF, Hansen U: Nucleosome positioning and transcription-associated chromatin alterations on the human estrogen-responsive pS2 promoter. J Biol Chem. 1997, 272 (49): 31118-31129. 10.1074/jbc.272.49.31118.
- Kogan S, Trifonov EN: Gene splice sites correlate with nucleosome positions. Gene. 2005, 352: 57-62. 10.1016/j.gene.2005.03.004.
- Ioshikhes I, Bolshoy A, Derenshteyn K, Borodovsky M, Trifonov EN: Nucleosome DNA sequence pattern revealed by multiple alignment of experimentally mapped sequences. J Mol Biol. 1996, 262 (2): 129-139. 10.1006/jmbi.1996.0503.
- Bolshoy A: CC dinucleotides contribute to the bending of DNA in chromatin. Nat Struct Biol. 1995, 2 (6): 446-448. 10.1038/nsb0695-446.
- Satchwell SC, Drew HR, Travers AA: Sequence periodicity in chicken nucleosome core DNA. J Mol Biol. 1986, 191 (4): 659-675. 10.1016/0022-2836(86)90452-3.
- Ioshikhes I, Trifonov EN, Zhang MQ: Periodical distribution of transcription factor sites in promoter regions and connection with chromatin structure. Proc Natl Acad Sci USA. 1999, 96 (6): 2891-2895. 10.1073/pnas.96.6.2891.
- DataBase of Transcriptional Start Sites (DBTSS). [http://dbtss.hgc.jp/]
- NCBI Reference Sequence Databases. [ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/]
- NCBI dbSNP. [http://www.ncbi.nih.gov/SNP/]
- BLAT. [http://www.soe.ucsc.edu/~kent/exe/linux/]
- UCSC Genome Browser. [http://hgdownload.cse.ucsc.edu/goldenPath/panTro1/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.