Open Access

Periodicity of SNP distribution around transcription start sites

BMC Genomics 2006, 7:66

DOI: 10.1186/1471-2164-7-66

Received: 13 October 2005

Accepted: 03 April 2006

Published: 03 April 2006



Several million single nucleotide polymorphisms (SNPs) have already been collected and deposited in public databases. They are important resources not only as markers for identifying disease-associated genes, but also for understanding the mechanisms that underlie genome diversification.


A spectrum analysis of SNP density distribution in the genomic regions around transcription start sites (TSSs) revealed a remarkable periodicity of 146 nucleotides. This periodicity was observed in the regions that were associated with CpG islands (CGIs), but not in the regions without CpG islands (nonCGIs). An analysis of the sequence divergence of the same genomic regions between humans and chimpanzees also revealed a similar periodical pattern in CGI. The occurrences of any mono- or di-nucleotide sequences in these regions did not reveal such a periodicity, thus indicating that an interpretation of this periodicity solely based on the sequence-dependent susceptibility to mutation is highly unlikely.


The periodical patterns of nucleotide variability suggest the location of nucleosomes that are phased at TSS, and can be viewed as the genetic footprint of the chromatin state that has been maintained throughout mammalian evolutionary history. The results suggest the possible involvement of the nucleosome structure in the promoter function, and also a fundamental functional/structural difference between the two promoter classes, i.e., those with and without CGIs.


Several million single nucleotide polymorphisms (SNPs) have already been collected and deposited in public databases [1]. They are important resources not only as markers for identifying disease-associated genes [2], but also for understanding the mechanisms that underlie the diversification of the organism. The nucleotide diversity of the human genome sequence appears to fluctuate from region to region [3–5]. The majority of SNPs are believed to have no biological consequence, and their diversity is therefore primarily determined by the mutation rate within the germ cells, although it may be affected by the selective pressure that operates at the individual level [6]. In this study, we used a spectral analysis approach to identify the pattern of nucleotide variability around the transcription start sites (TSSs) and to survey its biological implications.


Nucleotide diversity

We first noticed a periodicity of the nucleotide diversity around TSSs using the genotype data obtained from the dbQSNP database (version 11) [7], in which approximately 10⁴ SNPs located within the 1.2 kb promoter regions of 4 × 10³ genes have been identified and mapped on the Reference Human Genome Sequence [8]. These SNPs were discovered by re-sequencing the DNA of eight individuals. The database describes all data, including the regions without detectable SNPs. Thus, the per-nucleotide diversity (π) at each nucleotide position relative to the TSS can be estimated by aligning the examined sequences at the TSS, since the number of individuals examined is known [6]. A striking feature of the distribution of π was its waviness (data not shown). We expanded the analysis using the TSS regions described in DBTSS, in which approximately 1.3 × 10⁴ TSSs have been identified by mapping the 5'-end sequences of more than 4 × 10⁵ full-length cDNA clones onto the genome [9, 10]. To avoid overrepresentation of genes with multiple promoters, we further selected 10,171 sites, which were the most frequently used TSSs for each gene (the genomic site with the largest number of 5' ends for each gene defined in DBTSS). TSS regions, i.e., the sequences extending 3 kb in both directions from the start sites, were collected from the reference human genome sequence, and the 97,041 validated SNPs (defined in dbSNP) that fell in these regions were mapped (approximately 1 SNP per 600 nucleotides). Next, the SNPs at each nucleotide position (relative to the TSS) were counted to obtain the distribution of the SNP density around the TSS (Fig. 1). In this case, the per-nucleotide diversity could not be estimated, because the number of chromosomes examined to find the SNPs is unknown. However, the SNP density can be regarded as an indicator of the nucleotide diversity, since an ascertainment bias is unlikely to affect the local distribution of SNPs at this resolution. This distribution was also wavy, similar to the per-nucleotide diversity described above.
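The counting step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `snp_density`, the input layout (a list of TSS tuples and a per-chromosome dictionary of SNP coordinates) and the strand handling are all assumptions made for the example.

```python
from collections import Counter

def snp_density(tss_sites, snp_positions, flank=3000):
    """Per-gene SNP density at each offset from -flank to +flank.

    tss_sites: list of (chrom, position, strand) tuples, strand '+' or '-'
               (hypothetical input layout).
    snp_positions: dict mapping chrom -> iterable of SNP coordinates.
    Returns a list of length 2 * flank + 1; index 0 is offset -flank.
    """
    if not tss_sites:
        return [0.0] * (2 * flank + 1)
    counts = Counter()
    for chrom, tss, strand in tss_sites:
        for snp in snp_positions.get(chrom, ()):
            # Assumption: offsets are measured in the direction of transcription.
            offset = snp - tss if strand == "+" else tss - snp
            if -flank <= offset <= flank:
                counts[offset] += 1
    n = len(tss_sites)
    return [counts[o] / n for o in range(-flank, flank + 1)]
```

The brute-force inner loop keeps the sketch short; for genome-scale input, sorted SNP coordinates with a binary search per TSS would be the more practical choice.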
Figure 1

Distribution of SNPs around TSSs. The distribution of the density of validated SNPs (no. of vSNPs per gene) at the positions relative to the TSSs of 10,171 genes is shown (gray). Noise filtering was performed using FFT. After the SNP density data were transformed to the frequency domain by means of an FFT, a one-sided low-pass Hanning filter for components below 50 nucleotides was applied. The denoised curve was obtained by the inverse FFT of the filtered array (magenta).

Spectrum analysis

A spectrum analysis by Fast Fourier transformation (FFT) of the SNP density distribution revealed a remarkable periodicity around the TSSs, with the most conspicuous peak (power value 1.7 × 10⁻³) at a wavelength of 146 nucleotides. This periodicity persisted at positions ranging approximately between -200 and +200 (Fig. 2A). To determine the statistical significance of this periodicity, we estimated the mean and standard deviation of the power values for each wavelength in random spectra from data sets of equivalent size. Namely, we carried out 10³ simulations, each consisting of the distribution of validated SNPs around 10,171 randomly chosen genomic sites. As shown in Figure 2A, the power of the peak at 146 nucleotides was statistically significant, since the power value fell far outside three times the standard deviation of the power for the random sites.
Figure 2

Spectrum analysis by Fast Fourier transformation. Spectra of the distributions of SNP density (A, C, and E) and of nucleotide divergence between humans and chimpanzees (B, D, and F) for three TSS categories: all TSSs (A and B), CGI-TSSs (C and D) and nonCGI-TSSs (E and F). The side view and the sectional view at the 146-nucleotide periodicity of the FFT diagrams are shown on the left and top of the diagram panels, respectively. The magenta and red lines are the means and the 99% confidence intervals of the power values. The numbers of sequences analyzed are 10,171 (A), 6,329 (C), and 3,842 (E). The diagrams and side views of SNP density (A, C and E) are dynamically colored according to the Z-scores, while those of divergence (B, D and F) are colored according to the power in arbitrary units, which is the square of the coefficients of the trigonometric terms in the FFT. The color range goes from blue to red, corresponding to Z-scores of 0 to 25 for SNP density and to power values of 0 to 3 (a.u.) for divergence.

In many genes, the position of the TSS fluctuates to some degree [11], and this fluctuation could cancel out the effects of phasing to TSSs. We thus evaluated the periodicity in two classes of genes: those with small fluctuations and those with large fluctuations. We defined the extent of fluctuation as follows. We chose 4,660 genes for which more than 10 oligo-capped clones were mapped. They were then divided into two halves according to the start site fluctuation, which was estimated by the standard deviation of the start positions. A spectrum analysis of the SNP density distribution of these two classes revealed a stronger signal at the 146 bp periodicity for the TSSs with small fluctuations (4.5 × 10⁻³) than for those with large fluctuations (1.5 × 10⁻³).
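As an illustration of this classification, the sketch below computes the standard deviation of the mapped 5'-end positions per gene and splits the genes at the median. The function name, the input dictionary, the use of the population standard deviation and the assignment of ties to the "small" class are assumptions; the text does not specify these details.

```python
from statistics import median, pstdev

def split_by_tss_fluctuation(starts_by_gene, min_clones=10):
    """Split genes into (small, large) fluctuation classes.

    starts_by_gene: dict mapping gene -> list of mapped 5'-end positions
                    (hypothetical input layout).
    Only genes with more than min_clones mapped clones are kept.
    """
    sd = {gene: pstdev(pos) for gene, pos in starts_by_gene.items()
          if len(pos) > min_clones}
    cut = median(sd.values())
    # Assumption: genes exactly at the median go to the "small" class.
    small = sorted(g for g, s in sd.items() if s <= cut)
    large = sorted(g for g, s in sd.items() if s > cut)
    return small, large
```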

This periodical distribution of SNP density raised the question of whether it might be caused by the sequence features around TSSs. We examined the distribution of each of the four nucleotides and of all 16 dinucleotides, including the CpG dinucleotide, which is known to have a higher mutation rate (see Additional file 1). We found various periodical distributions for some mono- and di-nucleotide sequences, but none of them showed the 146-nucleotide periodicity. Sequence-dependent susceptibility to mutation is thus unlikely to explain the periodicity of the diversity profile.

CpG island and periodicity

In the mammalian genome, the CpG dinucleotide is suppressed (depleted), because most of the C residues in CpG are methylated at the C5 position by CpG methylase activity, and the methylated C in turn tends to be mutated to T by spontaneous deamination. The CpG islands (CGIs) are exceptions: their local C/G content is high, and the dinucleotide is not depleted [12]. The islands are frequently located in the vicinity of TSSs [13].

We mapped CGIs within the 6 kb TSS regions with the NewCpGreport program (EMBOSS v.2.10.0 package) [14, 15] using default parameters, i.e., a C/G content greater than 50%, an observed/expected ratio of CpG occurrence greater than 0.6 and an island size longer than 200 nucleotides. The regions were then classified into two groups according to whether the TSS lay within a CGI (CGI-TSSs) or not (nonCGI-TSSs). Among the 10,171 TSSs, 65% were CGI-TSSs (Fig. 3), which closely agreed with the previously reported values [16]. A spectrum analysis of the SNP density distribution of the two classes of regions revealed a stronger signal (2.5 × 10⁻³) at the 146-nucleotide periodicity for the CGI-TSSs, but none for the nonCGI-TSSs (Fig. 2C and 2E). We also noticed that the range of the genome around the TSS covered by CGIs roughly matched the range where the 146-nucleotide periodicity of SNP density is observed (Fig. 3).
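The three thresholds quoted above can be made concrete with a small check on a single candidate sequence. This is only a sketch under stated assumptions: it is not the NewCpGreport algorithm (which scans the sequence in windows), the function names are hypothetical, and the observed/expected ratio uses the commonly used definition (number of CpG × length) / (number of C × number of G).

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")      # "CG" cannot overlap itself
    expected = c * g / n            # expectation under independence
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio

def looks_like_cgi(seq, min_gc=0.5, min_ratio=0.6, min_len=200):
    """Apply the three thresholds from the text to one candidate region."""
    gc, ratio = cpg_stats(seq)
    return len(seq) > min_len and gc > min_gc and ratio > min_ratio
```

For example, a 300-nucleotide run of alternating C and G passes all three thresholds, whereas an A/T-only sequence of the same length fails on C/G content.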
Figure 3

Co-localization of CpG island and the 146 nucleotides periodicity. Occupancy of CpG islands (solid line, scale on the left) and the power of the 146 nucleotides periodicity of SNP density (dashed line, scale on the right) around the TSSs are shown. a.u., power in arbitrary units.

Periodicity of human-chimpanzee divergence

The genome diversity observable within one species (i.e., human) is of limited use for assessing the variability of the genome sequence, because of various factors such as the short time since the establishment of Homo sapiens (2 to 3 × 10⁴ years ago) and the population bottleneck. On the other hand, the genome divergence between closely related species, i.e., human and chimpanzee, can yield more information about genome variability, because mutations have accumulated and become fixed within each population since the separation of the two species 5 million years ago, and yet the species are close enough that the genome sequences can be reliably aligned to determine precisely the locations of the changed nucleotides. We identified the TSS regions of the chimpanzee genome by BLAT [17] searches using the human TSS regions as query sequences. Approximately 90% (9,087) of the 10,171 human TSS regions could be aligned with confidence, and 61% (5,529) of them were associated with CGIs. A total of 400,285 divergent sites could be mapped in these regions. The spectrum analysis of nucleotide divergence between humans and chimpanzees again showed the 146-nucleotide periodicity, which was derived solely from the CGI-TSS regions (Fig. 2B, 2D and 2F).


We have shown that both the SNP density distribution and the nucleotide divergence profile between species (human and chimpanzee) are periodical around the TSSs, with a wavelength of 146 nucleotides, which is identical to the length of DNA that wraps around the nucleosome. This periodicity comes solely from the TSS regions with CGIs, and the range over which the periodicity is observed is roughly the same as that occupied by the CGIs, i.e., 0.4 kb or 2 to 3 nucleosome units. We are thus tempted to propose that the CGIs are sites where nucleosomes are tightly packed and phased to the TSSs, and that the nucleotide variability is positionally biased within the nucleosomal structure. Several previous reports support this idea: nucleosome positioning in the promoter region has been experimentally explored for a few human genes [18–21]. Among these promoters, those with CGIs were shown to be organized into a phased array of nucleosomes [18, 19], while those without CGIs carried only a single nucleosome located at some distance from their TSSs [20, 21].

One possible explanation for the intra-nucleosomal bias of nucleotide variability is a local difference in the mutation rate in germ cells. The nucleosome structure can locally affect the mutation rate, because its determinants, e.g., mutagen accessibility, depurination rate, or the efficiency of damage recognition and repair, may be position-dependent. If the periodicity of nucleotide variability is ascribed to that of the mutation rate, it follows that, in the germ cell lineage, the CGI-TSSs are in a nucleosome-packed and phased state, but the nonCGI-TSSs are not.

The CGIs have originally been recognized to be the characteristic promoter regions of housekeeping genes, whose expression is necessary for the maintenance of cell physiology, and so, are widely expressed regardless of the cell types. However, the distinction between housekeeping genes vs. non-housekeeping (or accessory) genes has been somewhat arbitrary. The germ line cells are, by definition, in an undifferentiated state (as far as gene expression pattern is concerned), and so, the genes with CGI-TSSs are likely to be expressed. It is thus a plausible idea that CGIs are involved in the expression of genes essential for the functioning and survival of germ cells, and that an ordered nucleosome location is required for this expression.

An alternative explanation for the bias of the nucleotide variability within the nucleosome structure is the sequence constraint that acts at the individual level. The recognition sequences of various transcription factors (and so, sites of conservative pressure) may thus be located at a particular site (or side) of the nucleosomal structure. According to this scenario, some additional assumptions are needed to explain the absence of periodicity of the nucleotide variability in the nonCGI-TSSs.

The sequence features that are preferred for winding around the nucleosome have been amply described. Using experimentally derived nucleosomal DNA sequences, some dinucleotides, e.g., AA or TT, have been shown to appear at distances that are multiples of 10.1 to 10.5 nucleotides, which is the pitch of the helix of double-stranded DNA wrapping around the core histones [22–25]. Using the dataset of the TSS regions described here, we examined the periodicity of the distances by an autocorrelation function [24] for the two dinucleotides mentioned above. We found a significant ten-nucleotide periodicity for TT distances in CGI-TSSs but not in nonCGI-TSSs (see Additional file 2). This periodicity was located in the region where the periodical nucleotide variability was observed (Fig. 3). These results further support the idea that CGI-TSSs, but not nonCGI-TSSs, are likely to be organized into a phased array of nucleosomes. In addition, these nucleosome positions were consistent with those identified by others as sites of micrococcal nuclease resistance for a gene with a CGI-TSS [18]. The distribution of the nucleosomal DNA signal detected in this study may also suggest the general chromatin architecture around CGI-TSSs.
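A distance-based autocorrelation of this kind can be illustrated by counting how often two occurrences of a dinucleotide lie a given distance apart. The sketch below is a simplified stand-in, not the function of reference [24]; the function name and the `max_dist` cutoff are assumptions, and no normalization against a background expectation is attempted.

```python
def dinucleotide_distance_counts(seq, dinuc="TT", max_dist=60):
    """Count pairs of `dinuc` occurrences separated by each distance.

    Returns a list where index d holds the number of occurrence pairs
    whose start positions are exactly d nucleotides apart (d <= max_dist).
    """
    seq = seq.upper()
    positions = [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]
    counts = [0] * (max_dist + 1)
    for idx, first in enumerate(positions):
        for second in positions[idx + 1:]:
            distance = second - first
            if distance > max_dist:
                break  # positions are sorted, so later ones are farther away
            counts[distance] += 1
    return counts
```

A sequence with a roughly ten-nucleotide preference would show raised counts near distances of 10, 20, 30 and so on; summing such profiles over many TSS regions is one plausible way to obtain a curve whose periodicity can then be tested.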

Ioshikhes et al. have shown that certain transcription factor binding sites may also show the same distance periodicity (10.1 – 10.5 nucleotides) [26]. However, we could not detect a statistically significant 10-nucleotide periodicity of SNP density around TSSs (see Additional file 3).

We showed here that nucleotide variability can be viewed as the genetic footprint of the chromatin state around TSSs. Direct evidence for either of the two possible explanations discussed here may be provided by showing the phased and packed localization of nucleosomes around the TSSs in germ cells. This requires the development of a method that can examine the spatial relationship of histone molecules with specific genomic sites at nucleotide-level resolution, and with enough sensitivity to detect events occurring in a single cell or a few cells, since germ cells are few in number.


We herein reported a periodical pattern of SNP density distribution around the transcription start sites (TSSs) that are associated with CpG islands (CGIs). The wavelength of the periodicity matches the length of DNA in the nucleosome. The sequence divergence of the same genomic region between humans and chimpanzees also revealed a similar periodical pattern. These results indicated that nucleotide variability can be viewed as the genetic footprint of the chromatin state around the TSSs, which has remained throughout mammalian evolutionary history.


Definition of Transcription Start Sites (TSSs)

The human promoter sequences were downloaded from the DataBase of Transcriptional Start Sites (DBTSS version 4) [27], which contained the exact genomic positions (on the Reference Human Genome Sequence, Build 34) of the TSSs and the adjacent regions for 12,763 human genes. We re-mapped the TSSs onto the latest version of the reference sequence (i.e., Build 35) [28] and excluded the alternative TSSs, so that a total of 10,171 TSSs were subjected to the analysis. The 6,001-nucleotide sequences between -3,000 and +3,000 nucleotides relative to the TSSs were obtained from the reference sequence. The accession numbers of the reference sequences and sequences around TSSs are available in Additional files 4, 5 and 6, respectively.

Polymorphism data

The chromosomal positions of validated SNPs around the TSSs were obtained from dbSNP (Build 124) [29] and were then converted to positions relative to the TSS. The occurrences of validated SNPs at each nucleotide position were summed to obtain the SNP density. These data are available in Additional file 7.

Spectrum analysis

The position-dependent frequency content and power spectra were calculated by the windowed Fast Fourier Transform using MATLAB® (The MathWorks, Inc., USA). Sliding windows (1,024 nucleotides) at a step of 100 nucleotides were adopted to provide an appropriate balance between the spatial scope and the frequency resolution of the analysis. The Hanning function was applied to each window to avoid an artifact of discontinuity at the ends. The MATLAB scripts and input files for each category of TSSs (i.e., all TSSs, CGI-TSSs, and nonCGI-TSSs) are available as Additional file 8.
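The windowed transform can be illustrated with a short sketch that evaluates a single Fourier component per window. It is not the authors' MATLAB script (that is in Additional file 8): the function names, the mean subtraction and the normalization by window length are assumptions. With a 1,024-nucleotide window, Fourier component k has a period of 1,024 / k nucleotides, so k = 7 corresponds to about 146.3 nucleotides, which is presumably the component reported as the 146-nucleotide periodicity.

```python
import cmath
import math

def hann(n):
    """Symmetric Hann (Hanning) window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_at_bin(segment, k):
    """Power of Fourier component k of a Hann-windowed segment."""
    n = len(segment)
    window = hann(n)
    mean = sum(segment) / n  # assumption: remove the constant offset first
    coeff = sum((x - mean) * w * cmath.exp(-2j * math.pi * k * i / n)
                for i, (x, w) in enumerate(zip(segment, window)))
    return abs(coeff / n) ** 2

def sliding_power(density, k=7, window=1024, step=100):
    """(window start, power) pairs; k=7 with window=1024 is ~146.3 nt."""
    return [(start, power_at_bin(density[start:start + window], k))
            for start in range(0, len(density) - window + 1, step)]
```

Evaluating one component directly keeps the example dependency-free; a full spectrum per window would normally be computed with an FFT routine instead.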

Statistical evaluation of periodicity

The power spectra were evaluated against the occurrence of the periodicity in 10³ random data sets. Each set was composed of N random sequences of length 6,001 nucleotides picked from random positions in the reference sequence (Build 35), where N is the number of TSS regions in each category. The mean and standard deviation of the power were obtained for each periodicity. The 99% confidence interval was assumed to be 2.58 times the standard deviation (Z-score = 2.58) from the mean in the distributions of the power values.
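The evaluation above amounts to a Z-score of the observed power against the powers from the random data sets. A minimal sketch follows; the function names are hypothetical, and the use of the population standard deviation and of a one-sided comparison are assumptions not stated in the text.

```python
from statistics import mean, pstdev

def z_score(observed_power, random_powers):
    """Z-score of an observed power against powers from random data sets."""
    spread = pstdev(random_powers)
    if spread == 0:
        raise ValueError("random powers have zero variance")
    return (observed_power - mean(random_powers)) / spread

def exceeds_99_percent(observed_power, random_powers, z_cut=2.58):
    """True when the observed power lies above mean + z_cut * SD."""
    return z_score(observed_power, random_powers) > z_cut
```

In use, `random_powers` would hold the power at one periodicity from each of the 10³ simulated data sets, and the check would be repeated for every periodicity of interest.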

Sequence alignment of TSSs between human and chimpanzee

TSS regions of the chimpanzee genome were obtained by BLAT search (version 32) [30] of the genome [31] using the 10,171 human TSS regions as query. Alignments with the score below 400 were excluded. Most of the human TSSs (10,099/10,171) could be aligned with the chimpanzee genome. To eliminate the alignment errors in the process of determination of mismatch positions in sequence pairs with insertions/deletions, the alignment was limited between the TSSs and most proximal gaps of more than two nucleotides. Consequently, the number of aligned sequences varied depending on the positions, from 9,087 at TSS to 2,738 and 3,243 at -3,000 and +3,000, respectively. The mismatch frequencies at each nucleotide position were assumed to be the nucleotide divergence between the human and chimpanzee TSS regions.
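The gap-limited mismatch counting can be sketched as follows for one pairwise alignment. This is an illustration under assumptions rather than the authors' procedure: the two rows are taken to be equal-length, upper-case strings with "-" for gaps, `tss_col` is the alignment column of the TSS, offsets are reported in alignment columns (not corrected for short gaps), and columns containing a short gap are simply skipped.

```python
import re

def usable_span(human, chimp, tss_col, max_gap=2):
    """Columns [lo, hi) around tss_col bounded by the nearest long gaps.

    A long gap is a run of more than max_gap '-' characters in either row.
    """
    lo, hi = 0, len(human)
    long_gap = re.compile(r"-{%d,}" % (max_gap + 1))
    for row in (human, chimp):
        for match in long_gap.finditer(row):
            if match.end() <= tss_col:
                lo = max(lo, match.end())      # gap upstream of the TSS
            elif match.start() > tss_col:
                hi = min(hi, match.start())    # gap downstream of the TSS
            else:
                return tss_col, tss_col        # TSS lies inside a long gap
    return lo, hi

def mismatch_offsets(human, chimp, tss_col, max_gap=2):
    """Offsets (relative to tss_col) of mismatches inside the usable span."""
    lo, hi = usable_span(human, chimp, tss_col, max_gap)
    return [i - tss_col for i in range(lo, hi)
            if "-" not in (human[i], chimp[i]) and human[i] != chimp[i]]
```

Because each alignment contributes only the stretch between its nearest long gaps, the number of sequences covering a given position decreases with distance from the TSS, which is consistent with the counts reported above.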



We thank Drs. S. Sugano, Y. Suzuki and R. Yamashita of Tokyo University for kindly providing the information on TSSs, Dr. H. Tachida of Kyushu University for advice and discussions about the nucleotide divergence between the human and chimpanzee TSS regions, and Dr. Tomoko Tahira for the critical reading and comments of this manuscript. This work was supported by Grants-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology, Japan.

Authors’ Affiliations

Division of Genome Analysis, Research Center for Genetic Information, Medical Institute of Bioregulation, Kyushu University


  1. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.
  2. Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003, 33 (Suppl): 228-237. 10.1038/ng1090.
  3. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 2004, 74 (1): 106-120. 10.1086/381000.
  4. Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001, 409 (6822): 928-933. 10.1038/35057149.
  5. Stephens JC, Schneider JA, Tanguay DA, Choi J, Acharya T, Stanley SE, Jiang R, Messer CJ, Chew A, Han JH: Haplotype variation and linkage disequilibrium in 313 human genes. Science. 2001, 293 (5529): 489-493. 10.1126/science.1059431.
  6. Nei M, Jin L: Variances of the average numbers of nucleotide substitutions within and between populations. Mol Biol Evol. 1989, 6 (3): 290-300.
  7. Tahira T, Baba S, Higasa K, Kukita Y, Suzuki Y, Sugano S, Hayashi K: dbQSNP: a database of SNPs in human promoter regions with allele frequency information determined by single-strand conformation polymorphism-based methods. Hum Mutat. 2005, 26: 69-77. 10.1002/humu.20196.
  8. dbQSNP Database. []
  9. Suzuki Y, Yamashita R, Nakai K, Sugano S: DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 2002, 30 (1): 328-331. 10.1093/nar/30.1.328.
  10. Suzuki Y, Yamashita R, Sugano S, Nakai K: DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nucleic Acids Res. 2004, 32 (Database issue): D78-D81. 10.1093/nar/gkh076.
  11. Suzuki Y, Taira H, Tsunoda T, Mizushima-Sugano J, Sese J, Hata H, Ota T, Isogai T, Tanaka T, Morishita S, Okubo K, Sakaki Y, Nakamura Y, Suyama A, Sugano S: Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO reports. 2001, 2 (5): 388-393.
  12. Bird AP: CpG-rich islands and the function of DNA methylation. Nature. 1986, 321 (6067): 209-213. 10.1038/321209a0.
  13. Larsen F, Gundersen G, Lopez R, Prydz H: CpG islands as gene markers in the human genome. Genomics. 1992, 13 (4): 1095-1107. 10.1016/0888-7543(92)90024-M.
  14. Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.
  15. EMBOSS. []
  16. Yamashita R, Suzuki Y, Sugano S, Nakai K: Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity. Gene. 2005, 350 (2): 129-136. 10.1016/j.gene.2005.01.012.
  17. Kent WJ: BLAT – the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664. 10.1101/gr.229202. Article published online before March 2002.
  18. Deroo BJ, Archer TK: Glucocorticoid receptor activation of the IκBα promoter within chromatin. Mol Biol Cell. 2001, 12 (11): 3365-3374.
  19. Levy-Wilson B, Fortier C, Blackhart BD, McCarthy BJ: DNase I- and micrococcal nuclease-hypersensitive sites in the human apolipoprotein B gene are tissue specific. Mol Cell Biol. 1988, 8 (1): 71-80.
  20. Agalioti T, Lomvardas S, Parekh B, Yie J, Maniatis T, Thanos D: Ordered recruitment of chromatin modifying and general transcription factors to the IFN-β promoter. Cell. 2000, 103 (4): 667-678. 10.1016/S0092-8674(00)00169-0.
  21. Sewack GF, Hansen U: Nucleosome positioning and transcription-associated chromatin alterations on the human estrogen-responsive pS2 promoter. J Biol Chem. 1997, 272 (49): 31118-31129. 10.1074/jbc.272.49.31118.
  22. Kogan S, Trifonov EN: Gene splice sites correlate with nucleosome positions. Gene. 2005, 352: 57-62. 10.1016/j.gene.2005.03.004.
  23. Ioshikhes I, Bolshoy A, Derenshteyn K, Borodovsky M, Trifonov EN: Nucleosome DNA sequence pattern revealed by multiple alignment of experimentally mapped sequences. J Mol Biol. 1996, 262 (2): 129-139. 10.1006/jmbi.1996.0503.
  24. Bolshoy A: CC dinucleotides contribute to the bending of DNA in chromatin. Nat Struct Biol. 1995, 2 (6): 446-448. 10.1038/nsb0695-446.
  25. Satchwell SC, Drew HR, Travers AA: Sequence periodicity in chicken nucleosome core DNA. J Mol Biol. 1986, 191 (4): 659-675. 10.1016/0022-2836(86)90452-3.
  26. Ioshikhes I, Trifonov EN, Zhang MQ: Periodical distribution of transcription factor sites in promoter regions and connection with chromatin structure. Proc Natl Acad Sci USA. 1999, 96 (6): 2891-2895. 10.1073/pnas.96.6.2891.
  27. DataBase of Transcriptional Start Sites (DBTSS). []
  28. NCBI Reference Sequence Databases. []
  29. NCBI dbSNP. []
  30. BLAT. []
  31. UCSC Genome Browser. []


© Higasa and Hayashi; licensee BioMed Central Ltd. 2006

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.