Prediction and analysis of nucleosome exclusion regions in the human genome
© Radwan et al. 2008
Received: 02 January 2008
Accepted: 22 April 2008
Published: 22 April 2008
Skip to main content
© Radwan et al. 2008
Received: 02 January 2008
Accepted: 22 April 2008
Published: 22 April 2008
Nucleosomes are the basic structural units of eukaryotic chromatin, and they play a significant role in regulating gene expression. Specific DNA sequence patterns are known, from empirical and theoretical studies, to influence DNA bending and flexibility, and have been shown to exclude nucleosomes. A whole genome localization of these patterns, and their analysis, can add important insights on the gene regulation mechanisms that depend upon the structure of chromatin in and around a gene.
A whole genome annotation for nucleosome exclusion regions (NXRegions) was carried out on the human genome. Nucleosome exclusion scores (NXScores) were calculated individually for each nucleotide, giving a measure of how likely a specific nucleotide and its immediate neighborhood would impair DNA bending and, consequently, exclude nucleosomes. The resulting annotations were correlated with 19055 gene expression profiles. We developed a new method based on Grubbs' outliers test for ranking genes based on their tissue specificity, and correlated this ranking with NXScores. The results show a strong correlation between tissue specificity of a gene and the propensity of its promoter to exclude nucleosomes (the promoter region was taken as -1500 to +500 bp from the RefSeq-annotated transcription start site). In addition, NXScores correlated well with gene density, gene expression levels, and DNaseI hypersensitive sites.
We present, for the first time, a whole genome prediction of nucleosome exclusion regions for the human genome (the data are available for download from Additional Materials). Nucleosome exclusion patterns are correlated with various factors that regulate gene expression, which emphasizes the need to include chromatin structural parameters in experimental analysis of gene expression.
Nucleosomes are DNA-protein complexes that form the building blocks of eukaryotic chromatin. They are involved in genome condensation, and play a major role in the regulation of gene expression . Each nucleosome is made up of eight histone proteins that together form a structural unit able to accommodate 147 base pairs of DNA wound around it. The DNA sequence has to have the flexibility and curvature that allows it to circle around a nucleosome . Empirical and theoretical studies have both shown that there are certain DNA sequence patterns that are too rigid to form such loops . These patterns include GC-rich motifs as well as poly-A and poly-T tracts, and we have previously compiled them into NXSensor, a web tool that predicts which DNA sequences would not be conducive to nucleosome binding; we called these motifs nucleosome exclusion sequences .
Transcriptional regulation in eukaryotes is a complex process, as exemplified by recent publications [5, 6]. The formation and positioning of nucleosomes are crucial steps in gene regulation, in that they influence access to DNA by the transcriptional machinery. Experimental work on nucleosome positioning in yeast [7–11] and fly  has yielded significant results, and technological progress is such that we are quickly learning more about nucleosome positioning in the human genome [13, 14].
Experimental work that verifies where a nucleosome is positioned is dependent upon when the cells were sampled, and on which tissue or cell line the analysis was carried out. In addition, it is known that nucleosomes slide to allow certain regulatory mechanisms to take place , and it has been shown in yeast that nucleosomes are only occasionally positioned by intrinsic sequence signals . We therefore chose nucleosome exclusion sequences as our predictive method, rather than nucleosome positioning sequences. We can with a certain level of certainty predict where nucleosomes would not bind, and it is therefore inferred that they can, and probably do, bind elsewhere.
Reported studies in [4, 17] observed certain trends in the nucleosome exclusion patterns of promoter regions. Both studies showed that there is a peak of nucleosome exclusion sequences just upstream of the transcriptional start site of genes. This pattern has subsequently been found to be true also in yeast . The studies in [4, 17] found that widely expressed genes, sometimes referred to as housekeeping genes, had a higher nucleosome exclusion potential than did tissue specific genes. This implies that the promoter regions of widely expressed genes were less likely to have nucleosomes in them than were the promoter regions of genes that had a narrow tissue distribution. This may allow easy access of the transcriptional machinery to the DNA of ubiquitously expressed genes. However, these studies had taken relatively small numbers of carefully selected human genes: 100 of each category in the case of , and 500 each in , and they both relied on manual selection and categorization of genes. The question remained whether there is a genome-wide trend of a gradient of nucleosome positioning potentials, and what implications this may have for the specificity of gene expression. These were the initial questions that we set out to answer in this study.
The objective of the present study, therefore, was to carry out a whole genome annotation of nucleosome exclusion regions (NXRs) in the human genome, and to correlate the results with tissue specificity, gene expression levels, and DNaseI hypersensitive sites. We calculated nucleosome exclusion scores (NXScores) across the whole genome, and observed NXScore trends in promoter regions. We classified tissue specific and widely expressed genes according to a new method proposed here based on Grubbs' outliers test, and validated the results using a previously described method based on Shannon's entropy . From a computational perspective, patterns such as NXRs and NXSs are fuzzy, non-exact, and overlapping, which poses a challenge for the analysis of all 3.4 billion base pairs of the whole human genome. We therefore developed a pilot grid architecture that can carry out such computationally intensive tasks. In this paper we report our results in the context of the regulation of gene expression.
We obtained the gene expression profiles of 19055 genes and developed a new method for ranking the tissue specificity of those genes. The available SymAtlas "tissue list" includes 79 cell types, tissues and organs, which makes it difficult to classify genes categorically into tissue specific groups. Furthermore, genes that have been classified as tissue specific by other researchers were often expressed equally in three or four different tissues. In order to overcome this problem, we refer to genes as having a wide tissue distribution if they are expressed at relatively equal levels in five tissues or more, and as having a narrow distribution if they are expressed at relatively equal levels in only one or two tissues. To follow this idea through, we needed a method of ranking genes according to their tissue distribution, so that we could correlate this with NXScores.
The RefSeq-annotated transcriptional start site (TSS) was used to identify the promoter region of each gene, and NXScores were calculated for the region TSS-1500 to TSS+500. The resulting values were used to sort the 19055 genes in ascending order (i.e., from no nucleosome exclusion to complete nucleosome exclusion). The sorted list was then divided into n groups. The mean tissue specificity for each of these groups was calculated using a method we developed based on Grubbs' test , and we validated this method using an already established method for ranking tissue specificity based on Shannon entropy .
The results show that previous localized findings [4, 17] are valid on a whole genome level. There is a direct correlation between the NXScores in the promoter region and tissue specificity. The higher the NXScore of a promoter sequence, the less likely it is to include a nucleosome, and the less tissue specific the associated gene is. Given the complexity of transcriptional regulation in the eukaryotic system, there may be a few exceptions to this, but the genome-wide trend is clearly observed from our results. One could deduce from this that the transcriptional machinery has relatively unimpeded access to the TSSs of widely expressed genes. It is expected that the types of transcription factors that switch on widely expressed genes are generally not those that can tolerate the DNA being wound around a nucleosome.
Our results indicate that there is a gradually increasing tendency for the promoter to be nucleosome-free the closer one gets to the TSS (Figures 1 and 4). We used the RefSeq gene-annotations of transcriptional start sites (RefSeq-TSSs), and found the average NXScore to peak about 30 bp upstream from the RefSeq-TSS. However, we also found that the RefSeq-TSSs themselves are often 20–40 bp downstream from the TSSs determined by experimental methods [5, 26]. Therefore the peak of nucleosome exclusion seen in our results appears, on average, to be centered on the transcriptional start site. This is in agreement with the findings of , who provided experimental evidence that the region around the TSS in humans was relatively nucleosome-free.
Figure 4 highlights that all 19055 genes follow the trend explained above, and that the more tissue specific groupings follow that trend but with lower NXScore peaks. The top 10% most tissue specific genes have the lowest NXScore peak, meaning that even though their TSS region is depleted of nucleosomes, there are more exceptions to that trend in this group than there are across all the genes. This is in agreement with the previous conclusion that the more tissue-specific a gene is, the more likely it is to have nucleosomes on its promoter. The differences in NXScore peak value observed here suggest that with gradually increasing tissue specificity, nucleosome binding to promoter regions plays an increasingly important role in gene regulation.
NXScores for each gene were calculated from the RefSeq-annotated TSS to the RefSeq-annotated 3'UTR end of the gene, including all exons and introns. Then the median expression level was calculated for each gene using the SymAtlas gene expression profiles . We calculated the median in order to filter very high or very low expression levels that may be associated with specific tissues, since our objective for this analysis was to capture expression levels across each gene irrespective of tissue specificity. The genes were then sorted according to increasing NXScore, the sorted list was equally divided into 5 groups, and the mean expression level was calculated for each group. This grouping and the calculations undertaken were used to inspect general trends while filtering noise that may be associated with a limited number of genes within each group.
Thus far all our observations have been in silico. To validate our annotations, we compared our scores to conserved nucleosome locations that have been reported in recent studies [9, 13]. The study in  reported nucleosome occupancy on the promoter regions of several human genes, and we looked at the NXScores of those exact sequences. For further validation, we ran the NXScores algorithm on selected regions of the Saccharomyces cereviciae genome, namely those used in , to report experimentally verified nucleosome positions.
The study in  calculated the average log 2(Cy5 = Cy3) data of 57 MITF-bound promoters in the human genome. We compared these results with our calculated average NXScores promoter profile for the 19055 genes under inspection (Figure 6c), and obtained a medium-to-strong negative correlation (r = -0.47 based on Pearson product-moment correlation coefficient). This correlation is satisfactory keeping in mind that nucleosomes can slide according to the transcriptional activity of the promoter, and that our profile was calculated as a consensus promoter profile representing the 19055 genes, while the nucleosome positioning results were obtained using 57 MITF-bound promoters. In fact, overall, there were almost no examples where NXScores were high on areas that were experimentally shown to be occupied by nucleosomes.
As a final comparison, we looked at whether nucleosome exclusion scores would correlate with DNaseI hypersensitive sites (DHSs). It is known that nucleosome-free areas are more prone to digestion by DNaseI, and it was reported in  that ubiquitous DHSs, shared by 6 cell lines, were found near the transcriptional start sites of some genes, implying a wide usage of that gene, or at least of that promoter. The study in  predicted the hydroxyl radical cleavage intensity on naked DNA for each nucleotide in the ENCODE regions. We downloaded this data for the whole set of hg18 ENCODE regions from the UCSC Genome Browser, and calculated the mean value of the predicted cleavage intensity for each region. The objective of this analysis was to investigate whether regions with high NXScores would have a high predicted cleavage intensity.
First, we calculate the NXScores for each region, and took the locations for NXScore peaks that had NXScores higher than p, where p = μ + (τ × s). μ and s are the mean and standard deviation of the NXScores across the region, respectively. τ is a parameter for determining the height of the calculated peaks, such that the higher the τ value, the higher the peak value and fewer the number of the peaks across the regions, and vice versa. Next, for each peak location, we calculated the mean predicted cleavage intensity of a 147 bps neighborhood centered at the peak, and we averaged these values for all peak locations in a specific region for a specific τ.
In this way we were able to show that the mean predicted cleavage intensity around the peaks is higher than the mean intensity across the whole region, thus proving that regions with a high NXScore also have a high cleavage intensity.
A grid computing architecture was used to conduct a whole genome annotation for nucleosome exclusion regions (NXRs), and to calculate nucleosome exclusion scores (NXScores) per nucleotide across the entire human genome. The results, which are made available here, provide a measure of how likely a particular nucleotide neighborhood impairs nucleosome formation. We confirmed previous reports that nucleosomes tend to be excluded from the area surrounding the TSS of genes. We developed a new method for ranking the tissue specificity of gene use, and found that, generally speaking, the wider the tissue distribution of a gene, the fewer nucleosomes are likely to be found in the promoter region of that gene. In addition, we found that high NXScores were correlated with moderate gene expression levels, and with the presence of DNaseI hypersensitive sites.
There is increasing evidence in the literature that chromatin structure plays a crucial role in gene regulation [14, 29]. These results contribute to our understanding of the relationships between nucleosome distribution and gene regulation.
This study used a pilot grid computing architecture for computationally intensive DNA sequence analysis. The grid fabric is part of the IBM sponsored LAGrid  project. The architecture is composed of a group of utility and computational services. Utility services facilitate the automated composition of workflows, capture domain semantics, and address the distributed nature of the computations. The computational services are web services wrappers for sequence manipulation tools and algorithms, such as pattern searching and clustering. This architecture was designed with the objective of applying real-time, high performance capabilities to computationally intensive sequence analysis questions. In silico annotations of nucleosome exclusion regions across the whole human genome offered an opportunity to test and validate this design. A detailed description of the architecture and services was reported in .
We used a slightly modified version of nucleosome exclusion patterns identified in , which in turn were based on experimental data from a variety of sources [32–34]. These sequences were used to locate nucleosome exclusion regions (NXRs) throughout the human genome:
[(G/C)3 N 2]≥3; e.g.: GGCAACGCTTGGGTA
A ≥10(= T ≥10); e.g.: AAAAAAAAAA, TTTTTTTTTT
It should be noted that our algorithm did not include sequences that had a weaker tendency to exclude nucleosomes, or that were rare on a genome-wide level, such as TGGA repeats . This is because nucleosomes are known to slide  and we did not want to annotate a weak signal in case nucleosomes could slide into that particular region. Having said that, we do intend to update the annotations on the supporting online website when other strong exclusion sequences are reported and verified.
The hg18 (March 2006) human genome build was downloaded from the UCSC Genome Browser, and scanned base by base for NXRs. NXRs were annotated, and overlapping patterns were merged into one contiguous region in the final annotation. The annotations were compiled into a well supported exchange format for Feature description, GFF.
The nucleosome exclusion score (NXScore) measures the tendency of a specific DNA region to exclude nucleosomes. In order to have a continuous score across query sequences of variable length, the NXScore for each single base pair was calculated relative to a 147 base pair window, defined as the neighborhood of a particular nucleotide, centered at that nucleotide. The results per nucleotide are used to calculate the NXScore for any given region, as shown below:
Calculating the NXScore per nucleotide depends on the density of NXRs in the 147 bps neighborhood of that nucleotide, however to fine tune our score calculation we specifically evaluated the weighted density of NXRs in the neighborhood. The idea behind the weighted density is to assign higher weights to NXRs close to the base pair under calculation than distant NXRs. We used a simple linearly decreasing weighting function, after finding that other functions yielded similar results, and maintaining that our main concern was identifying the peaks rather than the rate of change of the scores. For example, the score for a nucleotide whose 147 bp neighborhood contains one NXR of length x and located at either end of the neighborhood should be less than the score for a nucleotide whose 147 bp window contains one NXR of the same length x but at the center of the window (i.e., surrounding the nucleotide in question).
NXScores can take the values 0 to 1 inclusive, such that if a nucleotide is centered in a neighborhood that is full of NXRs, then its NXScore will be equal to one. On the other hand, if the neighborhood is free from NXRs then the NXScore for that nucleotide will be zero. Figure 8 illustrates an example of NXRs and NXScores of a particular gene, chosen from chromosome 21. For display purposes, the NXScores in this figure and the rest of the figures in the paper were scaled up to span the range from 0 to 1000 inclusive.
Having defined the NXScores for single nucleotides, and given a DNA sequence of length n bp, its NXScore S n is defined as the average NXScores for the n bp that make up that sequence. This can be represented by the formula: , where s i is the NXScore for bp i, and the summation is over the n bp.
A number of methods, based on microarray gene expression datasets, have been proposed for measuring the tissue specificity of gene use. Despite the inherent limitations of comparing microarray datasets, some methods have been able to describe trends in tissue specificity. In , the effectiveness of using Shannon entropy was demonstrated for ranking genes according to their tissue specificity, from narrow or tissue-specific expression, to wide or ubiquitous expression. Shannon entropy was used and updated in . Earlier, a method derived from Akaike's information criterion, which was originally developed to detect outliers in a data set, was applied in , and was used to rank genes according to their tissue specificity. Using the GNF-SymAtlas  gene expression dataset , we categorized known genes according to their tissue specificity levels, and investigated their possible correlation to NXScores.
The GNF-SymAtlas dataset contains 44,775 expression profiles across 79 human tissues and cell types and is itemized by oligonucleotide probes . First, all non-specific and partially-specific microarray probe sets were removed from our dataset, leaving only the specific target data, with each probe corresponding typically to only one gene. This probe set was then joined with the Known Gene database of the UCSC Genome Browser , from which information on the chromosomal location of the gene, and its transcriptional start and end positions was extracted. Any further redundancies were filtered out at this step, resulting in 19055 genes. The promoter region sequence (-1500 to +500) for each of these genes was downloaded and fed for analysis through processing and pattern matching modules. A distributed grid computing architecture was used, in which the modules were wrapped as web services and distributed among a number of grid nodes. The development was done using Java, and some modules utilized BioJava APIs .
We propose a new and efficient technique for ranking genes according to their tissue specificity, based on Grubbs' outliers test . To validate our results we also used the previously published ranking mechanism utilizing Shannon's entropy. Both techniques gave almost the same results, verifying that this use of Grubbs' test is valid. It should be noted that the proposed technique in this study has the advantage of being able to detect both up-regulated and down-regulated genes in a microarray data set. We defined up-regulated genes as those that are expressed at a significantly high level in a limited group of tissues compared to their expression in other tissues, and down-regulated genes as those that are expressed at significantly lower levels in a limited group of tissues compared to mid- to high-expression in other tissues. Even though for this particular study we only used up-regulated genes, the applicability of this method is valid for other data sets.
Grubbs' outliers test:
The Grubbs' test , also known as the maximum normalized residual test, can be used to test for outliers in a univariate data set. Given the expression profile of a gene, the Grubbs' test G can be calculated as , where, , w t is the expression score for tissue t, std is the standard deviation for the expression profile, and is the mean expression score. The more specific the gene, the higher the G value, and vice versa. While this formula for G identifies up-regulated genes, replacing max with min can identify down-regulated genes.
The concept of Shannon's entropy  has a central role in information theory, and is sometimes referred to as the measure of uncertainty. The entropy of a random variable is defined in terms of its probability distribution, and has been shown to be a good measure of randomness or uncertainty. The entropy is maximum when the variable is uniformly distributed, i.e., it exhibits the highest uncertainty.
Given a gene expression profile similar to those in Figure 8, the Shannon entropy (H) can be calculated as , and, , where w t represents the expression score for tissue t, and p t is calculated by normalizing this value relative to the sum of expression scores for all tissues. The more specific the gene, the less its entropy, and vice versa.
The authors thank Ivan Bajic and the two anonymous reviewers for helpful and valuable comments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.