YY1's DNA-Binding motifs in mammalian olfactory receptor genes

Background YY1 is an epigenetic regulator for a large number of mammalian genes. While performing genome-wide YY1 binding motif searches, we discovered that the olfactory receptor (OLFR) genes have an unusual cluster of YY1 binding sites within their coding regions. The statistical significance of this observation was further analyzed. Results About 45% of the olfactory genes in the mouse have a range of 4-8 YY1 binding sites within their respective 1 kb coding regions. Statistical analyses indicate that this enrichment of YY1 motifs has likely been driven by unknown selection pressures at the DNA level, but not serendipitously by some peptides enriched within the OLFR genes. Similar patterns are also detected in the OLFR genes of all mammals analyzed, but not in the OLFR genes of the fish lineage, suggesting a mammal-specific phenomenon. Conclusion YY1, or YY1-related transcription factors, may help regulate olfactory receptor genes. Furthermore, the protein-coding regions of vertebrate genes can contain cis-regulatory elements for transcription factor binding as well as codons.


Background
The transcription factor YY1 is a Gli-Kruppel type zinc finger protein that is highly conserved from insects through vertebrates [1]. YY1 can function as an activator, repressor, or initiator depending upon the other regulatory elements in the region [2]. YY1 also interacts with a variety of proteins including components of RNA polymerase II complex, transcription factors, and histone-modifying complexes [3][4][5]. According to genome-wide surveys, about 10% of all human genes contain YY1 binding motifs in their promoter regions [6]. Functionally, YY1 is involved in many biological processes, including embryonic development, cell cycle progression, apoptosis, B cell development, Polycomb group Gene (PcG)-mediated repression, genomic imprinting, and X chromosomal inactivation [2,3,7]. YY1 was also initially identified as a factor controlling the transcriptional activity of the murine retrotransposon 'Intracisternal A Particle' [8]. Since then, many retroposons, including SINE, LINE, and endogenous retrovirus families, have been shown to contain YY1 binding sites in their promoter regions [3,4]. Due to this ubiquitous presence of YY1 binding sites in genome-wide repeats, YY1 has also been regarded as a surveillance gene that is responsible for repressing transcriptional background noise from these repeats [9].
The olfactory receptor (OLFR) genes of mammals encode short, single coding exon, G protein-coupled receptors that are responsible for sensing a large number of airborne scents [10]. This gene family is comprised of over 800 and 1,300 gene members in human and mouse respectively, forming the largest gene family in mamma-lian genomes [10][11][12][13]. The aquatic vertebrates, the teleost fish lineage, also have a similar odorant receptor gene family [14]. However, the odorant receptor (OR) family of the fish lineage consists of a much smaller number of genes than that of the mammals, and these OR genes are also much more diverse in sequence identity than those of mammals. Mammalian OLFR genes are divided into Class I and Class II groups based on sequence identity [15]. Class II genes make up ~90% of OLFRs and are thought to have expanded during the transition to land-based living.
In mammals these olfactory receptors presumably expanded due to the selective advantage conferred by a well developed sense of smell [15]. While mice and other mammals retain function and expression of almost all OLFRs, the majority of these are pseudogenized in humans [16]. The mammalian OLFR genes are highly tissue-specific and are expressed primarily in the olfactory epithelium though a subset expresses in a chemosensory role in other tissues such as kidney and sperm [17][18][19]. Furthermore, only one copy (allele) out of all 1,000 OLFR genes is selected and expressed in each neuron cell of the olfactory tissue [20]. The unusual transcriptional control of the OLFR gene family is likely mediated through unknown trans-acting factors [21]. The tissue-specific nature of their expression coupled with their widespread duplication requires a mitotically-stable global silencing mechanism in all cell types. According to recent studies, potential cis-regulatory elements recruiting these trans factors are hypothesized to be located within the proteincoding regions of the OLFR genes rather than their surrounding genomic regions [22,23].
While performing genome-wide searches of the DNAbinding motifs of YY1, we discovered that the mammalian OLFR genes contain unusual clusters of YY1 binding sites within their protein-coding regions, whereas most YY1 binding sites are solitary and upstream of a regulated gene. In this study, we further analyzed the significance of this discovery with several bioinformatic and statistical measures, which will be described below. Specifically we test whether the presence of the YY1 binding sites could be explained by DNA sequence or amino acid motif conservation.

YY1 DNA-binding motifs in the mammalian OLFR genes
YY1 is predicted to be a global epigenetic regulator based on its ubiquitous expression and interaction with many histone-modifying enzymes [4]. As part of the efforts exploring this possibility, we have performed several series of YY1 binding motif searches using the genome sequences of mammals (human, mouse, and cow) [24][25][26]. Here we first scanned the genome sequences of human and mouse using a Position Weight Matrix (PWM)-based Perl script. Repetitive elements are known to contain YY1 binding sites so the RepeatMasked genome was used [27]. Later, the results of these searches were visualized using the repeat-masked Custom Track of the UCSC genome browser [28]. While inspecting global localization patterns of YY1 binding motifs in each genome, we noticed that clusters of YY1 binding motifs are co-localized with the genomic regions harboring olfactory receptor (OLFR) genes. The mammalian OLFR genes show a single coding exon structure, and they are also localized as gene clusters in specific regions of mammalian chromosomes. One such example is shown using the 100-kb genomic region from Mmu 7 chromosome (Figure 1). This figure shows a representative sample of 5 OLFR genes, and the locations of these genes coincide with those of YY1 binding motifs. Each OLFR gene appears to contain a range of 4 to 8 YY1 binding motifs within its 1-kb-long Open Reading Frame (ORF). As expected, the coding regions correlate well to the placental mammal conservation plot, a default track available on the UCSC genome browser. Also, the identified YY1 binding motifs appear to be random in location within the coding regions but do show a bias in orientation with respect to the direction of OLFR gene transcription, which will be described later.
The unusual clustering of YY1 binding motifs within the protein-coding regions of the OLFR genes was further analyzed to test if this pattern is unique to only the OLFR genes or also found in the ORFs of other genes. For this analysis, the entire set of the mouse mRNA database was scanned with the PWM-based Perl program to identify YY1 binding motifs. The number of the identified YY1 binding motifs within a given DNA sequence corresponding to the transcript was further divided by the size of the mRNA sequence, yielding a YY1 density score. The mouse mRNA sequences (total number = 20,191) were subsequently binned based on their relative YY1 density scores (the X-axis on Figure 2a). The Y-axes of figure 2 represents the number of genes within a given range of the YY1 density score. This analysis indicated that the majority of the non-olfactory mouse mRNA sequences (19,083 sequences) were distributed evenly and randomly throughout the varying ranges of the YY1 density score (0 to 0.114) (Figure 2a). In contrast, our detailed inspection revealed that about half of the OLFR gene set (1,108) show very high YY1 density scores. To better visualize this unusual pattern, we separated only the OLFR gene set from the rest of the mRNA set, and derived another histogram ( Figure 2b). As shown in Figure 2b, about 45% (or 496 of 1,108) of the OLFR genes have YY1 density scores ranging from 0.032 to 0.094, which are equivalent to 4 to 8 (or more) YY1 binding sites per a 1-kb mRNA sequence. In contrast, a scan of the 1 kb upstream regions of 20,419 Refseq genes revealed that only 10% of these putative pro-moter containing regions had a density greater than 4 YY1 binding sites.
We also repeated the above analyses using the mRNA data sets derived from human, cow, and zebrafish transcriptomes to test the evolutionary conservation of this unusual clustering of YY1 binding motifs within OLFR genes (Figure 2c &2d). The results from human and cow (data not shown) mRNA data sets also showed a pattern consistent with the mouse data set: an unusual clustering of YY1 binding motifs within many OLFR genes. In humans we found 57% (or 215 of 380) of OLFRs contain more than 4 YY1 binding sites while percentage in cow rose to 82% (or 740 of 900). It is important to note that the total number of the human OLFR genes (380) in the figure is smaller than those of the other mammals since a large fraction of the human OLFR genes are known to have become pseudogenes in recent evolutionary times and we removed all genes annotated as hypothetical. In contrast, the OLFR genes of zebrafish do not show a similar pattern to mammals. The total 25 OLFR genes of zebrafish show as wide a range of the YY1 density scores as seen in the other non-OLFR genes ( Figure 2d). These results confirm that the unusual clustering of YY1 binding motifs within the OLFR genes is a feature of the mammalian lineage.

Statistical tests for the clustering of YY1 binding motifs within OLFR genes
We performed two different series of analyses to test the functional significance of the observed clustering of YY1 binding motifs within the mammalian OLFR genes. First, we tested whether some peptide motifs enriched within the protein sequences of the OLFR genes are responsible for a spurious display of the YY1 binding motifs in the nucleotide sequences of the OLFR genes. If some peptide motifs are enriched then we expect a significant difference in the number of those motifs found in OLFR genes versus the rest of the proteome. For this test, we identified 49 individual dipeptides that can be encoded by the core motifs (CCAT or ATGG) of the YY1 binding consensus sequence, and determined if any of these dipeptides were over-represented in the protein sequences of the OLFR genes ( Figure 3). We first calculated the frequencies of all 400 possible combinations of dipeptides using all other protein sequences except OLFR genes of the mouse to derive a reference set of the expected frequencies for the YY1 binding sites in genomic context Figure 1 YY1 binding sites in genomic context. YY1 binding sites are mapped onto the Custom Track of the UCSC Genome Browser http://genome.ucsc.edu/. YY1 binding sites are marked by blue and red bars indicating forward and reverse directions respectively. Height on the Y-axis represents the score of each YY1 binding site in relation to the consensus sequence. This 100 kb segment of the genome contains five olfactory receptor genes and shows association of several YY1 binding sites correlated with the exons of the genes. Mammalian conservation is shown in the plot at the bottom of the map. This map was created through scanning the RepeatMasked sequence with a scoring matrix representing the YY1 binding site consensus. Mammalian conservation correlates with exons as expected, but also with many YY1 binding sites in this region.

kb
Mouse chromosome 7 400 dipeptides. In parallel, we also separately calculated the frequencies of the 400 dipeptides, including the 49 dipeptides that can encode the YY1 core motif, using only the protein sequences of the mouse OLFR genes. The frequencies derived from the OLFR genes (Observed Values) were compared with the reference set (Expected Values) via the Z-score to identify any over-represented dipeptides within the OLFR genes (Figure 4a) ( Table 1). This global comparison did not immediately identify any dipeptides that are unusually enriched within the protein sequences of the OLFR genes. However, according to the detailed analyses using the Z-score values, two dipeptides (PL and AI) among the 49 dipeptides for the YY1 core motif showed significant enrichment (P < 0.01, Figure 4b). Even so, this enrichment is limited to the 4 bp YY1 core binding site and not the complete 10 bp motif which we used to calculate YY1 density in OLFR genes. Furthermore, the reverse-translated sequences of these two peptides (PL and AI) have 24 and 12 fold degeneracy, respectively. This means that only one out of 24 PL or 12 AI dipeptides may Histogram of genes with increasing YY1 density Figure 2 Histogram of genes with increasing YY1 density. Each olfactory receptor gene was scanned for the number of YY1 binding sites found within its transcript. The combined score was divided by the gene length, resulting in the YY1 density for the given olfactory receptor gene. Note that the quality of the YY1 scores varies based on the match generated by the Position Weight Matrix, and thus the calculation of YY1 binding sites per 1 kb is approximate. Genes were ordered based on increasing YY1 density and the histograms were derived by the relative position of a gene within the group of all genes. A) The mouse non-olfactory genes binned by relative order of increasing YY1 density (x-axis). The y-axis indicates the number of genes in each bin. B) Mouse olfactory genes binned in the same fashion, also from their position within the group of all genes. The highest group of bins corresponds to a density of 4-11 YY1 binding sites per 1 kb transcript for mouse though only 6 genes show a YY1 density corresponding to more than 8 YY1 sites per kb. For human and mouse (panels B&C), there is a large group of olfactory receptor genes with high YY1 density, whereas fish (D) does not show this pattern.
contain the actual core motif of the YY1 binding consensus sequence. Therefore, the detection of 4 to 8 YY1 binding motifs per one OLFR gene cannot be simply accounted for by the serendipitous overlap between the YY1 core motif and a subset of codon combinations encoding frequent peptide motifs of the OLFR genes.
As a second measure, we carefully analyzed the positions of the identified YY1 binding motifs within the ORFs of the OLFR genes ( Figure 5). Conservation in location of YY1 binding site would indicate conservation of the encoded dipeptide motif whereas a lack of conservation in motif location is consistent with selection for the presence of the motif at the DNA level. The full length YY1 binding motifs within the OLFR genes do not show any patterns in their relative positions to the protein sequences of the OLFR genes as would be detected by conserved dipeptide motifs. However, there is a bias in the orientation YY1 binding sites in OLFR genes. OLFR genes contain an average of 3.4 forward sites for every reverse site per 1 kilobase. Non-OLFR genes show only a slight bias of 1.2 reverse sites for every forward site per 1 kilobase. In contrast, a similar analysis on the members of the histone 4 gene family (Hist1 h4) resulted in a different outcome. Two potential YY1 binding motifs were found within the protein-coding regions of this gene family, but the relative positions and orientations of these two motifs are identical and fixed among all the members of this gene family in mouse. The two identified YY1 binding motifs also coincide with two conserved peptide motifs (AM, and IA). Given the high levels of sequence conservation detected within the members of the histone 4 family (95% sequence identity), it is uncertain whether the identified YY1 binding motifs are genuine cis-regulatory elements or simply reflecting the serendipitous sequence match between the YY1 core motifs and some combinations of codons encoding conserved peptides. This pattern is in stark contrast to the random patterns associated with the position of YY1 binding motifs in the OLFR genes. This random pattern supports our initial idea that the clusters of the YY1 binding motifs within the OLFR genes most likely have been formed and maintained as cis-regulatory elements by purifying selection at the DNA sequence level. The bias favoring forward orientation may be an indication that forward sites enhance the efficiency of YY1 suppression of olfactory receptor genes. It is conceivable that multiple YY1 binding sites have been selected in OLFR genes with a preference for directionality as well as number. Alternatively, the non-random placement and orientation in histone 4 genes along with their extreme conservation prohibits a determination of whether selection is occurring at the DNA level or the protein level.

Discussion
In the current study, we have shown an unusual enrichment of YY1 DNA-binding motifs in the OLFR gene family of mammals. About half of the members of the OLFR gene family have a range of 4-8 YY1 binding motifs within their protein-coding regions ( Figures 1&2). Statistical analyses further confirmed that this enrichment of YY1 binding motifs is consistent with functional relevance and has likely been driven by unknown selection pressure at the DNA level. Overall, the current study suggests a potential role of YY1 or YY1-related transcription factors in the regulation of the mammalian OLFR genes. Also, this study provides further evidence that the protein-coding regions of vertebrate genes can both encode codon information and contain cis-regulatory elements for transcription factor binding.
The mammalian OLFR genes are expressed primarily in olfactory neurons, and only one single copy (allele) out of the entire 1,000 family members is expressed and functional in a given neuron cell [20]. This highly tissue-specific expression pattern of the OLFR genes necessitates a global repression mechanism for the majority of the OLFR genes in neural cells and in the other cell types. This mechanism acts prior to, and is separate from the negative feedback which prevents bi-and multi-allelic expression in olfactory neurons [29,30]. Since unknown mechanisms are believed to repress a large number of the OLFR genes all the time in most of the cell types, it is likely that these repression mechanisms are mediated through epigenetic modifications. Since YY1 is a well-known epigenetic regulator in the animal genome, it is plausible to propose that the identified YY1 binding motifs may play some roles in the predicted repression mechanisms for the OLFR genes [3]. In that regard, it is important to note that YY1 is one of the Polycomb group Gene (PcG) members found in both invertebrates and vertebrates [2]. Furthermore, recent studies hinted at the possibility that PcG-mediated repression mechanisms might be involved in the regulation of the OLFR genes [31]. In mutant mouse embryonic stem cells lacking the embryonic ectoderm development (EED) protein, some OLFR genes do not replicate asynchronously as they differentiate, suggesting a loss of the typical pattern of monoallelically expressed genes [32]. According to the results from another recent study, some cis-regulatory elements responsible for selecting one active OLFR copy in a given neuron cell are predicted to be YY1 DNA-binding motifs as translated dipeptides Figure 3 YY1 DNA-binding motifs as translated dipeptides. The four base pair core motif of the YY1 binding site is CCAT (forward) and ATGG (reverse). In principle, this core can code for two codons in each of the three frames. In the forward direction, CCAT can encode 29 possible dipeptides while in the reverse direction ATGG can encode 21 dipeptides, with one duplicate. These 49 dipeptides are a subset of the 400 possible dipeptides which can be found in a protein (20 total amino acids) 2 . The distribution frequency of these dipeptides has been examined to determine if any of the 49 YY1 binding sites is overrepresented in olfactory receptor genes.
located within the protein-coding regions of the OLFR genes [22]. This is intriguing and consistent with the observation of the current study in that some critical ciselements are located within the protein-coding regions of the OLFR genes. The genetic code is optimal for containing multiple layers of encoded information within protein-coding regions [33]. In sum, although further investigation is warranted, it is likely that the identified YY1 binding motifs are functionally important cis-regulatory elements for the regulation of the OLFR genes.
The YY1 binding motifs identified within the OLFR genes are unusual since they are localized within the protein-coding regions of these genes and are present at high density. This is quite different from the typical pattern in which transcription factor binding sites (cis-regulatory elements) are located in the genomic regions surrounding the protein-coding regions of genes. OLFR genes give a fitness advantage by duplication and divergence while remaining active, yet must also retain regulatory information. According to our statistical analyses ( Figure 3&4), the identified YY1 binding motifs within the OLFR genes likely represent evolutionarily selected cis-regulatory elements. Previous studies have shown high levels of YY1 binding affinity to the type of YY1 binding motifs found within OLFRs [1]. Though the functionality of the identi-Correlation of observed and expected dipeptides and statistical significance Figure 4 Correlation of observed and expected dipeptides and statistical significance. The mouse proteome was divided into olfactory genes (1,178) and all other genes (33,788) and the frequency of each dipeptide was tabulated. Of the possible 400 dipeptides, 49 can encode the core of the YY1 binding site. A) Observed vs. expected plot of 400 dipeptides (hollow diamonds) as compared to the subset of 49 YY1 binding site core forming dipeptides (filled diamonds). The subset does not deviate from the overall pattern. Pro-Met is indicated as having a much larger observed than expected count. B) Z-score of the observed and expected counts of the subset of YY1 binding site core forming dipeptides. All but three fall within 1 standard deviation of their expected values. Only Ala-Ile, and Pro-Leu deviate significantly from their expected values.
fied YY1 binding motifs remains to be demonstrated in vivo, it is plausible that some functional constraints serve to maintain the YY1 binding motifs within the proteincoding regions of the OLFR genes. In one scenario these motifs might be linked to the sudden expansion of this gene family in mammalian genomes. The copy number of the OLFR genes has increased dramatically in recent evolutionary times in mammalian genomes, providing a large number of receptor proteins for airborne scents [13]. The clustering of the OLFR genes in chromosomal regions also suggests this gene family may have been duplicated through in situ tandem duplications [13]. Tandem duplication is known to be the most frequent mechanism in increasing copy numbers for gene families [32]. However, it is not well understood how this mechanism carries over the proper information for the transcriptional regulation of duplicated gene copies. In the case of the mammalian OLFR genes, their protein-coding regions may have both information for codon and cis-regulatory elements so that the duplication of these genes would most likely guarantee their coding potential as well as associated transcriptional control. This might have been one functional constraint for the co-evolution of the YY1 binding motifs within the protein-coding potential of the OLFR genes.

Conclusion
The current study reports that an unusual enrichment of YY1 binding sites, 4-8 binding sites per gene, are located in the coding regions of olfactory receptor genes in mammals. According to statistical analyses, these YY1 binding sites most likely have been selected as cis-regulatory elements. Also, similar patterns are found in other mammals, but not in fish, suggesting a mammalian-specific phenomenon. This study further suggests YY1 or YY1related transcription factors as regulators of mammalian OLFR genes.

Visualization of YY1 binding sites in coding regions
A custom Perl script, matrix-bidirectional.pl, was run against mouse chromosome files available from Ensemble (version NCBIM37.49) ftp://ftp.ensembl.org/pub/ current_fasta/mus_musculus/dna/ (see Additional file 1) [34]. The program calculates score by matching a 10 bp window to a matrix of the likelihood for each position (Additional files 2 &3) along the 10 bp consensus sequence (the Position Weight Matrix, PWM). Each base pair is given a value equivalent to the decimal percentage of its match to the known YY1 binding motifs. The four base pair core, CCAT, is scored at 100% for each base plus 2. Flanking bases vary in score from 0 to 1 based upon their frequency in known YY1 binding motifs. The total score for each 10 bp window is calculated and compared to our cutoff score of 8.0 which indicates a good match.
Our previous work revealed that scores above 8.0 correlate with good YY1 binding in vitro [1,35]. Output was generated in the WIG format for each chromosome with each YY1 binding motif score represented by start and end position and bar height corresponding to the PWM score match. Position weight matrix scores and location information were loaded into the University of California, Santa Cruz (UCSC) Genome Browser for visualization of YY1 location (Figure 1) [36].

Motif finding and scoring
Our Perl script was run against the mouse, human and zebrafish mRNA available from NCBI ftp:// ftp.ncbi.nih.gov/genomes/. Each gene was scored for number and quality of YY1 motifs. Results were sorted by a YY1 density score, the combined score of YY1 motifs divided by the length of the gene. Predicted and hypothetical genes were removed from the mouse yielding 40,009 total genes (19,818 removed, 20,191 remaining), with 19,083 non-olfactory and 1,108 olfactory genes. Hypothetical genes were removed from the human transcriptome, yielding 24,886 total genes with 24,506 nonolfactory and 380 olfactory genes. Hypothetical genes were removed from the zebrafish transcriptome, yielding 9,092 genes with 9,067 non-odorant genes and 25 odorant receptors. We removed the hypothetical and predicted genes because they may not contain complete ORFs. The upstream 1 kb regions of 20419 Refseq genes in the mouse was obtained from UCSC http://hgdown load.cse.ucsc.edu/downloads.html.
Histograms were made in Microsoft Excel 2007 by sorting the genes by YY1 density from high to low, then assigning a count to order the genes. Olfactory receptor genes were separated from non-olfactory receptor genes and the count numbers were used as position information to make a histogram which shows the distribution of OLFRs along the range of YY1 containing genes ( Figure 2).

Protein motif correlation testing
The mouse proteome was downloaded from NCBI ftp:ftp.ncbi.nih.gov/genomes/M_musculus/protein/pro tein.fa.gz which contains 34,966 peptide sequences and nomenclature. Olfactory receptor proteins (1,178) were separated from non-olfactory proteins (33,788). A Perl script, dipep-singlefile.pl was used to generate a count of each of the 400 possible dipeptides in each of these groups (see Additional file 4).
A Z-test was performed according to the formula Z = (observed -μ)/δ where observed is the count of each possible dipeptide found in olfactory receptors, μ is the mean of the counts from the whole population of non-olfactory proteins, and δ is the standard deviation of the population count. Z-score units are given in standard deviations from the mean. Expected count was calculated by multiplying the frequency of each dipeptide from all non-olfactory receptor proteins by the total number of dipeptides seen in OLFR proteins. Figure 4 shows the plot generated in Microsoft Excel 2007 comparing the observed to expected ratio for all 400 possible dipeptides and the subset of 49 YY1 core-forming dipeptides which exhibits no difference in the distribution of the subset.