Cis-regulatory variations: A study of SNPs around genes showing cis-linkage in segregating mouse populations
© GuhaThakurta et al. 2006
Received: 07 June 2006
Accepted: 15 September 2006
Published: 15 September 2006
Skip to main content
© GuhaThakurta et al. 2006
Received: 07 June 2006
Accepted: 15 September 2006
Published: 15 September 2006
Changes in gene expression are known to be responsible for phenotypic variation and susceptibility to diseases. Identification and annotation of the genomic sequence variants that cause gene expression changes is therefore likely to lead to a better understanding of the cause of disease at the molecular level. In this study we investigate the pattern of single nucleotide polymorphisms (SNPs) in genes for which the mRNA levels show cis-genetic linkage (gene e xpression q uantitative t rait l oci mapping in cis, or cis-eQTLs) in segregating mouse populations. Such genes are expected to have polymorphisms near their physical location (cis-variations) that affect their mRNA levels by altering one or more of the cis-regulatory elements. This led us to characterize the SNPs in promoter (5 Kb upstream) and non-coding gene regions (introns and 5 Kb downstream) (cis-SNPs) and the effects they may have on putative transcription factor binding sites.
We demonstrate that the c is-e QTL genes (CEGs) have a significantly higher frequency of cis-SNPs compared to non-CEGs (when both sets are taken from the non-IBD regions, i.e. regions not identical by descent). Most CEGs having cis-SNPs do not contain these SNPs in the phylogenetically conserved regions. In those CEGs that contain cis-SNPs in the phylogenetically conserved regions, enrichment of cis-SNPs occurs both within and outside of the conserved sequences. A higher fraction of CEGs are also seen to harbor cis-SNP that affect predicted transcription factor binding sites, a likely consequence of the higher cis-SNPs density in these genes.
This present study provides the first genome-wide investigation of the putative cis-regulatory variations in a large set of genes whose levels of expression give rise to cis-linkage in segregating mammalian populations. Our results provide insights into the challenges that exist in identifying polymorphisms regulating gene expression using bioinformatic sequence analysis approaches. The data provided herein should benefit future investigations in this area.
Single nucleotide polymorphisms (SNPs) in the genomic sequence underlie susceptibility to or protection from diseases by affecting biological processes at the molecular level, such as protein structure, transcription, alternative splicing etc . There are a number of examples in which polymorphisms in the promoter regions, and those causing expression changes in the corresponding genes, have been found to be associated with disease [2–5]. In addition, genetic variation of gene expression has been utilized to identify causal genes for complex diseases [6–8]. However, the pattern of polymorphisms that underlie heritable variation of gene expression in segregating mammalian populations, as well as bioinformatic sequence analysis methods for identifying these regulatory polymorphisms, have not yet been investigated in a systematic way. Here we characterize the pattern of cis-SNPs that could cause quantitative genetic variations in mRNA levels in two mouse intercross populations.
We investigated the frequency and the potential role of the cis-SNPs for disrupting transcription factor binding sites (TFBS) around the genes whose expression levels in murine intercross populations gave rise to strong cis-acting eQTL. We focused on this set of genes for the following reasons: 1) a sizable fraction of genes whose expression varies in a segregating population show cis-linkage [9–12], 2) evidence for the medical importance of cis-regulatory variation has been demonstrated by positional cloning studies in which SNPs in susceptibility genes that were not located in the protein coding or splice-site regions were nevertheless shown to be associated with complex human diseases such as stroke, type 2 diabetes etc. [4, 13, 14], 3) the polymorphisms that affect the expression levels of these genes are either in the genomic region of the gene or in the nearby upstream or downstream region (cis-regulatory variation ), which significantly restricts the search space for these causal variations.
We found a significantly higher number of cis-acting e QTL genes (CEGs) were in regions that were not i dentical b y d escent (IBD) between the parental inbred mouse lines used to construct the mouse crosses. In considering the genes that fall outside of these IBD regions, we found that a significantly higher number of CEGs have cis-SNPs in their promoter (i.e. immediate 5' upstream sequence) and non-coding regions (i.e. introns and immediate 3' downstream sequences) compared to genes that do not give rise to cis-acting eQTLs (non-CEGs). The density of SNPs in these regions is also significantly higher in the CEGs compared to non-CEGs. In addition, the enrichment of cis-SNPs is not limited to the highly conserved sequences between mouse and human, and in fact in a majority of the CEGs the cis-SNPs do not overlap any conserved sequences in the promoter or non-coding regions, suggesting that the cis-SNPs in these genes do not perturb the highly conserved sequences in the immediate vicinity. A higher fraction of CEGs have cis-SNPs that perturb predicted transcription factor binding sites (TFBS) in non-coding regions, likely a consequence of the higher cis-SNP density in these regions resulting in an increased number of intersections between cis-SNPs and the TFBSs.
The implications of the above findings on the challenges related to the identification and annotation of genomic regulatory polymorphisms through bioinformatic sequence analysis methods are discussed. Our results suggest that the approaches that are commonly employed in identification of putative regulatory variants, such as searches for polymorphisms in the immediate upstream regions and cross-species conserved sequences, are unlikely to elucidate a significant fraction of the cis-regulatory variations responsible for causing changes in gene expression in genetically segregating mammalian populations.
mRNA expression data for multiple tissues in F2 animals from two mouse intercrosses constructed from C57BL/6J and DBA/2J (referred here as the BXD cross) [10, 15], and from C57BL/6J and C3H/HeJ inbred lines (referred here as the BXH cross) , were available for analysis (for details see methods). The BXD F2 population [10, 15] consisted of 111 female mice and comprehensive mRNA expression profiles were available for liver, while the BXH F2 population  contained 334 mice (169 female, 165 male) and expression profiles were available for four tissues, namely liver, white adipose, whole brain, and skeletal muscle. All of the expression data from the two crosses we have used here for analyses were generated and described previously [6, 10, 16].
In the same manner as classic phenotypic trait data, QTLs for gene expression levels can be computationally mapped using genetic linkage mapping strategies . mRNA levels of genes were treated as continuous variables and mapped to the genome using a standard interval mapping procedure  to identify e xpression QTLs (eQTLs). From the linkage results CEGs were defined as follows: 1) eQTL LOD score ≥ 4.3 (the threshold in an F2 mouse intercross for achieving a genome-wide p-value of 0.05, and a point-wise significance of 0.00005), 2) eQTL is near the physical location of the gene itself (within 10 Mb, equivalent to roughly 5 cM), 3) the eQTL explains more than 10% of the genetic variation of expression for the gene in the respective F2 populations.
Using the specific conditions described above, a total of 3,769 distinct CEGs were identified (roughly 20% of the genes represented on the array) over all four tissues in the BXH cross, and 338 CEGs were identified in BXD cross. Reasons for identification of significantly fewer CEGs in the BXD cross relative to the BXH cross include: 1) availability of mRNA profiles from only one (liver) tissue compared to four tissues in the BXH cross, and 2) a lower number of animals (111 in BXD compared to 334 in BXH) resulting in a reduced power to detect QTLs. The number of CEGs for the BXD cross given here is less than previously reported for this same cross [10, 11], given we employed a more conservative definition of cis-eQTLs in this preset study for the purpose of minimizing the false positive calls and working with the highest confidence CEGs. The CEGs from all tissues in both crosses are provided in the supplementary materials (Additional file 1 and Additional file 2). Later we have described how we have prepared a common set of CEGs and non-CEGs for analysis by combining the data from the two crosses.
Genomic segments in different mouse strains that are inherited from a common ancestor are referred to as i dentical b y d escent or IBD. The IBD regions can be considered to be largely homologous sequence blocks between two strains, while the non-IBD (nIBD) regions can be considered as polymorphic blocks. Most of the polymorphisms between mouse strains exist in sequence regions that are not in IBD , and reported variations that are in the IBD regions either represent sequencing errors or mutations that occurred in the strains after sub-speciation.
So that the readers can focus on the key findings of the manuscript, we present the details of the IBD map we have used here  and the reason for its selection in the Data and Methods section. However, it is worth mentioning here briefly that a very significant enrichment of CEGs was observed in the regions that were n ot in IBD (nIBD) (with the Fisher exact test p = 8.87 × 10-28 for CEGs from the BxD cross, and 4.08 × 10-296 for CEGs from the BxH cross). The analyses described in subsequent sections were performed with the set of genes that are not in IBD regions. This is because genes in IBD regions would be expected to have significantly fewer SNPs if any, in the surrounding regions, and therefore the comparison of patterns of polymorphisms in those genes with the CEGs, most of which contain cis-SNPs and are in nIBD, would not be appropriate.
To characterize the frequency and location of cis-SNPs in genes, we constructed a common set of CEGs and non-CEGs from the BXD and BXH crosses. Given the number of CEGs identified in the BXD cross was small (~10% of the total available from both crosses), these data on their own would not be as highly powered to identify cis-SNP patterns of interest in the CEGs. Therefore, we combined the CEGs from the BXD and BXH crosses to carry out all subsequent analyses. Combining the 338 CEGs from BXD cross with the 3,769 CEGs from the BXH cross, and only considering the CEGs within the nIBD-BXH regions, resulted in a set of 2,047 distinct CEGs (see Additional file 2). The inclusion CEGs from the BXD cross added 75 distinct CEGs into the BXH data set. For the purpose of comparison with the CEGs, we created a set of non-CEGs by considering genes that did not give rise to any cis-eQTL in either cross. To be consistent with the CEG set, only non-CEGs falling in the nIBD-BXH were considered. Thus, a combined set of 2,705 distinct non-CEGs was created by taking the intersection of the genes that did not show cis-acting eQTLs in either of the two crosses. It is of note that some of the non-CEGs defined here may show up as CEGs in other segregating mouse populations, in other tissues, or in other F2 populations constructed from B6, DBA, and C3H mice (increased number of mice in a F2 population would have higher power to detect eQTLs). As additional comprehensive sets of CEGs become available, the sets of CEGs and non-CEGs can be refined to produce more accurate positive as well as negative sets.
CEGs by definition are expected to contain genetic variations near their physical location on the genome which give rise to variations of their mRNA levels in a segregating population. We have therefore studied the frequency and density of SNPs in the promoters and non-coding regions (for definitions see below) of the CEGs and compared them to non-CEGs. These studies are described below.
In defining the promoters and non-coding regions, the gene boundaries and exons were first determined based on clustering of all mRNAs and cDNAs (including ESTs) aligning to a common genomic locus as described in detail earlier [22, 23] (see Methods). The promoter regions were then defined to be the 5 Kb or 2 Kb sequence upstream of the gene start coordinates. The non-coding regions comprised of the introns and 5 Kb sequence downstream of the genes. SNPs in the promoter and non-coding regions of genes are referred to here as cis-SNPs.
Although transcriptional regulatory elements are often found to be concentrated in the immediate promoter region, they are also located in the introns and downstream regions . On one hand examining only the promoter sequence would clearly be insufficient; on the other hand including the introns and down-stream sequences could dilute the density of regulatory elements (if in fact they were enriched in the immediate promoter regions of most genes under consideration in our study), thereby making it difficult to identify any relationship between SNPs and these elements. Therefore we analyzed the promoter and non-coding regions (NCR) separately. In addition to the immediate vicinity of the genes, regulatory elements such as enhancers or silencers can also be present at distances that are far away from the genes themselves ; we have not addressed these in our present study.
We analyzed cis-SNPs in regions that were most conserved between the mouse and human genomes. Functional non-coding sequences are often assumed to be under evolutionary selection pressure, and thereby conserved relative to the surrounding non-functional sequence. Consequently, phylogenetic footprinting has been widely used for the analyses of non-coding regulatory sequences [26–29]. Although phylogenetic footprinting methods have limitations (sequences from organisms that are too distant or too close can be uninformative), the alignments of rodent-human sequences have been demonstrated in many studies to be successful in identifying regulatory elements, and significant enrichment of known regulatory elements have been found in these regions [26–30]. We therefore investigated the presence of SNPs in the mouse-human aligned regions in the promoters and non-coding regions to see if a higher fraction of CEGs contain cis-SNPs in these conserved sequences. For this purpose the mouse-human genome alignments were taken directly from the UCSC genome annotation project , where the two genomes were aligned using the BLASTZ software  and post-processed to obtain the best alignments for each region (see Methods for details). These alignments represent the most conserved sequences between the mouse and human genomes and cover ~6% of the mouse genome, which is roughly the percentage of mammalian genome that is estimated to be under purifying selection .
Genes containing SNPs in promoters (Prom) and non-coding regions (NCR)
Number of Total Genes (CEGs+Non-CEGs)
Number of CEGs
Total Genes Containing SNPs
CEGs Containing SNPs
Ratio of Over-representation
Prom 2 Kb
Prom 5 Kb
Cons Prom 2 Kb
Cons Prom 5 Kb
SNP density in the promoter or non-coding regions of CEGs and non-CEGs
Number of CEGs
Number of Non-CEGs
(Normalized by total non-coding or promoter length)
(Normalized by Conserved or Non-Cons region length)
Mean SNP Density CEGs
Mean SNP Density Non-CEGs
Mean SNP Density CEGs
Mean SNP Density Non-CEGs
Full Combined Set
Prom 2 Kb
Prom 5 Kb
Genes with No
SNPs in Conserved
Prom 2 Kb
Regions (subset 1)
Prom 5 Kb
Genes having SNPs
Non Cons NCR
Regions (subset 2)
All Prom 2 Kb
Non Cons 2 Kb
Cos Prom 2 Kb
All Prom 5 Kb
Non Cons 5 Kb
Cons Prom 5 Kb
In order to compare the density of cis-SNPs in the conserved and non-conserved regions, genes were partitioned into two sub-sets, namely, those with no cis-SNPs in mouse-human conserved regions (subset 1), and those containing cis-SNPs in the conserved regions (subset 2) (Table 2). In subset 1, containing a majority of the CEGs, higher cis-SNP density was observed in both promoter and non-coding regions (p < 0.01). In subset 2, a higher cis-SNP density was observed in non-coding region (p < 0.01) only. Upon normalizing the number of SNPs by the length of the conserved or non-conserved sequence (instead of the total promoter or non-coding sequence length), significantly higher density was observed in both conserved as well as non-conserved non-coding region (p < 0.01, Table 2, subset 2). In the 5 Kb upstream promoter regions of genes in subset 2, significantly higher SNP density was observed only when the number of cis-SNPs in mouse-human aligned sequences was normalized by the length of these conserved regions.
In an attempt to study what effect the cis-SNPs in CEGs have on the transcription regulatory machinery, the perturbation of transcription factor binding sites (TFBSs) by cis-SNPs was investigated. All known mouse, rat and human TFBSs (a total of 2,528 sites) from the TRANSFAC® database  were first mapped to the mouse genome using BLASTN (for mapping details see Methods). However, none of the mapped sites overlapped with cis-SNPs of any of the CEGs. Consequently we investigated the overlap of predicted TFBSs with cis-SNPs.
The rationale and caveats for using predicted TFBSs are discussed below. It has been shown through experiments that the score of a transcription factor (TF) binding site, as computed from a position weight matrix (PWM) built from a collection of its known sites, can give a fairly accurate estimate of the in vitro DNA binding affinity of the transcription factor to that site (e.g. [35–37]). This observation and the thermodynamic principles behind it forms the basis of most of the generic bioinformatic methods that are in use today to predict TFBSs in genomic DNA (Reviewed in [36, 38]). Compared to in vitro, the TF-DNA binding events are definitively more complicated in vivo since TF binding to DNA in eukaryotes is context dependant (e.g. dependant on other TFs which bind nearby DNA sites, local DNA structure), and influenced by factors like chromatin remodeling and concentration of the TF. But such contextual and other relevant information are available only in rare cases and cannot be generally leveraged in the prediction of TFBSs . Therefore, although the change in TFBS score may not be accurately predictive of the binding of a transcription factor to its target DNA site (and of the change to the target gene's expression) in vivo, in the absence of other specific information such as chromosomal regions that are open for the regulatory proteins to bind, the DNA binding partners for a given TF, concentration of TF etc., the approach we have taken here (i.e. looking for base changes that lead to perturbation of the binding sites predicted with models built out of previously known sites for TFs) is a reasonable strategy (and the only generic strategy at this time) that one can use to examine how SNPs may affect TF binding to putative TFBSs. This is a common strategy that has also been used by others for the prediction of TFBSs as well as prediction of putative regulatory SNPs that could perturb TF-DNA binding and cause changes to expression of the target gene [39–42].
In our study TFBS predictions were made with PWMs representing the transcription factor DNA binding sites available from the TRANSFAC® (v. 6.3) database  using the MATCH™ software . Only PWMs generated from the collections of vertebrate DNA binding sites were used. 30 bp regions were taken around all cis-SNPs (a total 61 bp including the SNP nucleotide) and scored with the PWMs (for details see Methods). Both the B6 and DBA alleles were scored, since as explained earlier, these were the variants that were used in the analysis of data from both crosses. If a cis-SNP location overlapped with a predicted binding site, and a difference was observed in the predicted binding site score due to the two alleles, the change in score was noted, and the predicted TFBS was considered to be perturbed by the SNP.
Since the TFBSs are typically short and degenerate, predictions using PWMs are known to contain a large percentage of false positives [28, 36, 44]. Therefore, orthogonal data such as co-regulation of the target genes with the transcription factors or phylogenetic footprinting, are commonly used to increase the specifiCity of these predictions [28, 44, 45]. Although the transcription factors and their target genes may not co-regulate at the mRNA level, it is generally assumed that genes that do co-regulate across a diverse set of conditions may belong to the same regulatory pathway [46–48]. To reduce the number of false positive predictions for TFBSs we employed a similar strategy, requiring that the transcript levels of the TFs and their putative target genes (based on the TFBS predictions) be significantly correlated across a diverse set of mRNA profiling experiments (see Additional file 3 and 4). Using expression profiles available from a set of 145 diverse mouse tissues and cell-lines [49, 50] (referred to here as the 'body-atlas' data set), we determined the Spearman rank-order correlation (with p < 0.01) between all genes and the 282 distinct vertebrate transcription factors in the TRANSFAC® (v. 6.3) database which have known gene symbols as well as PWM models for their DNA binding sites. It is of note that the body-atlas expression data set was used instead of the BXD and BXH F2 populations given others have shown that significant correlation between any two genes in a segregating population can result from closely linked eQTLs as opposed to biologically relevant co-regulation . These effects can be amplified in cases where genes give rise to strong cis-eQTLs.
Perturbation of TFBSs by cis-SNPs and summary gene counts where TFBS predictions are affected by cis-SNPs
Total Genes Containing SNPs
CEGs Containing SNPs
Total Genes with TFBS Score Changed by SNPs
CEGs with TFBS Score Changed by SNPs
Prom 2 Kb
Prom 5 Kb
Cons Prom 2 Kb
Cons Prom 5 Kb
In order to determine whether the cis-SNPs in CEGs perturb predicted TFBSs with an increased frequency relative to the non-CEGs, we compared the fraction of cis-SNPs affecting TFBSs in CEGs versus non-CEGs using the Fisher exact test. The fraction of cis-SNPs in CEGs affecting predicted TFBSs was not observed to be higher (at the 0.01 significance level), suggesting that a higher rate of TFBS perturbation by SNPs in CEGs is likely due to the increased density of cis-SNPs in these genes relative to non-CEGs.
To illustrate how high-density SNP data may be intersected with eQTL data to identify putative candidate quantitative trait nucleotide (QTN) underlying the eQTLs (and also to illustrate the different types of data we have used in our analyses), we highlight one example of a CEG with a cis-SNP in its promoter region perturbing a predicted TFBS (Figure 2). The gene Casc4 (cancer susceptibility candidate 4) gives rise to a strong cis-acting eQTLs (LOD score ≥ 10) in a number of tissues in the BXH cross (Figure 2a). As Casc4 is in a nIBD-BXH region, the polymorphisms between B6 and C3H in this region should be identical to those between B6 and DBA. There are five SNPs in the promoter (upstream 5 Kb) of this gene; only one SNP (mCV23866990), which is located close to the 5' end of Casc4 (-701 bp), perturbs the predicted binding-site for a transcription factor, Hand1 , whose mRNA level is correlated with that of the gene. The mCV23866990 SNP genotype shows a significant association with the expression level of Casc4 (p < 0.0001 using a standard one-way ANOVA) (Figure 2b). A conserved position in the binding site for the Hand1 is perturbed by mCV23866990 (Figure 2c). By affecting the Hand1 binding site in the promoter, this cis-SNP could be responsible for differential expression of Casc4 in the BXH F2 population.
Isolating the specific causative regulatory mutations underlying eQTLs that are responsible for variation of gene expression in a segregating mouse population is difficult. This is especially true in an F2 population, where regions of linkage disequilibrium are very large in any given region (given an F2 population is constructed from intercrossing a single F1 founder). Determination of the actual functional role of the causative polymorphisms is even more challenging, since there are several different molecular mechanisms through which mRNA levels in cells can be regulated. Although such challenges exist, putative candidate polymorphisms that affect transcription of a given gene may be prioritized for experimental validation, and hypotheses can be generated for the possible biological roles of candidate regulatory cis-SNPs based on examination of the data, as illustrated by the example above. For such candidates the gold standard is to introduce the polymorphism in question onto the background of a wild-type mouse and then compare changes in the in vivo activity of the gene and phenotypes to the wild-type mouse.
SNPs are often used as markers for disease, and as noted earlier, there are now several examples where cis-regulatory variants are associated with disease [5, 13, 14, 42, 53, 54]. Computational approaches for identifying the cis-regulatory polymorphisms would therefore be useful in prioritizing the candidate polymorphisms that a play causative role in disease, reducing the laborious experimental process of testing multiple candidate variants in vivo, selecting biologically meaningful SNPs for association studies, and ultimately in generating testable hypotheses for elucidating the molecular basis of a given disease. However, little bioinformatics research has been done in a systematic way to build predictors of variations that are likely to affect gene-expression in segregating mammalian populations [39–42, 55]. In this study we have investigated the frequency and potential biological role of the polymorphisms underlying genes whose expression give rise to strong cis-linkage in segregating mouse populations. The study provides the first investigation of putative regulatory SNPs around genes showing cis-linkage, and insight into the challenges associated with identifying the causative regulatory variants in such populations through bioinformatic sequence analyses methods. In addition, the data we provide here (CEGs from four different tissues in two crosses, cis-SNPs, and predicted TFBSs affected by those cis-SNPs) should benefit further investigations in this area.
There have been a few previous studies surveying the role of cis-polymorphisms and haplotypes in promoter regions of sets of human genes, and identifying those that change expression [5, 40, 51, 53, 56–58]. These studies considered a relatively small sampling of genes (<300) and assessed the promoter SNPs that affected the expression in a limited number of cell lines with reporter gene assays [51, 57–59]. Since the changes in expression due to the polymorphisms were tested in cell-lines, it is not known if the SNPs that caused expression changes in the in vitro assays are responsible for varying levels of expression in vivo in genetically segregating populations. We started with large-scale genetic linkage data of gene expression and investigated the frequency of polymorphisms in the genes showing cis-linkage. Therefore the work we present here (namely, investigation of SNPs in the vicinity of genes with cis-expression linkage in segregating populations in as many as four different tissues), is complementary to the previous work, and provides a different approach to investigating cis-regulatory SNPs with murine populations.
One recent study reports the mapping of cis-regulatory variants in a small set of genes to haplotype blocks in human samples , and another recent study reports the investigation of cis-regulatory variations in 3' UTRs of a set of genes showing cis-acting regulation in a panel of mouse recombinant congenic strains . However, to the best of our knowledge, the present study represents the first large scale genome-wide survey of cis-SNPs in genes that give rise to strong cis-eQTL in a mammalian population, and an investigation of their potential role in disrupting putative cis-regulatory elements. We observe a significantly higher fraction of CEGs to contain cis-SNPs compared to non-CEGs, and that the density of these SNPs is significantly higher in the CEGs. We have not conclusively proven the functional role of any polymorphism in regulating the expression of CEGs through in vitro or in vivo experimental validation, and many of the SNPs in the vicinity of the CEGs could be neutral (i.e. have no consequence on the expression levels). However, CEGs by definition should have variations near the genes themselves affecting their expression in a segregating population, and based on earlier work on human genes with promoter polymorphisms, it has been estimated that a sizable fraction of cis-variants (about one third of the SNPs in the promoters) may alter gene-expression [5, 51]. Therefore, it is reasonable to infer that the higher density of cis-SNPs in CEGs is associated with changes in expression of those genes, and one or more of the cis-SNPs would be responsible for causing variation of expression in a large fraction of the CEGs in the mouse F2 populations, although we do not exactly know how many causal regulatory cis-SNPs are in this set of CEGs.
We investigated the effect of a few other relevant biological factors which could give rise to cis-eQTLs in our data-set instead of (or in addition to) cis-SNPs in the promoters and non-coding regions, such as non-sense mediated decay (NMD), polymorphisms in exons, and genomic segmental duplications. We estimated that a very small fraction of our set of CEGs may arise due to NMD and segmental duplications (see Methods section for details). A higher fraction of CEGs was seen to contain SNPs in their exons (1081 out of 2047, p = 2.5 × 10-12 with Fisher exact test) relative to the non-CEGs. But an increased fraction of the CEGs that contained exonic SNPs also contained promoter and non-coding cis-SNPs relative to non-CEGs (p < 10-4 with Fisher exact test). Therefore, we did not exclude these genes from the analyses presented here (the purpose of which was to investigate potential regulatory SNPs in the non-coding and promoter regions). It is worth noting however that our analyses with the set of CEGs and non-CEGs which did not contain any exonic SNPs yielded results that were very similar to those obtained from the full set of CEGs and non-CEGs.
One of our objectives was to investigate the challenges involved in identification of the causative cis-regulatory SNPs through bioinformatic sequence analysis approaches. To that end, we have examined the propensities of the SNPs in potentially functional non-coding sequences (namely, mouse-human conserved sequences) and predicted transcription factor biding sites. A few recent studies have suggested the use of mouse-human alignments to identify putative candidate regulatory SNPs [40, 41], and although SNPs falling in these conserved regions have been shown to affect transcription , it is unclear whether a majority of the regulatory variants lie in these regions. In our analyses we find that the sequence variations around the CEGs are not specifically enriched in evolutionarily conserved non-coding and promoter sequences, and in fact in the majority of the CEGs all of the SNPs are outside of these highly conserved regions. It is possible that the causative regulatory SNPs lie further away in these CEGs (sequences that are > ± 5 Kb away from the genes) where they alter conserved regulatory elements (such as silencers or enhancers). However, the higher cis-SNP density in the immediate vicinity of these CEGs (relative to non-CEGs) suggests that a significant fraction of cis-regulatory SNP could lie outside of regions that are most conserved in mammalian evolution. As comparative genomics and phylogenetic footprinting approaches are frequently utilized in the searches for functional regulatory elements in mammalian genomes, and computational prediction of transcriptional regulatory elements is a difficult problem [28, 36, 44], the above observations imply that the identification of cis-regulatory variations in genetically segregating populations is likely to be difficult using sequence-driven bioinformatic approaches alone. Since the information for transcriptional regulatory networks is in part hard-wired in the genomic DNA itself through the array of regulatory elements , one can hypothesize that for CEGs where the cis-regulatory SNPs are not present in the highly conserved promoter or non-coding sequences, variations of expression may not cause a significantly perturbation of the transcriptional networks that have been conserved in the mammals. Experimental validation of a set of SNPs affecting putative regulatory sites in conserved and non-conserved regions (both in the immediate vicinity of the genes, and also some distances away), will be ultimately required to fully understand how the in cis-SNPs affect gene expression in segregating populations. This would consequently lead to a better understanding of the bioinformatic approaches that would be effective in identifying cis-regulatory variants.
Below we describe the examination of a set of characterized human cis-regulatory SNPs that are associated with inherited diseases. A collection of these rare examples is available from the rSNP_Guide database [42, 63]. From the examples presented in this database (see ) we collected a set of 33 cis-regulatory SNPs in 5 distinct genes (PROC, TNF, HGB2, GP1BB, and F7) that are: a) known to be underlying or associated with inherited human diseases, b) known or predicted to disrupt transcription factor binding sites, and c) have flanking sequences available from the Human Gene Mutation Database  so that they could be used for mapping reliably to the human genome assembly. An examination of the locations of these regulatory SNPs indicates that 10 SNPs in 2 genes are in the human-mouse conserved regions, while the remaining 23 SNPs in 3 genes are in non-conserved regions. In addition to the above cases that were taken directly from rSNP_Guide database, from two recent publications we examined the non-coding SNPs (three in total) within or around two genes, namely INSIG2  and TCF7L2 , that represent extremely rare examples of variants associated with complex disease (obesity and diabetes respectively in this case) that have been validated across diverse and multiple human cohorts. These causative SNPs were also outside of human-mouse conserved regions. This is a small sample of genes to draw concrete conclusions from; however this observation with the well characterized human SNPs supports our similar finding from the mouse data and suggests that a large fraction of the causative cis-regulatory SNPs, including those that are associated with inherited disease, could be outside of the sequences that are highly conserved in mammalian evolution.
A higher fraction of CEGs had TFBS predictions perturbed by SNPs in the non-coding region. No significant difference was observed in the fraction of total SNPs affecting binding sites between CEGs and non-CEGs, suggesting that the higher fraction of perturbations of TFBSs in the CEGs was a consequence of increased cis-SNP density.
Several factors are likely to have confounded our study with predicted TF binding sites: 1) it is possible that the false positive rate of TFBS prediction is high, even after requiring correlations of TFs to putative target genes; 2) at the time of our analysis binding site models for only ~300 vertebrate TFs were available from the TRANSFAC database, whereas the number of distinct TFs in mammals are estimated to be around 2,000 ; therefore prediction of DNA binding sites for the majority of TFs was not possible; 3) although often enriched in the immediate promoter region, transcription regulatory elements in mammals can be spread over large distances (sometimes more than 100 Kb ) whereas in this present study we considered only the immediate vicinity of the genes (± 5 Kb); 4) transcriptional regulatory elements frequently act in conjunction with others forming regulatory modules where multiple TFs bind DNA (involving both protein-DNA as well as protein-protein interactions); consequently the change in score of one individual binding site by an overlapping cis-SNP may often fail to reflect the extent to which transcription is affected by that mutation. It would clearly be useful to re-analyze the data when significantly more experimentally verified murine transcription factor binding sites are available in order to obtain a better understanding of how cis-variations specifically affect the transcriptional machinery, and whether a larger fraction of SNPs in the CEGs perturb known binding sites.
It is of note that in a recent study of Saccharomyces cerevisiae segregants it was shown that genes having cis-linkage contained a higher frequency of SNPs in promoters and 3' UTR sequences . The study also found moderate evidence for enrichment of SNPs in TFBS sequences that were mapped using the ChIP-chip technology . The yeast study with TF binding sites was not confounded by some of the factors mentioned above, since most of the regulatory sequences were experimentally determined, and a comprehensive set of DNA binding sequences for almost all of the yeast TFs were available .
Obviously, in addition to TFBSs, other classes of regulatory sequences, e.g. those affecting transport from the nucleus, mRNA stability or decay, RNA mediated regulation, and those potentially involved in epigenetic regulation of gene expression, could be affected by cis-SNPs, which we have not studied here. Coding SNPs that cause changes in the protein structure can act in trans to influence expression through a feedback loop as shown recently for the AMN1 gene in S. cerevisiae . Such cases were also not studied here.
The specifiCity of the identification of cis-acting eQTLs was unknown. Recently Doss et al. , gives a lower bound estimate of the true positive rate for the BXD cross (64%); we have used more stringent thresholds for identification of putative cis-eQTLs in our study to increase the true positive rate, so we anticipate the true positive rate would be higher than 64%, but the exact number is not known. The specifiCity of TFBS predictions is unknown; moreover binding sites for the majority of the TFs could not be predicted because TRANSFAC® PWMs were unavailable. It is likely that additional SNPs exist between the strains we studied that had not been identified in the databases we used in our study . Even with these unknown factors in our current study, several observations have been made that shed light on the nature of variation in genes showing cis-linkage in segregating populations, as well as the bioinformatic challenges that are involved in characterizing the non-coding cis-regulatory polymorphisms using computational sequence analysis strategies.
The problem of identifying and annotating the functional cis-regulatory polymorphisms is a difficult one that will require various experimental as well as computational approaches to address. Our understanding of cis-regulatory variations and their biological role would benefit from in-vivo experimental evaluation of the contribution of polymorphisms around CEGs towards changes in gene expression, characterization of more regulatory elements in the genome (which is severely limited at this time), examination of the multi-species genome alignments, and more accurate prediction of the (transcriptional and other) regulatory elements. The data on CEGs and cis-SNPs that we supply here (supplementary information) will provide a valuable resource for further exploration in this area.
The analyses of cis-SNPs in the promoters and non-coding regions around cis-acting eQTL genes (CEGs) in mouse F2 populations indicate that a significantly higher fraction of CEGs contain cis-SNPs compared to non-CEGs. CEGs also contain higher SNP density in the promoters and non-coding sequences relative to the non-CEGs. Since non-coding sequences that are conserved in mammalian evolution are often biologically functional, the propensity of cis-SNPs in the promoter and non-coding regions that are most conserved between mouse and human was examined. A majority of the CEGs having cis-SNPs did not contain any cis-SNP in these conserved regions, and in the CEGs that contained cis-SNPs in conserved regions, the enrichment of cis-SNPs occurred both in conserved as well as non-conserved regions. This suggests many of the cis-regulatory SNPs underlying eQTLs and responsible for causing gene-expression changes in segregating populations could lie outside of the sequences that are most highly conserved in mammalian evolution. To investigate the possible biological role of the cis-SNPs in disrupting the transcriptional regulatory elements, we studied the perturbation of the predicted transcription factor binding sites (TFBSs) by the cis-SNPs. Relative to non-CEGs, a significantly higher fraction of CEGs harbor cis-SNPs that perturb the predicted TFBSs. However the fraction of cis-SNPs in the CEGs affecting the binding sites is not higher, suggesting that the increased incidence of TFBS perturbation in the CEGs is due to the higher cis-SNP density. These observations imply that the identification and annotation of cis-regulatory variations in genetically segregating populations is likely to be difficult using sequence-driven bioinformatic approaches alone.
The UCSC mouse genome assembly mm4  (NCBI build 32) was downloaded and used for all mapping purposes. All mouse mRNAs, cDNAs and ESTs were aligned to the mm4 assembly and clustered to produce gene and exon coordinates as described in detail previously [22, 23] (for gene and exon coordinates see Additional file 5). Celera (release 3.4)  and public reference (dbSNP build 120 ) mouse SNPs were mapped onto the mm4 assembly using BLASTN as described in . Those SNPs between the strains C57BL/6J and DBA/2J that mapped uniquely to the autosomes, had allele count > 1, and allele frequency ≥ 10%, were used for analysis. SNPs in repeat regions were removed (repeat coordinates in mm4 were downloaded from UCSC annotation server), which left a total of 484,727 Celera and 24,332 dbSNPs. 16,809 SNPs between the Celera and dbSNP databases were identical in genomic location, leaving a unique, non-redundant set of 492,250 SNPs. This data set was used for analyses of cis-SNPs.
Genomic regions that are identical by descent (IBD) between mouse strains were taken directly from Cervino et al. . Cervino et al. used a window of 50 Kb which was moved through the genome at 10 Kb intervals; regions in which fewer than five consecutive SNPs were observed between two strains were identified as blocks that were IBD. Genes were taken with 5 Kb flanking regions (± 5 Kb), and if the gene or its flanking region overlapped with regions of IBD, the gene was considered to be in an IBD region.
Mouse (UCSC assembly mm4) and human (UCSC assembly hg16) genome alignments (axtTight track) were downloaded from the UCSC mouse-human alignment download site . In the axtTight track, mouse and human genomes were aligned using BLASTZ  and post-processed to obtained the best alignments for each region. The amount of mouse genome covered in axtTight track is ~6%.
Both the crosses under study here have been described earlier [15, 16]; we simply describe the key features of those crosses in brief here for the benefit of the readers. In the BXD cross, an F2 population consisting of 111 mice was constructed from a cross of two inbred strains of mice, C57BL/6J and DBA/2J [10, 15]. Only female mice were maintained in this population. At 16 months of age the mice were euthanized and their livers extracted for gene expression profiling. The mice were genotyped at 139 microsatellite markers uniformly distributed over the mouse genome to allow for the genetic mapping of the gene expression and disease traits. The BXH F2 mouse population was constructed from C57BL/6J ApoE null (B6.ApoE-/-) and C3H/HeJ ApoE null (C3H.ApoE-/-) mice [16, 19]. F1 mice were generated from reciprocal intercrossing between B6.ApoE-/- and C3H.ApoE-/-, and F2 mice were subsequently bred by intercrossing F1 mice. A total of 334 (169 female, 165 male) were bred. Mice were sacrificed at 24 weeks and four tissues (liver, white adipose, whole brain, muscle) were extracted for mRNA profiling. Genomic DNA was isolated from kidney. A linkage map for all 19 autosomes was constructed using 1032 SNPs markers, giving rise to a genetic map with an average density of 1.5 cM. Genotyping was conducted by ParAllele using the molecular-inversion probe multiplex technique .
All of the expression data from the two crosses we have used here for eQTL analyses were generated and described previously [6, 10, 16]. For the BXD and BXH crosses expression measurements were available for 21,740 and 21,640 transcripts, respectively, representing 18,774 and 19,197 distinct coding genes which mapped uniquely to the 19 autosomal chromosomes. 12,597 genes were common between the microarrays used to profile each cross.
Although the actual mRNA profiling experiments were described earlier [10, 16], a summary of the method is given below for the reader's information. Total RNA from the BXD and BXH samples was purified from 25-mg portions using an RNeasy Mini Kit according to the manufacturer's instructions (Qiagen, Valencia, CA, USA), as previously described for the BXD set . Fluorescently labeled cRNA (5 mg) from each F2 animal in each cross was hybridized against a pool of RNAs specific to each cross. The RNA pools for each cross were constructed from equal aliquots of RNA from all animals in the BXD cross and 150 randomly selected animals in the BXH cross. Array images were processed as previously described to obtain background noise, single-channel intensity, and associated measurement error estimates . Expression changes between two samples were quantified as log10(expression ratio), where the "expression ratio" was taken to be the ratio between normalized, background-corrected intensity values for the two channels (red and green) for each spot on the array. An error model for the log ratio was applied as previously described to quantify the significance of differential expression between two samples .
Expression level for each gene was treated as a continuous variable and mapped to the genome using interval mapping. QTL mapping in the BXD cross was done as described [6, 10]. For BXH, QTL mapping was done with the QTL-Cartographer suite of programs  using the established interval mapping  procedure. Since each experiment was hybridized in flour reverse pairs, log10 of the two expression ratios was taken and averaged to get the expression level (called ml-ratio for mean log ratio) for a gene. When males and females were treated as one combined group (in order to increase the power to detect linkages with increased number of animals), the gender effect on expression was accounted for by subtracting the gender-specific mean from each expression value. Specific thresholds for selecting CEGs are given in the results section.
For some genes, the probes on the microarray, when mapped to the mouse genome (UCSC mm4  or NCBI build 32), overlapped with SNPs. For these genes (which consisted roughly 4.6% of the total number of genes represented on the microarrays), cis-eQTLs could simply arise due to polymorphisms in the probe sequences influencing hybridization of the mRNA to the microarray, rather than non-coding cis-variations influencing their expression. Such genes were therefore removed from the list, in order to minimize the false positive calls on CEGs.
Nonsense mediated decay (NMD) is a mechanism of mRNA surveillance that ensures rapid degradation of transcripts with premature stop codons . Therefore some CEGs may not have non-coding cis-regulatory variation but instead contain nonsense mutations that result in NMD, which is detected as a cis-eQTL event. The Celera mouse SNP database  (from which most of the SNPs for our analysis of cis-variants were taken) provided the annotation on SNPs that cause nonsense mutations; 63 distinct genes represented on the microarrays used for the crosses were annotated as having nonsense mutations. Only 4 CEGs from the BXH cross (~0.1% of all CEGs from that cross) and none of the CEGs from the BXD cross were annotated as having SNPs resulting in nonsense mutations. The enrichment of genes having SNPs annotated as causing NMD in our list of 2,047 CEGs (Table 1) was not significant (p = 0.33 with the Fisher exact test). Since this fraction was small and the p-value not significant, we are confident that NMD did not introduce any bias in our results. In addition, the genes containing nonsense SNPs could still have non-coding cis-variations affecting their expression. In our analysis we therefore did not exclude the genes with nonsense SNPs.
In addition to the factors discussed above, variations in segmental duplications in the genome may affect expression. A rough map of genomic duplications is available for the B6 strain , however no map of the variations of genomic duplications between mouse strains used to construct the F2 crosses is available, and it is not known whether significant variations exist between mouse inbred strains in terms of genomic duplications. The analysis that one can do with SNPs (which gives variations between strains) is therefore not possible with the genomic duplications. Nevertheless, we looked to see if CEGs contained an increased number of genes that were in the duplicated regions using the available map of segmental duplications from B6 . From the genomic coordinates of the segmental duplications [74, 75], we obtained the list of genes that were contained within these regions. A total of 123 distinct genes that were represented on our microarrays were within the duplicated regions. We then checked whether these genes are over-represented in the CEGs using the Fisher exact test. Only 11 of the 2,047 CEGs were in regions that underwent segmental duplications, and we found no evidence of enrichment of these genes in our list of CEGs (p = 0.45 with Fisher exact test). Since the maps of duplications in other mouse strains are not available, it is not possible to check whether the CEGs are enriched for genes contained in the duplicated regions of C3H or DBA. It is of note however, even if the duplicated regions in the C3H and DBA did not overlap those in the B6 strain (but the extent of duplications remained roughly the same between the different mouse strains), we would still observe a very small fraction of CEGs to be in these regions.
In addition to checking for the enrichment of genes known to be located within segmental duplication regions in our list of CEGs, we employed a different strategy to check whether a significant fraction of CEGs could arise due to variations in duplications. This analysis was based on the hypothesis that if a certain region on the genome, containing multiple CEGs, was duplicated in one of the parental strains involved in a cross but not the other (i.e. variations of duplication between parental strains), and if this duplication was responsible for giving rise to cis-eQTLs in an F2 population, then the CEGs contained in the duplicated region would all show the same sign of the additive component of their eQTLs. This is because the F2 mice containing the duplicated region would always be expected to have higher levels of mRNA (having multiple copies) for all the genes contained within that region. In other regions of the genome, where duplication was not responsible for differential mRNA levels between the parental strains or the F2 animals, the signs of the eQTLs for tandem genes on the genome would be expected to be random, and follow a simple Binomial distribution. In the BxH cross, which contained the majority of CEGs, we observed the instances of 3 or 4 tandem CEGs having the same sign of their cis-eQTLs. If the signs of the eQTLs for CEGs were completely random, the probability of 3 tandem CEGs having the same sign of their eQTLs would be 0.25, the probability of 4 tandem CEGs having the same sign of their eQTLs would be 0.125. Using a Binomial distribution we did not observe any deviation from these probabilities in our data (p > 0.1). This suggests that in our set of CEGs, variations in long duplication regions between strains (containing 3 or more CEGs), was not an important factor in giving rise to a significant fraction of the cis-eQTLs.
The IBD map used in this study  was constructed by looking at 50 Kb sequence windows in the different strains (moving the window through the genome at 10 Kb steps), and identifying as IBD regions the windows that had fewer than five consecutive SNPs . Although this provides a comprehensive IBD map for multiple mouse strains for which complete genome sequences are not yet available, the IBD segments defined in this way are coarse as they have been derived using a ~10,300 SNP genotype map, whereas there are more than 2.5 million SNPs reported in the different mouse strains . While extensive genome sequence coverage is available for the B6 and DBA strains [20, 33], allowing for a high-resolution IBD map to be constructed between these two strains, the BXD cross provides a small fraction (338 out of 4,107) of the total CEGs considered in this study. On the other hand, the BXH cross provides a far richer set of CEGs (3,769 out of the 4,107 considered in this study), but the complete C3H (C3H/HeJ) genomic sequence is not available. Therefore, we chose to leverage one of the previously published comprehensive, lower resolution maps based on a consistent set of SNPs genotyped in the B6, DBA, and C3H strains [19, 76]. The utility of the IBD map used here  has been demonstrated by its successful application in the identification of a causal disease gene , and we anticipate that our conclusions will remain the same (although some of the specific numbers presented here may change) with a finer IBD map that will become available at a later date.
With the IBD map used here  (built with a set of around ~10,300 SNPs), we examined the amount of genomic sequence, and the numbers of SNPs and genes falling within and outside of the IBD blocks. Of the 2.5 Gb mouse genomic sequence , 1.14 Gb (~45.6%) fell within IBD blocks between B6 and DBA, whereas 1.01 Gb (~40.4%) fell within IBD blocks between B6 and C3H. Of the 492,250 SNPs compiled largely from the Celera mouse sequence database  that were outside repeat regions and polymorphic between B6 and DBA, 404,095 (~82.1%) fell in regions that were not in IBD (nIBD). Of the 18,774 autosomal genes represented on the BXD microarray, 10,259 (54.6%) were contained within regions that were nIBD between the two parental strains B6 and DBA (which is according to expectation, since 100–45.6 or 54.4% of the total genomic sequence is in nIBD regions between these two strains, as noted above), whereas 279 of the 338 CEGs (82.5%) were in nIBD (p = 8.87 × 10-28 as determined by the Fisher exact test). Of the 19,197 autosomal genes represented on the microarray for the BXH study, 11,628 (60.6%) genes were in nIBD regions between B6 and C3H (again according to expectation since 100–40.4 or ~59.6% of mouse genomic sequence is in nIBD regions between these two strains), whereas 3,219 out of the 3,769 CEGs (85.4%) were nIBD between these two strains (p = 4.08 × 10-296, by the Fisher exact test). The significant enrichment of CEGs in nIBD regions is expected since there should be cis-variants near the CEGs that result in cis-linkage by altering one or more regulatory elements, and by definition these variants are likely to be largely biased towards the nIBD regions.
For finding the overlap of known transcription factor binding sites with SNPs, 30 bp sequences around the SNPs (total 61 bp) were taken, and the experimentally determined human, rat and mouse binding sites from the TRANSFAC® database  were mapped to these sequences (both B6 and DBA alleles) using BLASTN. Only sites which mapped with a threshold of 95% identity were kept. 16 distinct binding sites mapped in this way overlapped with SNPs between B6 and DBA. However of these, none overlapped with cis-SNPs that were in the promoter or non-coding region of any of the genes.
Consequently, transcription factor binding site predictions were made on the 61 bp sequences around the SNPs as described below. Although TFBSs are often short, we took a length of 30 bp on either side of the SNPs, since in the TRANSFAC® database the longest vertebrate position weight matrix was 30 bp. Site predictions were made with the vertebrate position weight matrix models (PWMs) available from the TRANSFAC® database (version 6.3)  using the MATCH™ software . The individual TFBS prediction cutoff scores were given by TRANSFAC® (based on an algorithm that minimizes the sum of the false positive and false negative rates of predictions using known sites). With the application of the TRANSFAC thresholds, the scores of individual sites ranged from 0.751 to 1. For each SNP, two sequences were generated containing the B6 and DBA alleles. Both sequences were scored and the change in score of the binding site prediction due to a SNP was recorded. In the analyses we have reported, we did not chose a threshold for the score difference due to a SNP intersecting with a TFBS prediction. Once a binding site was predicted with the threshold given by TRANSFAC, any change to the score of that site was considered as a potential perturbation that could represent an alteration of binding of the TF to that site leading to a change in expression. We also used an increased threshold of 0.01 for score changes to a site by an SNP, and that provided very similar results and identical conclusions (data not shown).
Expression profiles of all known mouse genes were determined over 145 tissues and cell lines (called 'Body atlas' data set) as described previously [49, 50]. Spearman (rank order) correlations were determined between the mRNA levels of each of the known transcription factors and all other genes and correlates with p-value < 0.01 were stored for analysis.
We thank John Lamb and Thomas Drake for helpful discussions. Aldons J. Lusis is thanked for helpful comments on the manuscript. Barmak Modrek, Archie Russell, Jin Ma and Jun Zhu thanked for technical support. This work was supported in part by NIH grants HL30568 and DK 071673.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.