cis sequence effects on gene expression

Background Sequence and transcriptional variability within and between individuals are typically studied independently. The joint analysis of sequence and gene expression variation (genetical genomics) provides insight into the role of linked sequence variation in the regulation of gene expression. We investigated the role of sequence variation in cis on gene expression (cis sequence effects) in a group of genes commonly studied in cancer research in lymphoblastoid cell lines. We estimated the proportion of genes exhibiting cis sequence effects and the proportion of gene expression variation explained by cis sequence effects using three different analytical approaches, and compared our results to the literature. Results We generated gene expression profiling data at N = 697 candidate genes from N = 30 lymphoblastoid cell lines for this study and used available candidate gene resequencing data at N = 552 candidate genes to identify N = 30 candidate genes with sufficient variance in both datasets for the investigation of cis sequence effects. We used two additive models and the haplotype phylogeny scanning approach of Templeton (Tree Scanning) to evaluate association between individual SNPs, all SNPs at a gene, and diplotypes, with log-transformed gene expression. SNPs and diplotypes at eight candidate genes exhibited statistically significant (p < 0.05) association with gene expression. Using the literature as a "gold standard" to compare 14 genes with data from both this study and the literature, we observed 80% and 85% concordance for genes exhibiting and not exhibiting significant cis sequence effects in our study, respectively. Conclusion Based on analysis of our results and the extant literature, one in four genes exhibits significant cis sequence effects, and for these genes, about 30% of gene expression variation is accounted for by cis sequence variation. Despite diverse experimental approaches, the presence or absence of significant cis sequence effects is largely supported by previously published studies.

We used previously generated genomic resequencing data and, for this study, quantified in vitro transcript levels from thirty unrelated individuals at several hundred candidate genes commonly studied in cancer research. We identified a subset of candidate genes with abundant sequence and gene expression variation. We evaluated potential cis sequence effects using individual single nucleotide polymorphisms (SNPs), all SNPs at a candidate gene considered jointly and haplotype phylogenies and diplotypes. We compared our findings to the published cis sequence effects literature and to the existing gene expression regulation literature available for those candidate genes that exhibited cis sequence effects.

Results
Gene expression data quality Thirty lymphoblastoid cell lines drawn from the SNP500Cancer resource were cultured in triplicate and total RNA extracted [see Additional file 1]. Gene expression profiling was performed on the N = 90 samples using a custom Illumina Sentrix ® Array Matrix-96 microarrays containing 50 mer probes targeting 697 genes relevant to cancer research [see Additional file 2]. Gene expression data from one array were excluded from further analysis due to a within individual cell line linear r 2 correlation for all genes of <95%, while all remaining within individual cell line correlations were ~99%. This correlation statistic reflects variation at all levels of the experiment: cell culture, RNA extraction, RNA labeling, and array performance. Based on these high quality data, further analysis used normalized mean gene expression signal data from the three replicate arrays.

Selection of genes for analysis of potential cis effects
We used two threshold criteria to select genes from the N = 697 genes for further analysis: a signal amplitude of ≥100 normalized units (38% of the genes); and a between cell line coefficient of variation (CV IC ) of ≥20% (32% of the genes). These criteria identified N = 95 genes with sufficient, and sufficiently variable, gene expression for further analysis. We used the criterion of ≥2 SNPs per candidate gene in the N = 30 DNAs to identify N = 522 candidate genes with sufficient sequence variation for analysis. We then compared the N = 95 genes with sufficient gene expression variance and N = 522 genes with sequence variation derived from the SNP500Cancer resequencing program, a component of the Cancer Genome Anatomy Project of the National Cancer Institute [28,29] and identified a subset of N = 32 genes (4.6% of the total sample) for the analysis of potential cis sequence effects (Table 1).

Gene expression of selected genes
The mean (standard deviation) rank-invariant normalized gene expression signal of the subset of N = 32 genes was 580 (175) units (Table 1), the mean (standard deviation) intra cell line replicate culture correlation coefficient was 0.78 (0.11), and the mean (standard deviation) between cell line coefficient of variation (CV IC ) was 0.33 (0.15). Mean noise (3 standard deviations of 20 negative control probes built into the array) was 20 units +/-4 units. PCNA had the largest mean gene expression, TP73L had the smallest gene expression and the largest CV, and PTEN had the smallest CV.

Selection of SNPs at genes for association analysis
The number (mean, standard deviation) of polymorphic SNPs per gene ranged from 2 to 50 (9.9, 10.6) and the number of tag (minor allele frequency minimum of >5% and with an r 2 threshold ≥ 0.80) and singleton SNPs per gene ranged from 0 to 15 (4.3, 3.7) in the group of N = 32 genes selected for association analysis (Table 2). Two genes (EGR1 and GADD45A) were excluded from further analysis as there were no SNPs at these genes with minor allele frequencies >5%. Two SNPs with genotype completion (attempted/completed) rates of 63% and 77% were excluded. After these exclusions, there were N = 126 tag and singleton SNPs at 30 genes available for analysis with a mean (standard deviation) genotype completion rate of 98.3% (1.1%) [see Additional file 3]. The distribution of SNP genotypes in the N = 30 cell lines was evaluated for Hardy Weinberg Equilibrium using asymptotic and exact tests; there were no SNPs exhibiting exact test P values < 0.01, and two exhibiting exact test P values < 0.05, both were at TP73L. The distribution of the flanking, untranslated region, coding and intronic SNPs was 27%, 15%, 12% and 46%, respectively. Four genes (IRF1, MSH2, MYC, OAS1) had only one informative tag or singleton SNP and were excluded from haplotype-based analyses, leaving N = 26 genes with a range of 2 to 14 tag and singleton SNPs for haplotype based analyses. The gene with the largest number of tag and singleton SNPs (CYP1B1, N = 14 tag and singleton SNPs) generated too many terms for the implementation of the SAM model to compute. The remaining N = 25 genes, comprised of N = 108 tag and singleton SNPs (range 2-10 SNPs/gene), have results from all SNP and haplotype based methods. The number of SNPs before and after selection of tag and singleton SNPs were uncorrelated (Spearman's rho) with mean normalized signal, the standard deviation, intra cell line correlation coefficient, or inter cell line coefficient (data not shown).
Significant cis sequence effects identified via three association methods N = 10 individual SNPs at six genes were significantly associated with gene expression signal, three genes exhibited significant association to gene expression in the single additive model, and three genes exhibited one or more statistically significant haplotype partitions (Table 3,  . The distribution of flanking, untranslated region, coding and intronic SNPs significantly associated to gene expression (30%, 10%, 10% and 50%, respectively) was not significantly different from the distribution of SNPs not exhibiting association to gene expression (Fisher exact test, data not shown). Three (BIC, FCGR2B and NBS1) of the six genes identified using individual SNPs were identified by one or both of the two gene-based association methods. Three genes (BIC, MYBL2, and PCNA) exhibit significant association in one or two methods and only a trend towards statistical significance in another method.

SNPs in hybridization probes
We evaluated whether the 50 base pair oligonucleotide probe sequences for the eight genes that exhibited significant cis sequence effects exhibit sequence variation in the N = 30 individuals in this study [see Additional file 1]. There were two probes on the array for each gene, and all but two of the sequences (MYBL2 probe IDs = 2099, 3155) have been sequenced in the SNP500Cancer DNAs. Two of the sixteen probe sequences exhibit variation in dbSNP (probe ID = 2168 at CYP1B1 and probe ID = 3562 at PCNA), however neither exhibits variation in the N = 30 DNA samples investigated here.

Literature data on cis sequence effects at candidate genes in this study
We reviewed the multi-gene cis sequence effects literature that has investigated either allelic imbalance within individuals (AI), generally defined as expression ratios ≥1.5 or ≤0.67 at transcribed heterozygous SNPs, or SNP-wise linkage or association in related or unrelated samples of individuals, to gene expression [12][13][14][15][16][17][18][19][20][21][22][23][24]. We compared results at the N = 14 genes studied in common by this study and at least one additional multi-gene cis sequence effects study [18,[20][21][22]24] (Supplementary File 4). To avoid issues of publication bias, we did not compare our results to the single gene cis sequence effects literature.

The proportion of genes exhibiting cis sequence effects
We identified nineteen publications that have investigated cis sequence effects in a multi-gene approach, one of which summarizes results from six other publications [12][13][14][15][16][17][18][19][20][21][22][23][24]. These thirteen publications report results at an average of 457 genes (median of 247, range of 13 to 2,726 genes, standard deviation of 724). The weighted average proportion of genes tested in these thirteen multi-gene studies considered by their authors to exhibit allelic imbalance or statistically significant cis sequence effects is 26.2% (unweighted average is 25.7%). We observe statistically significant cis sequence effects at eight of thirty genes (26.7%) in our study, which is similar to that observed in the literature.

The proportion of gene expression attributable to cis sequence effects
The literature presents a variety of linkage and association techniques to estimate the proportion of gene expression variation accounted for by individual SNPs [15,17,18,22], expressed either as an allelic ratio [15,18] or as a proportion of variance explained [17,22]. Mean r 2 estimates from N = 14 [17] and N = 62 [22] genes exhibiting significant cis sequence effects are 35% and 27%, respectively. The average proportion (standard deviation) of gene expression variance explained by individually significant SNPs in the SNP-wise regression analysis [see Additional file 3] in this study was 21% (7%). The average proportion of gene expression variance (standard deviation) explained by the most significant haplotype partitions at FCGR2B, LMO2 and PCNA was 26% (7%). Thus, current technical approaches suggest that approximately one-quarter to *There was one SNP with maf > 5% at these two genes; tag SNP analysis was not performed. The single SNP with maf > 5% at each gene was used for SNP-based analyses.
one-third of gene expression variation is attributable to cis sequence effects.  [32,33] identified three genes; one gene, FCGR2B, was identified in common. The latter two methods incorporate correction for multiple tests in their estimates of the significance of association. Rather than apply any formal correction(s) for the multiple comparisons at a gene after regression analysis on SNPs, or after using multiple methods, we compare and contrast the results obtained from using the three methods. Differences between results obtained analyzing individual SNPs and the two methods that apply multiple test correction at the level of the gene suggests that much of the evidence for significant cis sequence effects in this sample of N = 30 LCLs is too weak to survive multiple test correction, emphasizing the necessity to apply multiple test corrections to avoid elevated Type I error [22]. We also observed two examples of a gene exhibiting significant cis

Biological relevance of cis sequence effects -FCGR2B as an example
The focus of this study is to evaluate association between variation in DNA sequence and in vitro RNA transcription in a group of candidate genes commonly studied in cancer research. We briefly review some of the recent functional genomics literature for the candidate gene FCGR2B [see Additional file 5] and suggest below how a review of relevant genomic data and our cis sequence effects findings at FCGR2B might inform our understanding of this literature, as an example of how our findings might influence future FCGR2B SNP association or functional analyses. The SNP500Cancer program resequenced portions of IVS1, Exon 2, IVS2, IVS6 and Exon 8 of FCGR2B, yielding N = 9 FCGR2B SNPs available with a minor allele frequency of ≥5% for analysis of gene expression variation. After selecting one tag and five singleton SNPs to reduce the number of statistical tests performed with minimal loss of information [34], significant cis sequence effects were observed at FCGR2B SNP rs17412751 (IVS1-91C>T) ( Table 3 and Additional file 4). The minor allele frequency of rs17412751 in our sample of N = 30 was 10%, similar to the minor allele frequencies of FCGR2B promoter and transmembrane SNPs previously studied, however, it should be noted that rs17412751 is a singleton SNP, i.e., not strongly associated with the other SNPs available at FCGR2B. Linkage disequilibrium (LD) within the FCGR2B genomic region in Caucasian samples extends from the 3' end of IVS 1 distally in a punctate fashion, and there is some evidence for a separate 5' region of LD proximal to Exon 1 (data not shown). rs17412751 has been genotyped by Hinds et al., 2005 as afd1120510 [35] with a minor allele frequency of 11%, and within their sample of N = 24 European American DNAs, this SNP exhibits strong LD with one additional SNP (rs17404379, afd1120529). However, both of these SNPs map to both intronic regions within the FCGR2B locus and also within the FCGR3B/FCGR3A intergenic region some 70 kbp proximal, suggesting that high sequence homology may be interfering with accurate map assignment. There are two recent reports of copy number variation (CNV) in the region that are relevant: CNV of the FCGR3B locus is associated with nephritis in a rat model and in human patients [36], and analysis of SNP genotypes and genomic hybridization with the HapMap sample has identified a 256 Kbp region as human copy number locus CNV_ID_62 containing the FCGR2A, HSPA6, FCGR3A, FCGR2B and FCRLM1 loci [37].
Haplotype phylogenies and significant haplotype partitions at FCGR2B, PCNA and LMO2 Figure 1 Haplotype phylogenies and significant haplotype partitions at FCGR2B, PCNA and LMO2. Haplotype phylogenies are represented together with the SNP allele configuration and the count of haplotypes in the sample. Statistically significant haplotype partitions in the phylogeny are indicated by a vertical or horizontal bar, while an arrow indicates a SNP that exhibits significant association via regression analysis. The haplotypes at FCGR2B, LMO2 and PCNA were constructed using the following SNPs: rs12145988, rs17412751, rs922087, rs2298020, rs1674761, rs844; rs17352 and rs25406; and rs3740616, rs3740617, rs2273797, rs2038602, rs9282776, rs3781577, rs3758640, rs3758641. Our data at FCGR2B is concordant with data generated from both in vitro and ex vivo experimental strategies that sequence variation in the promoter [38,39] and the coding region [24] is associated with gene expression differences. Further, our data contributes to the evidence that a minor allele frequency of ~10% characterizes the SNPs that are associated with FCGR2B gene expression differences. The inconsistent directionality of effect of the minor allele may be due to high sequence homology at the CNV_ID_62 locus affecting the physical and linkage disequilibrium mapping of the region, or may be due to incomplete linkage disequilibrium between the analyzed SNP and a unanalyzed SNP that may be causing the effect. Individuals wishing to investigate regulation of gene expression at FCGR2B in the future should include approaches necessary to characterize the physical, linkage disequilibrium, transcriptional and copy number architecture of the region.

Strengths and limitations
Strengths of this study include: high quality gene expression data from triplicate cell cultures for each lymphoblastoid cell line (LCL), with standardization of culture, RNA extraction, labeling, amplification and hybridization conditions; the use of sequence-verified SNPs and the resequencing of nearly all expression array probe sequence regions; the use of multiple methods to evaluate evidence for significant cis sequence effects; and comparison of the results observed in this study to the published cis sequence effects literature at both the gene and SNP levels.
Limitations of this study include: the use of Epstein-Barr virus-transformed LCLs; the modest number of LCLs used; the association paradigm; and the absence of genetic assays that evaluate copy number in our sample. Limitations of LCLs as reagents for the investigation of gene expression regulation include gene expression in a virally transformed surrogate ex vivo tissue, which may influence and potentially eliminate Epstein-Barr virus infection associated genes from gene expression investigation [40,41]. Systematic comparison of gene expression from LCLs and non-surrogate, minimally processed tissues, e.g., peripheral blood lymphocytes, could be an approach towards validation of gene expression findings made in LCLs [40]. The modest number of cell lines used in this study limits statistical power, and is less than the number of cell lines used by some researchers [20][21][22], however, the number of genes evaluated was also modest. The association paradigm suffers from well known limitations [42]. Some publications testing large number of genes for cis sequence effects do not provide complete lists of genes tested or of genes exhibiting significant cis sequence effects on gene expression, therefore, we could not identify all genes studied in these reports. Also, due to the variety of approaches used in the literature, most comparisons are between categorical results of specific assays, i.e., it is generally not possible to compare quantitative data from different studies. The MYBL2 probe sequence regions were not resequenced in the N = 30 Caucasian DNA samples and thus the positive regression and TreeScan results at MYBL2 could potentially be a false positive result due to an unidentified SNP within the sequence complementary to these probes.

Conclusion
We tested for significant association between gene sequence variation and gene expression variation at N = 30 candidate genes in DNA and RNA from N = 30 LCLs. Significant association between cis sequence and gene expression variation was observed in 8 out of 30 genes, and accounted for 26% of gene expression variation in three genes evaluated using an analysis of variance approach. We utilized additive and analysis of variance (guided by haplotype phylogeny) analytical approaches, and suggest that approaches that permit modeling of allelic effects may identify effects missed by additive models, although larger multi-gene studies would clarify the relative utility of the two approaches. We reviewed the multi-gene cis sequence effects literature and found data on N = 14 of the candidate genes that we evaluated; most of that data is concordant with our results. Investigators using current technologies should expect to find cis sequence effects at about one quarter of candidate genes evaluated: these effects will explain about one quarter of gene expression variance. SNPs associated with gene expression can be preferentially selected for genotyping and analysis in genetic association studies, or nominated for functional genomic investigations to characterize their role in the regulation of gene expression.

Preparation of total RNA
For this study, we cultured N = 30 Coriell Cell Repository LCLs [see Additional file 1], in triplicate under standardized conditions; when cells per milliliter exceeded 2 × 10 7 , the cells were harvested, the pellets were washed once with PBS and frozen at -80°C. Same-lot cell culture reagents were used, with three technicians each dedicated to culturing N = 10 of these cell lines, with replicate cell cultures cultured in series. Total RNA was prepared from frozen cell pellets using the Qiagen RNeasy Midi-kit (Valencia, CA). Ten cell culture pellets were extracted at a time by a single technician with a single assistant; cell pellets were removed one-at-a-time from the -80°C freezer, quick thawed by rubbing between gloved hands, and Qiagen denaturant immediately added. Ethanol was added to each sample, vortexed, and the samples applied to Qiagen Midi columns, washed as specified, treated with RNasefree DNase "on-column", followed by additional washes before elution of the RNA with the provided buffer. After elution, sample volume was determined by weight, sodium acetate was added to 0.3 M, the sample was split and ethanol added at 3× volume to each aliquot and stored at -20°C.  [43], using the MessageAmp aRNA kit (Ambion, Inc., Austin, TX). Specific conditions of labeling, as well as array hybridization, washing and staining, and data extraction and processing were performed as described in Kuhn et al., 2004 [26], as were array processing and data extraction and processing. Array hybridization intensity signals were adjusted using a global background subtraction and rankinvariant normalization algorithm. All gene expression data generated for this experiment has been deposited in the Gene Expression Omnibus with Series accession ID = GSE8394.

SNP, haplotypes and haplotype phylogeny
We utilized sequence verified SNP data from N = 30 individuals from the Caucasian subsample of the SNP500Cancer database [28,29], which currently contains sequence data from >750 genes that have been partially resequenced in a sample of N = 102 DNAs. We selected SNPs as either tag SNPs or singleton SNPs using a minor allele frequency minimum of >5% and with an r 2 threshold ≥ 0.80. We reconstructed haplotypes using PHASE [44] using the tag and singleton SNPs with the following parameters: number of iterations = 10,000; thinning interval = 1; burn-in = 10,000. We performed haplotype phylogeny reconstruction using neighbor-joining with a uniform model of genetic distance in MEGA version 3.1 [45]. We searched for SNPs in genomic sequence complementary to probe sequence using Genewindow [46], using data from NCBI genome build 35.1, dbSNP build 125, and the SNP500Cancer resequencing program.

Association analysis with gene expression
We managed gene expression and SNP and haplotype data and performed descriptive analysis in STATA 9.0 (StataCorp LP, College Station, TX). We evaluated normality of gene expression using the Shapiro-Wilk test, logtransformed gene expression and used log-transformed values in each analysis method. The coefficient of variation between cell lines (CV IC ) was calculated as follows CV IC = (SD IC /μ)*100, where SD IC is the standard deviation between cell lines estimated from a one-way analysis-ofvariance model, and μ is the mean expression of the gene. The first analysis used linear regression, modeling gene expression as a function of each SNP separately, using an additive model to test for a trend across genotypes. The second analysis included all SNPs in a gene simultaneously and compared that model to a model without any predictors by means of a likelihood ratio test [30,31]. Third, we partitioned a haplotype phylogeny of each candidate gene to construct partition diplotypes and performed one-way ANOVA analyses of the quantitative gene expression trait associated with these partition genotypes to search for partitions that explain a statistically significant proportion of gene expression variation [32], using the software TreeScan [33]. We report the proportion of the gene expression variance explained by the partition diplotypes and the P value from the F statistic after correction by permutation and enforcement of monotonicity. While TreeScan performs a second round of testing for significant partitions conditional on partitions identified in the first round of analysis, no additional significant partitions were identified upon conditional analysis in this dataset. In this study, all P values are two-sided and must be <0.05 to be considered significant.

Authors' contributions
AWB designed the experiments, nominated genes for potential inclusion in the array, participated in data analysis of the gene expression data, and drafted and revised the manuscript. AB performed data analysis of gene expression data and helped revise the manuscript. TM selected the custom Illumina Sentrix ® Array Matrix-96 array gene content and participated in experimental design and data analysis. KK performed the gene expression experiments and participated in experimental design and data analysis. JK designed the array probes. RP provided guidance on all statistical procedures and revised the manuscript. PB participated in experimental design and directed cell culture, RNA extraction and quality control analysis of total RNA. KJ provided analytical support on tag and singleton SNP selection and HWE testing. BP provided downloads of the SNP500Cancer database for probe design of the gene expression array and for selection of genes with sequence variation in the SNP500Cancer Caucasian sample. SC participated in experimental design, nominated genes for potential inclusion in the array, and helped revised the manuscript. MY participated in experimental design, nominated genes for potential inclusion in the array, performed haplotype and phylogenetic reconstruction of SNP500Cancer candidate gene SNP data and helped interpret the results of tree scanning association analysis.