G-spots cause incorrect expression measurement in Affymetrix microarrays
© Upton et al; licensee BioMed Central Ltd. 2008
Received: 19 June 2008
Accepted: 18 December 2008
Published: 18 December 2008
High Density Oligonucleotide arrays (HDONAs), such as the Affymetrix HG-U133A GeneChip, use sets of probes chosen to match specified genes, with the expectation that if a particular gene is highly expressed then all the probes in that gene's probe set will provide a consistent message signifying the gene's presence. However, probes that contain a G-spot (a sequence of four or more guanines) behave abnormally and it has been suggested that these probes are responding to some biochemical effect such as the formation of G-quadruplexes.
We have tested this expectation by examining the correlation coefficients between pairs of probes using the data on thousands of arrays that are available in the NCBI Gene Expression Omnibus (GEO) repository. We confirm the finding that G-spot probes are poorly correlated with others in their probesets and reveal that, by contrast, they are highly correlated with one another. We demonstrate that the correlation is most marked when the G-spot is at the 5' end of the probe.
Since these G-spot probes generally show little correlation with the other members of their probesets they are not fit for purpose and their values should be excluded when calculating gene expression values. This has serious implications, since more than 40% of the probesets in the HG-U133A GeneChip contain at least one such probe. Future array designs should avoid these untrustworthy probes.
Microarrays are commonly used to measure gene expression. One of the most popular microarray platforms is the Affymetrix GeneChip. In GeneChip arrays probe sequences with a nominal length of 25 bases are created by photolithography. The probes are arranged in pairs: a so-called Perfect Match (PM) probe and a mismatch (MM) probe that is identical to the PM probe with the exception that the 13th base is the complement of that in the PM probe. Each pair of probes belongs to a probe set (typically of 11 or 16 probe pairs) with each probe set being intended to provide information concerning the expression of a single gene. For some genes there may be more than one dedicated probe set.
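The relationship between a PM probe and its MM partner can be written down directly. The following is a minimal Python sketch that derives an MM sequence by complementing the 13th base of a PM sequence; the function name `mismatch_probe` and its `position` parameter are illustrative choices and are not taken from any Affymetrix software.

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm: str, position: int = 13) -> str:
    """Return the MM sequence: the PM sequence with one base complemented.

    `position` is 1-based, matching the "13th base" convention in the text.
    """
    if not 1 <= position <= len(pm):
        raise ValueError("position lies outside the probe")
    i = position - 1
    return pm[:i] + COMPLEMENT[pm[i]] + pm[i + 1:]


# PM sequence of probe pm6 in probe set 31846_at (25 bases).
pm6 = "TCCTGGACTGAGAAAGGGGGTTCCT"
mm6 = mismatch_probe(pm6)
print(mm6)  # TCCTGGACTGAGTAAGGGGGTTCCT
```

Only the central base changes, so any run of guanines lying away from position 13 is shared by both members of the pair.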
There are a number of alternative software tools for calculating a single measure of gene expression for a probe set: e.g. MAS5, dChip, RMA and GCRMA. To calculate the value of the expression measure, all the probes (or at least all the PM probes) in a probe set are used. However, if there are probes that are known to be liable to provide misleading information, then these should be excluded from the analysis so as to give more accurate estimates of gene expression. The existence of large datasets such as that contained in the NCBI Gene Expression Omnibus (GEO) repository provides an opportunity to identify such probes. We report an analysis that unambiguously identifies a large class of more than 10 000 probes whose behaviour is not that intended. More than 40% of probesets contain one or more members of this family.
We have focused on the GeneChip oligonucleotide microarrays manufactured by Affymetrix. Since a major application of microarrays has been for the study of human diseases, we have concentrated our effort on data from the most popular human GeneChips, the HG-U133A arrays, though the results apply to all GeneChip arrays.
Within a probe set, subsets of probes may be measuring different exons and thus, potentially, different transcripts, implying that biological signals such as alternative splicing would need to be taken into account. We have therefore focused on groups of probes which map uniquely to the same exon. We have downloaded from GEO more than 6000 data sets relating to about 300 Gene expression Series (GSEs) and have then calculated the standard Pearson's (product-moment) correlation coefficients between relevant probe pairs.
Figure 1 shows the heatmap for the 16 perfect match probes in the probeset 31846_at, all of which relate to the same exon. Evidently probe 6 is not behaving in the same way as the other probes, since its values have near-zero correlation coefficients when matched with the values of these other probes. Looking at other heatmaps we found that such 'misbehaving' probes were not unusual. By listing the base sequences of these unusual probes we observed that a frequent feature was a sequence of four or more guanines. That such probes are typically poorly correlated with other members of their probe set had already been noted by others, who suggested that this might be due to the formation of G-quadruplexes. We will show that, although these probes are ill correlated with others in their probeset, they are well correlated with each other. This suggests that their behaviour, varying from array to array, must be a consequence of the method of preparation of the array.
We suggest that probes with sequences of four or more guanines should be ignored when analyzing the results of current GeneChip designs. We also suggest that future GeneChip designs should avoid including probes containing such sequences.
Results and Discussion
Correlation coefficients between pairs of probes
Correlation coefficients between the values of probes containing sequences of guanines
Upon listing the probes most highly correlated with probe pm6 from the 31846_at probe set (base sequence TCCTGGACTGAGAAAGGGGGTTCCT) it becomes apparent that there is a common theme: each probe contains a sequence of four or more consecutive Gs. For example, the pm1 probe in 219297_at (used in Fig. 3) begins with six Gs (GGGGGGATAGTCTTGTTTCTAGCTT). By contrast, in 31846_at, probes pm5 (GAACTCCACTGCAACAGACGGGCGC) and pm16 (TTCCCACCTGTCATACTGGTAACTG) contain sequences of only 3Gs and 2Gs, respectively.
Given that the high values of the correlation coefficients are a consequence of sequences of guanines in the probe sequence, two questions that immediately arise are 'Is the location within the probe of the consecutive run of guanines relevant?' and 'Is the number of consecutive guanines relevant?'. To get clear answers to these questions we focus on probes that have only one sequence of two or more guanines. We will refer to the location of the sequence within the probe as the G-spot and we now examine how the values of the inter-probe correlation coefficients are affected by the location and length of the G-spot.
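As an illustration of the selection rule just described, the sketch below locates runs of guanines in a probe sequence and reports the G-spot location and length for probes that have exactly one run of two or more Gs. The helper names `g_runs` and `single_g_spot` are hypothetical, and locations are counted from 1 at the first base of the sequence as written, which is assumed here to be the 5' end.

```python
import re


def g_runs(seq: str, min_len: int = 2):
    """List (start, length) for every maximal run of at least `min_len` Gs.

    `start` is 1-based, counted from the first base of the sequence as written.
    """
    pattern = re.compile("G{%d,}" % min_len)
    return [(m.start() + 1, len(m.group())) for m in pattern.finditer(seq)]


def single_g_spot(seq: str, min_len: int = 2):
    """Return (location, length) if the probe has exactly one such run, else None."""
    runs = g_runs(seq, min_len)
    return runs[0] if len(runs) == 1 else None


# The four probe sequences quoted in the text.
print(single_g_spot("GGGGGGATAGTCTTGTTTCTAGCTT"))  # (1, 6)   pm1 of 219297_at
print(single_g_spot("GAACTCCACTGCAACAGACGGGCGC"))  # (20, 3)  pm5 of 31846_at
print(single_g_spot("TTCCCACCTGTCATACTGGTAACTG"))  # (18, 2)  pm16 of 31846_at
print(single_g_spot("TCCTGGACTGAGAAAGGGGGTTCCT"))  # None     pm6 of 31846_at
```

Probe pm6 of 31846_at returns `None` under this rule because it contains two runs (GG at location 5 and GGGGG at location 16), so it would not enter the single-run subgroups even though it is a G-spot probe.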
The effect of the location of the G-spot
Table 1. In 4G-probes, the effect of the location of the G-spot on the average value of the correlation coefficient. Columns: location of G-spot, l; number of probes.
Table 1 shows distinct design preferences on the part of Affymetrix, since probes starting GGGG are relatively common (an unfortunate choice under the circumstances) whereas cases where the GGGG sequence straddles the central base of the probe (i.e. probes with the GGGG sequence commencing at one of locations 10 to 13) are relatively infrequent. For all values of l the average value of the correlation coefficient for pairs of probes with G-spot at l is significantly greater than zero, indicating the pervasive nature of these unwanted correlations. The overall maximum is at l = 1, corresponding to the G-spot being at the 5' end (the free end) of the probe.
The effect of the length of the G-spot
Table 2. Average values of the correlation coefficient between pairs of probes that each have their single sequence of k Gs starting with the first base. Columns: length of starting sequence, k; number of probes.
Correlation coefficients for probes having different locations for their G-spots
Table 3. Average value of the correlation coefficient between pairs of 4G-probes where one probe has its G-spot at location 1 and the other has its G-spot at location l. Columns: G-spot in 2nd probe, l.
Other types of array
It seemed unlikely that the effect was related in any way to the organism under investigation. To confirm this we analysed data from a set of ATH-121501 GeneChips (for Arabidopsis thaliana): as anticipated, the average value of the correlation coefficient between probes with four Gs at their free ends was very high (0.86).
The previous section has demonstrated that probes containing a G-spot of four or more bases are very likely to be highly correlated with many other probes not in their own probe set. The phenomenon is evidently not related to genetics, so that it is clear that the pragmatic solution is simply to eliminate G-spot probes from future array designs. However, we cannot resist making some suggestions concerning the possible causes of the G-spot effect. In particular, we believe the G-spot effect results from probe-probe interactions occurring on GeneChips.
The potential for the formation of G-quadruplexes
The high density of synthesis sites on the surface of Affymetrix GeneChips leads to crowded conditions on the array surface. Assuming a stepwise synthesis yield for probes of 95% per base and that the density of initiation sites for probe synthesis is 5 × 10^17 molecules/m^2, the average distance between full-length 25-mer probes is about 3 nm. As the lengths of the probes may be up to 22 nm, it is thus likely that probes can come into contact.
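The spacing estimate can be checked with a short calculation. The sketch below reads the initiation-site density as 5 × 10^17 molecules per square metre and makes one additional assumption that is not stated in the text: full-length probes are treated as sitting on a square lattice, so that the typical spacing is the inverse square root of their surface density.

```python
import math

stepwise_yield = 0.95   # per-base synthesis yield quoted in the text
probe_length = 25       # bases in a full-length probe
site_density = 5e17     # initiation sites per square metre

# Fraction of initiation sites that end up carrying a full-length probe.
full_length_fraction = stepwise_yield ** probe_length
full_length_density = site_density * full_length_fraction  # probes per m^2

# Assumption: square-lattice packing, so spacing = 1 / sqrt(density).
spacing_nm = 1e9 / math.sqrt(full_length_density)

print(f"full-length fraction: {full_length_fraction:.3f}")
print(f"mean spacing: {spacing_nm:.2f} nm")
```

Under these assumptions a little over a quarter of the sites carry a full-length probe and the spacing comes out slightly below 3 nm, consistent with the figure quoted above and far shorter than the 22 nm maximum probe length.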
The high density of probes results in considerable differences between the rates and efficiencies of hybridisation for probes in solution and for probes tethered to a surface. These differences may be due to electrostatic repulsion of the high charge density on arrays resulting from the phosphate backbones of the probes. The electrostatic effects act to reduce the stability of a probe-target duplex and it has been suggested that probe-probe associations involving only a few residues will be able to compete with the formation of probe-target duplexes. There have been initial attempts to model probe-probe duplexes. However, a full model is not computationally tractable and there are presently no theoretical results which describe under what conditions probe-probe interactions occur. We believe the co-ordinated behaviour of G-spot probes results not from a probe-probe dimer but from a higher-order binding of four DNA strands.
The Hoogsteen hydrogen-bonded guanine (G)-tetrad is a four-stranded DNA spiral stack held together by eight hydrogen bonds per level. Even G-quadruplexes formed by quite short runs of Gs along the four DNA strands can be thermally stable up to 90°C. G-quadruplexes are stabilised by sodium or potassium cations centrally placed between adjacent (G)-tetrads. The cations are thus close to four electronegative oxygens in the (G)-tetrad above and four more in the (G)-tetrad below, and act to reduce the repulsion of the oxygen atoms via the formation of cation-dipole interactions. We suggest that probes in close proximity which contain a run of four or more contiguous guanines may sometimes interact to form a G-quadruplex.
It has been argued that probes do not form G-quadruplexes on GeneChips because the probes are immobilised, and so it must be the targets that form quadruplexes which cause G-spot probes to show abnormal binding. However, since the probes are sufficiently close to each other, and attached via linkers, they have enough flexibility to interact closely. Moreover, because the probes run in parallel and contain identical sequences, we believe that this provides an ideal opportunity for G-quadruplexes to form where there are runs of contiguous guanines. The coherence between all G-spot probes leads us to suggest that the problem lies with the probes and the GeneChip technology rather than with the targets themselves, which are fragmented at random and so would not be expected to behave coherently.
Brightness and chip-to-chip variability of the G-spot probes
The formation of a G-quadruplex will result in four probes having their guanines facing inwards towards the quadruplex. Thus these bases will not be available to hybridise with targets. Yet probes starting with GGGG are on average about twice as bright as other strongly correlated probes whilst containing only an average number of Cs and Gs.
We suggest that the tendency of G-spot probes to be bright may be due to the nature of the hybridisation on the surface of GeneChips resulting from the high packing density of probes. Models of the hybridisation dynamics of surface-immobilised DNA show that as probes interact more strongly so the nucleation sites available are modified, with resulting changes in the hybridisation affinity related to the packing density of probes. When probes are further apart, the affinity between probe and target increases rapidly: the effective association rate is proportional to (probe density)^-1.8. We suggest that, on the surface of a chip, in a G-spot region, there will be a number of probes that form G-quadruplexes. The G-quadruplex acts to bind four probes together and these probes do not hybridise to the target. This means that the remaining probes have more space and will have increased target affinity due to a lower probe density (c.f. Figures 4 and 5 of ). Indeed the run of Gs on the remaining probes is available to act as an efficient nucleation site for hybridisation. This could encourage non-specific binding of labelled targets.
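To give a feel for the size of this effect, the sketch below evaluates the power-law scaling quoted above for a feature in which some fraction of the probes is locked into quadruplexes. It rests on a strong simplifying assumption that is ours for the purpose of illustration: probes bound in quadruplexes are treated as simply absent, so the effective probe density falls in proportion. The function name and the example fractions are hypothetical.

```python
EXPONENT = -1.8  # association rate scales as (probe density) ** -1.8


def relative_association_rate(fraction_in_quadruplexes: float) -> float:
    """Association rate of the free probes relative to a quadruplex-free feature.

    Assumes that probes locked in quadruplexes are removed from the
    effective probe density and play no further part in hybridisation.
    """
    if not 0.0 <= fraction_in_quadruplexes < 1.0:
        raise ValueError("fraction must lie in [0, 1)")
    remaining_density = 1.0 - fraction_in_quadruplexes
    return remaining_density ** EXPONENT


for fraction in (0.0, 0.1, 0.25, 0.5):
    rate = relative_association_rate(fraction)
    print(f"{fraction:.2f} of probes in quadruplexes -> rate x {rate:.2f}")
```

With this scaling, halving the effective density raises the association rate of the remaining probes by a factor of 2^1.8, roughly 3.5, so even a modest population of quadruplexes could plausibly make the surviving probes noticeably more receptive to target.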
Implications for the use of existing GeneChips
Our findings have several implications. The extent to which a particular 25-base sequence will form probe-probe interactions may depend upon a range of factors which vary from experiment to experiment. Thus probe-probe interactions need to be taken into account when modelling the affinity of the probe. We have detected the G-spot effect from studying the values of the correlation coefficient for pairs of probes. We have thereby identified thousands of genetically unrelated probes whose values change coherently from sample to sample. We suggest that there are one or more aspects of the preparation of each GeneChip and/or sample which may affect the extent of the formation of G-quadruplexes across the whole GeneChip. There are many things which affect the stability of quadruplexes. These include monovalent cations: potassium has a larger affinity for a quadruplex than sodium (although sodium is likely to be the dominant cation during hybridisation), whereas lithium acts to destabilise G-quadruplexes. Molecular crowding also helps to induce quadruplex formation (although we suggest this should be constant from chip to chip). Ethanol, which is used in the preparation of nucleic acids, has recently been shown to be a better inducer of quadruplexes than even potassium cations. Even the life-history of the chip, such as whether it has been stored at low or high temperatures or preheated prior to hybridisation, may alter the population of quadruplexes on the surface of the chip.
We have shown that probes containing G-spots are typically highly correlated with each other and uncorrelated with the other members of their own probesets. When one of these probes has a high intensity it is therefore likely that other G-spot probes have high intensities. Thus a high intensity for a G-spot probe cannot be regarded as evidence that its target gene is highly expressed. Of course the G-spot probe will stand out as an outlier if the other probes in its probeset give a contradictory impression and this will cause no problem. However, if a G-spot probe has a high value (because on this array G-spot probes happen to have high values) and the other probes in the probe set have high values (because the gene is well expressed) then the G-spot probe will not be excluded in calculations of overall gene expression, even though its value is not affected by the gene, but by the conditions under which the array was treated. It is important not to rely on outlier detection procedures to throw out misleading G-spot values. The truly misleading values are those that appear to tentatively support others in their probe set. Designers of future high-density oligonucleotide arrays need to avoid runs of contiguous guanines and any other such sequences that act to stabilise probe-probe interactions between pairs of otherwise unrelated probes. For existing designs, G-spot probes should be eliminated from consideration before the analysis commences.
During 2007 we downloaded CEL files from the NCBI Gene Expression Omnibus (GEO) repository. By the end of that year we had tens of thousands of CEL files, including 6685 examples of the most popular GeneChip produced by Affymetrix for the human genome: the HG-U133A array. These CEL files, from 162 separate GEO series (GSE) of experiments, were created between January 2002 and February 2006 and subsequently uploaded to GEO by many independent experimenters.
The next step was to create "heatmaps" (such as Fig. 1) illustrating the values of the correlation coefficients (in the log space) between all pairs of probes within each probe set. These were created using information from all the 6685 CEL files. Each CEL file was separately log normalised and potential spatial flaws identified. Firstly, to avoid problems with results being dominated by a few outliers, we excluded data for each probe if they were more than three standard deviations from the probe's mean. Secondly, we excluded not only data flagged as potentially part of a spatial flaw but also data within 60 μm of a spatial flaw. Even after this ultra-cautious treatment, we had many thousands of data for each of approximately half a million probes. The resulting 22 299 visualisations are at http://bioinformatics.essex.ac.uk/users/wlangdon/HG-U133A. Inspection of the heatmaps provided an efficient method for identifying probes that, despite having reasonable average magnitudes, had low values of the correlation coefficient when paired with other probes included in their subset.
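The outlier exclusion and correlation steps can be sketched as follows. This is a minimal standard-library illustration rather than the code used for the analysis: excluded data (outliers or values near a spatial flaw) are represented by `None`, the function names are our own, and the demonstration data are synthetic.

```python
import math
import random
from statistics import mean, pstdev


def mask_outliers(values, n_sd=3.0):
    """Return one probe's values with outliers replaced by None.

    `values` holds the probe's log intensity on each array; None marks data
    already excluded (for example, inside or near a spatial flaw).
    """
    present = [v for v in values if v is not None]
    centre, spread = mean(present), pstdev(present)
    return [v if v is not None and abs(v - centre) <= n_sd * spread else None
            for v in values]


def pearson(x, y):
    """Pearson correlation over the arrays where both probes have a value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    mx = mean(a for a, _ in pairs)
    my = mean(b for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(probes):
    """Correlation coefficient for every pair of probes (the heatmap values)."""
    cleaned = [mask_outliers(p) for p in probes]
    return [[pearson(a, b) for b in cleaned] for a in cleaned]


# Synthetic demonstration: three probes follow a shared signal, one does not.
rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(200)]
good = [[s + rng.gauss(0, 0.1) for s in signal] for _ in range(3)]
odd = [rng.gauss(0, 1) for _ in range(200)]
heatmap = correlation_matrix(good + [odd])
```

In the resulting matrix the three well-behaved probes correlate strongly with one another, while the row and column for the fourth probe stay near zero, which is the pattern that makes a misbehaving probe stand out in a heatmap.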
When calculating the correlation coefficient between probes containing runs of Gs, we first selected only PM and MM probes with a single sequence of 2 or more Gs. These were then divided into subgroups according to the length of the run of Gs and the location of the first G in the sequence. To avoid inflating the average correlation coefficient by including probes that would have been expected to be correlated in the absence of the G-spot effect, within each subgroup only the first probe in any probe set was used. Similarly, where probes have identical sequences, only one was included in the averages. For Tables 1, 2 and 3, all possible correlation coefficients between pairs of probes were calculated and averaged.
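The selection of representative probes and the averaging over all pairs might be sketched as below. The function names, the probe set identifiers and the short sequences are hypothetical, and one detail is our assumption: a probe rejected for duplicating an earlier sequence does not count as "the first probe" of its probe set, so a later probe from that set may still be used.

```python
from itertools import combinations


def select_representatives(probes):
    """Keep one probe per probe set and one per distinct sequence.

    `probes` is an ordered list of (probe_set_id, sequence) tuples, all from
    the same subgroup (same G-run length and location).
    """
    seen_sets, seen_sequences, kept = set(), set(), []
    for probe_set, sequence in probes:
        if probe_set in seen_sets or sequence in seen_sequences:
            continue
        seen_sets.add(probe_set)
        seen_sequences.add(sequence)
        kept.append((probe_set, sequence))
    return kept


def average_pairwise(items, correlation):
    """Average `correlation(a, b)` over all unordered pairs of `items`."""
    values = [correlation(a, b) for a, b in combinations(items, 2)]
    if not values:
        raise ValueError("need at least two items to form a pair")
    return sum(values) / len(values)


# Hypothetical subgroup: set_A contributes two probes, set_B repeats a sequence.
subgroup = [("set_A", "GGGGT"), ("set_A", "GGGGA"),
            ("set_B", "GGGGT"), ("set_C", "GGGGC")]
print(select_representatives(subgroup))
# [('set_A', 'GGGGT'), ('set_C', 'GGGGC')]
```

In practice `correlation` would look up or compute the Pearson coefficient between the intensity profiles of the two probes; it is passed in as a function here so that the selection logic stays independent of how the intensities are stored.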
WBL was funded through research grant BBE0017421 from the Biotechnology and Biological Sciences Research Council. We acknowledge the vital contribution of Dr Renata da Silva Camargo, who created the Essex archive and identified the single-exon subsets. We thank the reviewers of a previous version for their useful suggestions.
- Affymetrix: Microarray suite users guide. 2001, 5.
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences. 2001, 98: 31-36. 10.1073/pnas.011404098.
- Irizarry R, Bolstad B, Collin F, Cope L, Hobbs B, Speed T: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research. 2003, 31: e15. 10.1093/nar/gng015.
- Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F: A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association. 2004, 99 (468): 909-917. 10.1198/016214504000000683.
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles-database and tools update. Nucleic Acids Research. 2007, 35 (Database): D760-D765. 10.1093/nar/gkl887.
- Stalteri M, Harrison A: Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips. BMC Bioinformatics. 2007, 8: 13. 10.1186/1471-2105-8-13.
- Wu C, Zhao H, Baggerly K, Carta R, Zhang L: Short oligonucleotide probes containing G-stacks display abnormal binding affinity on Affymetrix microarrays. Bioinformatics. 2007, 23: 2566-2572. 10.1093/bioinformatics/btm271.
- Burge S, Parkinson G, Hazel P, Todd A, Neidle S: Quadruplex DNA: sequence topology and structure. Nucleic Acids Research. 2006, 34: 5402-5415. 10.1093/nar/gkl655.
- Glazer M, Fidanza J, McGall G, Trulson M, Forman J, Suseno A, Frank C: Kinetics of oligonucleotide hybridization to photolithographically patterned DNA arrays. Analytical Biochemistry. 2006, 358: 225-238. 10.1016/j.ab.2006.07.042.
- Burden C, Pittlekow Y, Wilson S: Adsorption models of hybridisation and post-hybridisation behaviour on oligonucleotide microarrays. Journal of Physics: Condensed Matter. 2006, 18: 5545-5565. 10.1088/0953-8984/18/23/024.
- Peterson A, Heaton R, Georgiadis R: The effect of surface probe density on DNA hybridization. Nucleic Acids Research. 2001, 29: 5163-5168. 10.1093/nar/29.24.5163.
- Vainrub A, Pettitt BM: Coulomb blockage of hybridization in two-dimensional DNA arrays. Physical Review E. 2002, 66: 041905. 10.1103/PhysRevE.66.041905.
- Forman J, Walton I, Stern D, Rava R, Trulson O: Thermodynamics of duplex formation and mismatch discrimination on photolithographically synthesised oligonucleotide arrays. Molecular Modeling of Nucleic Acids, ACS Symposium Series, Am Chem Soc. Edited by: Leontis N, SantaLucia J. 1998, 682: 206-228.
- Blume S, Guarcello V, Zacharias W, Miller D: Divalent transition metal cations counteract potassium-induced quadruplex assembly of oligo(dG) sequences. Nucleic Acids Research. 1997, 25: 617-625. 10.1093/nar/25.3.617.
- Hagan M, Chakraborty A: Hybridization dynamics of surface immobilised DNA. J Chem Phys. 2004, 120 (10): 4958-4968. 10.1063/1.1645786.
- Kan Z, Yao Y, Wang P, Li X, Hao Y, Tan Z: Molecular crowding induces telomere G-quadruplex formation under salt-deficient conditions and enhances its competition with duplex formation. Angew Chem Int Ed. 2006, 45: 1629. 10.1002/anie.200502960.
- Vorlickova M, Bednarova K, Kypr J: Ethanol is a better inducer of DNA guanine tetraplexes than potassium cations. Biopolymers. 2006, 82: 253-260. 10.1002/bip.20488.
- Langdon WB, Upton GJG, da Silva Camargo R, Harrison AP: A Survey of Spatial Defects in Homo Sapiens Affymetrix GeneChips. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2008, [To appear].
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.