Experimental optimization of probe length to increase the sequence specificity of high-density oligonucleotide microarrays
© Suzuki et al. 2007
Received: 26 April 2007
Accepted: 16 October 2007
Published: 16 October 2007
High-density oligonucleotide arrays are widely used for analysis of genome-wide expression and genetic variation. Affymetrix GeneChips - common high-density oligonucleotide arrays - contain perfect match (PM) and mismatch (MM) probes generated by changing a single nucleotide of the PMs, to estimate cross-hybridization. However, a fraction of MM probes exhibit larger signal intensities than PMs, when the difference in the amount of target specific hybridization between PM and MM probes is smaller than the variance in the amount of cross-hybridization. Thus, pairs of PM and MM probes with greater specificity for single nucleotide mismatches are desirable for accurate analysis.
To investigate the specificity for single nucleotide mismatches, we designed a custom array with probes of different length (14- to 25-mer) tethered to the surface of the array and all possible single nucleotide mismatches, and hybridized artificially synthesized 25-mer oligodeoxyribonucleotides as targets in bulk solution to avoid the effects of cross-hybridization. The results indicated that the finite availability of target molecules becomes apparent as the probe length increases. Due to this effect, the sequence specificity of the longer probes decreases, and this was also confirmed under the usual background conditions for transcriptome analysis.
Our study suggests that the optimal probe length for specificity is 19–21-mer. This conclusion will assist in improvement of microarray design for both transcriptome analysis and mutation screening.
High-density oligonucleotide microarrays allow analysis of the genome-wide expression of genes in living organisms and genome-wide screening for genetic variation and disease-causing mutations [2, 3]. The Affymetrix GeneChip system is one of the most commonly used high-density oligonucleotide microarray systems because each probe is synthesized at a precise location and millions of probes can be contained on an array. In the Affymetrix GeneChip system, the expression of each transcript is measured using a set of probe pairs, i.e., a perfect match (PM) probe that matches a fragment of the corresponding gene exactly and a mismatch (MM) probe containing a single nucleotide mismatch in the center. It is generally assumed that the MM probe provides a measure of cross-hybridization to the corresponding PM probe, and thus subtracting the signal intensities of MM probes from those of PM probes cancels the effect of cross-hybridization.
However, it has been pointed out that around 30% of probe pairs consistently give negative signals, which means that the difference between PM and MM probe intensity does not always reflect the true target amounts [5, 6]. This contradiction of PM and MM probe intensities is the main factor making expression analysis unreliable, especially when the target concentration is low. Such contradictions will occur when, for example, the difference in the amount of target specific hybridization between PM and MM probes is smaller than the variance in the amount of cross-hybridization. Therefore, to improve the measurement of target amounts using the pairs of PM and MM probes, one possible strategy is to enhance the specificity for single nucleotide mismatches, i.e., changes in signal intensity caused by a single nucleotide mismatch. In the present study, we focused on this discrimination capability of single nucleotide mismatches and performed evaluation using the signal intensity ratio of PM to MM probes. The enhancement of specificity for single nucleotide mismatches is not only required for the improvement of the original Affymetrix analysis method (MAS5.0) but is also useful for development of other analysis models, such as dChip or Robust Microarray Analysis (RMA) [8, 9], which do not make use of MM probes, as it will reduce noise from targets of similar sequence to the desired target sequence. The specificity is also important for analysis of single nucleotide polymorphisms (SNPs) using microarray technology [10, 11].
Several previous studies on microarray technology investigated the specificity for single nucleotide mismatches experimentally. For example, with regard to probe length, a previous study was performed using an oligonucleotide microarray where 25-, 30-, and 35-mers were printed on glass slides. In addition, previous studies investigated the dependence of specificity on the type of mismatched nucleotide and position of the mismatch [13, 14]. However, as these experimental studies were performed using samples spiked into the transcriptome, i.e., mixtures of thousands of transcripts, a certain amount of cross-hybridization is inevitable. Thus, in such analyses, quantification of a small difference in signal intensity between PM and MM probes can be difficult due to the presence of cross-hybridization, and thus evaluation of specificity for single nucleotide mismatches is difficult at low target concentrations.
In the present study, to quantify the specificity for single nucleotide mismatches, we (i) designed a set of artificial random 25-mer sequences, (ii) synthesized oligodeoxyribonucleotides of these random sequences as targets, and (iii) designed a custom microarray with PM probes completely matching the oligodeoxyribonucleotides and MM probes considering all possible single substitutions, i.e., all possible one-base substitutions for all possible positions. Using only artificially synthesized oligodeoxyribonucleotides as targets allows us to quantify the absolute signal intensity without the effect of cross-hybridization and then to evaluate the specificity for single nucleotide mismatches even when the applied target concentration is low. Another advantage of the use of oligodeoxyribonucleotides as targets is that we can analyze the specificity of single nucleotide mismatches without effects of target variation, such as variations in target length due to random fragmentation [15, 16]. Furthermore, to evaluate the effects of probe length and position of mismatch on the hybridization behavior, we designed a custom microarray with PM and MM probes of several different lengths from 14- to 25-mer and all possible single mismatches.
Using this custom array, we investigated how the specificity for single nucleotide mismatches depends on the probe length and mismatch position. Our results indicated that, under standard hybridization conditions, the specificity for single nucleotide mismatches becomes maximal at 19–21-mer, which is shorter than the length used on popular high-density oligonucleotide microarrays. With regard to the mismatch position, we confirmed that the specificity with a single nucleotide mismatch decreases at both ends of the probe, as reported previously [13, 14, 17, 18]. In these analyses, as the conditions without the source of cross-hybridization are quite different from those of standard microarray analysis, we performed the experiments with the source of cross-hybridization by adding a mixture of cDNAs generated from Escherichia coli total RNA. The same results were obtained, which indicated the possibility of improving measurements of gene expression and genome sequence by microarray analysis by reducing the probe length.
To investigate comprehensively the effects of probe length and mismatch position on target-specific hybridization on a high-density oligonucleotide array, we designed a custom array on which a number of probes were arranged by length, mismatch position, and type of mismatched nucleotide, using the Maskless Array Synthesizer platform with the Affymetrix NimbleExpress program [19, 20].
For each basic 25-mer probe described above, we generated various probes by changing their length, mismatch position, and type of mismatched nucleotide as follows. First, we shortened the probe one nucleotide at a time from 25- to 14-mer at the 5' end, to investigate the optimum probe length (Fig. 1 middle sequences). Second, we arranged 3 + 3n probes for each n-mer length. The first three probes were identical and complementary to the target sequences, and correspond to perfect match (PM) probes. The remaining 3n probes covered all possible single substitutions, i.e., every mismatch position and every type of mismatched nucleotide (Fig. 1 right sequences). Note that the design of mismatch (MM) probes on this array is not equivalent to that of the Affymetrix catalog GeneChip; we use the term MM for all probes with a nucleotide substitution. Collectively, there were 738 probes of various sequences (36 PMs and 702 MMs) for each basic 25-mer probe sequence. Sequences of all probes are given in additional file 1.
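The probe counts follow directly from the 3 + 3n rule summed over the twelve lengths from 14- to 25-mer. The following is a minimal sketch of that enumeration, not the authors' design script. It assumes the probe is written 5' to 3' so that shortening at the 5' end drops leading characters, and the example sequence and the function name design_probes are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def design_probes(basic_probe, min_len=14):
    """Return (length, kind, sequence) tuples for one basic probe.

    Assumption: the sequence is written 5'->3', so shortening at the
    5' end removes characters from the start of the string.
    """
    probes = []
    full_len = len(basic_probe)
    for n in range(full_len, min_len - 1, -1):
        pm = basic_probe[full_len - n:]  # n-mer after 5' truncation
        # Three identical perfect-match copies per length.
        probes.extend((n, "PM", pm) for _ in range(3))
        # All 3n single substitutions: every position, every other base.
        for pos, base in product(range(n), BASES):
            if base != pm[pos]:
                mm = pm[:pos] + base + pm[pos + 1:]
                probes.append((n, "MM", mm))
    return probes

example = "ACGTTGCAAGCTTAGGCATCGATCC"  # hypothetical 25-mer, not from the paper
probes = design_probes(example)
n_pm = sum(1 for _, kind, _ in probes if kind == "PM")
n_mm = sum(1 for _, kind, _ in probes if kind == "MM")
print(len(probes), n_pm, n_mm)  # 738 36 702
```

The totals reproduce the figures in the text: 3 × 12 = 36 PM probes and 3 × (14 + 15 + ... + 25) = 702 MM probes.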
To characterize the absolute signal intensities of PM and MM probes, artificially synthesized 25-mer oligodeoxyribonucleotides complementary to the PM probes were applied to the custom array. The target oligodeoxyribonucleotides were applied in tenfold dilutions such that the target oligodeoxyribonucleotides would yield final concentrations of 1.4 nM to 1.4 fM. The use of synthesized oligodeoxyribonucleotides only as hybridization target enabled us to analyze the precise behavior of hybridization without the effect of cross-hybridization. In fact, the signal intensities of PM probes to which no complementary target oligodeoxyribonucleotides were applied were much lower and negligible compared to those of PM probes to which the targets were added in all ranges of oligodeoxyribonucleotides concentration used (data not shown).
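For orientation, the tenfold dilution series can be written out and converted to molecule numbers. This is only an illustrative sketch: the hybridization volume is not stated in this section, so the 200 µL used below is a hypothetical placeholder, and the function names are chosen for illustration.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def dilution_series(top_molar=1.4e-9, steps=7, factor=10.0):
    """Concentrations in mol/L, from top_molar downward in factor-fold steps."""
    return [top_molar / factor**k for k in range(steps)]

def molecules(conc_molar, volume_litre):
    """Number of molecules at a given molar concentration and volume."""
    return conc_molar * volume_litre * AVOGADRO

# Hypothetical hybridization volume; NOT given in the text.
ASSUMED_VOLUME_L = 200e-6

series = dilution_series()  # 1.4 nM down to 1.4 fM, seven concentrations
for c in series:
    n = molecules(c, ASSUMED_VOLUME_L)
    print(f"{c:.1e} M -> {n:.1e} target molecules")
```

Changing ASSUMED_VOLUME_L rescales every molecule count proportionally, so the numbers should be read only as orders of magnitude.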
The MM probe was designed to quantify the signal intensities of cross-hybridization embedded within the PM signal. In the standard protocol of the Affymetrix GeneChip system, to measure the amounts of complementary DNA/RNA molecules hybridized to PM probes, the MM probe signal is subtracted from that of the PM probe to compensate for cross-hybridization. The underlying concept of the MM probe is that the amount of cross-hybridization to the PM and MM probe pair is nearly identical, whereas the specific hybridization between intact target and MM probe is expected to be lower due to the mismatched base pairing. Therefore, in this procedure, a pair of PM and MM probes with high specificity for single nucleotide mismatch is desirable, as a small difference in intensity between PM and MM probes can easily be overwhelmed by experimental error and by the difference in the amount of cross-hybridization between the PM and MM probe.
Effects of type of nucleotide mismatch on the PM/MM ratio
In this study, the results showed that the saturation of intensities occurs even in the low target concentration range of 14 fM and 140 fM, probably due to the finite availability of target molecules in the hybridization solution. It was described in the Expression Analysis Technical Manual (Affymetrix, 2004) that the number of probe molecules contained in each probe cell is on the order of 10^6, which is approximately comparable to that of the target molecules at about 10 fM. In this study, the finite availability of target molecules was observed even at 1.4 pM. These observations suggest that the effective amounts of target molecules are decreased due to nonspecific hybridization to other probes, non-biospecific adsorption on the array surface, competitive hybridization between the probes that share the same target, etc. Thus, the effect of finite availability cannot be neglected when we measure the target of low concentration quantitatively. The results indicated that, in this microarray system, as long as the analysis is based on the intensity change by single nucleotide mismatch, the longer probes (>23-mer) are not suitable for accurate measurement of cDNA/cRNA amount due to their low specificity. On the other hand, when we use shorter probes for genome-wide analysis, it should be taken into account that probes that are too short may match too many subsequences in the genome. To check the uniqueness of an oligodeoxyribonucleotide sequence within the whole genome sequence, we sampled random subsequences from the E. coli genome, and searched for their matching alignments throughout the genome. When we changed the length of sample subsequences from 14- to 25-mer, both the number of matching copies and the probability that a probe sequence is not unique increased markedly below 15-mer, while more than 97% of sequences were unique at longer than 18-mer (data not shown).
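A uniqueness check of this kind can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions, not the procedure actually used: it counts exact matches on a single strand only (no reverse complement, no mismatch-tolerant alignment), and it runs on a random toy sequence rather than the E. coli genome. The function name unique_fraction is hypothetical.

```python
import random
from collections import Counter

def unique_fraction(genome, k, n_samples=1000, seed=0):
    """Fraction of randomly sampled k-mers that occur exactly once in genome.

    Simplification: exact matches on the given strand only.
    """
    n_pos = len(genome) - k + 1
    # Count every k-mer in the sequence once, then look samples up.
    counts = Counter(genome[i:i + k] for i in range(n_pos))
    rng = random.Random(seed)
    starts = [rng.randrange(n_pos) for _ in range(n_samples)]
    return sum(counts[genome[s:s + k]] == 1 for s in starts) / n_samples

# Toy stand-in for a genome: 50,000 random bases (not real sequence data).
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(50_000))

for k in (6, 8, 14, 25):
    print(k, unique_fraction(toy_genome, k))
```

On a real genome the k-mer table for long subsequences can be large, so a suffix array or an alignment tool would be a more practical choice than an in-memory counter.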
Thus, the optimal probe length for the specificity for single nucleotide mismatches is 19–21-mer, which was also supported by the probability of reversal of PM and MM probe intensity p(IPM < IMM) as shown in Figure 8. The observation that optimization of probe length had a marked impact on the specificity for single nucleotide mismatches is important to improve probe design for accurate analysis of gene expression and genetic variation on microarrays. Of course, the experimental conditions in the present study were different in several respects from those of typical microarray experiments. For example, in standard genome-wide analysis of gene expression, cRNA targets into which biotinylated ribonucleotides are incorporated are randomly fragmented to 50–200 bases in length. That is, the standard target samples have dangling ends varying in both length and sequence not hybridized to the probes. Recently, it has been reported that the dangling ends reduce the specificity of probe-target hybridization. This report suggests that the fragmentation pattern has an influence on the accuracy of typical GeneChip analysis. Therefore, further studies are necessary to optimize probe length for accurate measurement of DNA/RNA amounts under the standard conditions used in genome-wide analyses of gene expression and genetic variation.
Cross-hybridization is problematic for GeneChip analyses because it adds background intensity, which is not related to the true amounts of target DNA/RNA. Although MM probes are provided on GeneChips to evaluate the amount of cross-hybridization, GeneChip analyses have shown that a number of MM probes possess greater fluorescence intensity than their cognate PM probes [5, 6]. A previous study indicated that the reversal of PM and MM probe intensities was due to cross-hybridization. As shown in Figure 8, the probability p(IPM < IMM) increases significantly as the target oligodeoxyribonucleotide concentration decreases. Especially at the lowest target concentration of 1.4 fM, the probability reached 0.5. These results also suggested that the increase in probability was caused by cross-hybridization, because the relative amount of cross-hybridization increases with decreasing target concentration. These findings clearly indicated that the use of MM probes for assessment of cross-hybridization is unreliable. Therefore, data analyses have been carried out without using the signal intensities of MM probes in Robust Microarray Analysis (RMA), which is one of the most commonly used algorithms for GeneChip systems [8, 9]. On the other hand, our findings suggested that a well-designed probe would enable us to make efficient use of MM probes in GeneChip data analysis. Thus, it would be possible to achieve further improvement of the algorithms for GeneChip systems.
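One simple way to estimate a reversal probability such as p(IPM < IMM) is as the fraction of PM/MM probe pairs in which the MM intensity exceeds the PM intensity. The sketch below illustrates that estimator only; how the pairs were grouped by probe length and concentration for Figure 8 is not described in this section, and the intensity values are made up for illustration.

```python
def reversal_probability(pm, mm):
    """Fraction of probe pairs whose PM intensity is below the MM intensity."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p < m for p, m in zip(pm, mm)) / len(pm)

# Made-up intensities for four probe pairs (illustration only).
pm_intensity = [1200.0, 340.0, 95.0, 80.0]
mm_intensity = [300.0, 360.0, 60.0, 110.0]

print(reversal_probability(pm_intensity, mm_intensity))  # 0.5
```

A value near 0.5 means PM and MM intensities are effectively indistinguishable, which is the situation reported at the lowest target concentration.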
As shown in Figure 4, we found that the PM/MM ratios of relatively long (22- and 24-mer) probes were slightly greater at positions slightly out-of-center, particularly on the 5' side, than at the center. Although the 5' bias was thought to be due to the array surface, the observation that slightly out-of-center mismatches provide better specificity than those at the center is puzzling. Although several factors may influence this observation, such as steric hindrance and synthetic errors of probes, one possible cause of this phenomenon may be a cooperative relationship of instabilities between the ends of probes and the mismatched base pair. That is, each unstable end and mismatched base pair may destabilize hybridization of three or four neighboring base pairs, preventing hybridization of whole base pairs between the end and mismatched position. Although this speculation is also supported by the observation that the PM/MM ratios of the mismatched positions 6–8 from the end showed the best specificity with most probe lengths, further studies are necessary to understand the effects of the mismatched position on hybridization behavior on microarrays.
In the present study, we used the standard conditions for hybridization and washing processes generally used in the GeneChip system and did not address the effects of hybridization temperature and stringency. It has been shown that hybridization conditions, such as temperature, time [35, 36], washing stringency [37, 38], and additives such as dimethylsulfoxide and formamide in the hybridization solution, affect the specificity for single nucleotide mismatch. Certainly, a change in hybridization conditions can result in a change in optimal probe length on a high-density oligonucleotide array. An important point of our study was that longer probes that show strong hybridization can cause saturation of signal intensities in the range of both high and low target concentrations, which results in low specificity for single nucleotide mismatch. This factor of saturation should be considered in the optimization process of microarray analysis by changing hybridization conditions.
We designed a custom array on which probes were arranged according to length, mismatch position, and type of mismatched nucleotide to investigate the specificity for single nucleotide mismatch, which is important both for gene expression analysis and mutation analysis in microarray experiments. We applied only oligodeoxyribonucleotides as targets to characterize the probe-target specific hybridization without the effects of cross-hybridization. Using this custom array, we investigated empirically how the specificity for single nucleotide mismatch depends on the probe length and mismatch position by measuring both PM and MM intensities. The results showed that the specificity for single nucleotide mismatch is generally lower in the case of relatively longer probes (23- to 25-mer) than shorter probes. This is due to saturation of signal intensities in the case of such longer probes in all ranges of applied target oligodeoxyribonucleotide concentration.
The dependency of specificity on the position of the mismatch in the probe will allow us to improve the existing position-dependent nearest neighbor model for more precise estimation of binding affinity. Carlon and Heim proposed a thermodynamic theoretical model of oligonucleotide hybridization and explained the behavior of MM probes by taking into account the mismatch penalty on binding free energy. We are now extending the existing nearest neighbor model on microarray to explain these behaviors more correctly. In addition, we also observed that target-target interaction would reduce the concentration of free target in solution. Although it is well known that cross-hybridization to probe increases the signal intensity, it should be considered that cross-hybridization to target can reduce the signal intensity. The adapted hybridization model on microarray and detailed analysis based on this model will be published shortly. Further investigation along this line will facilitate a fundamental understanding of the behavior of microarray probes and provide a promising method to improve the precision of measurement of gene expression levels.
One hundred fifty 25-mer oligodeoxyribonucleotides were synthesized with sequences complementary to the probes of random sequences set as targets. The sequences of the 150 oligodeoxyribonucleotides are given in additional file 7. Titration experiments were performed in three technical replicates using three different biotin-labeled target oligodeoxyribonucleotides separately prepared for each sample. This helped minimize technical noise associated with oligodeoxyribonucleotide labeling efficiency. Although the methods for target preparation described in the Expression Analysis Technical Manual (Affymetrix, 2004) were followed, each method for target preparation differed slightly from the others. Briefly, for data set 1, aliquots of 100 pmol of each synthetic oligodeoxyribonucleotide target were labeled independently at the 3' end with 0.3 mM GeneChip DNA Labeling Reagent (Affymetrix) using 60 U of Terminal Deoxynucleotidyl Transferase, Recombinant (TdT; Promega, Madison, WI) at 37°C for 1 h. After TdT had been stopped by addition of EDTA to a final concentration of 9.6 mM, the 150 labeled oligodeoxyribonucleotides were mixed. As the results derived from data set 1 showed that the signal intensities corresponding to four target oligodeoxyribonucleotides, 012, 064, 072, and 091, were anomalous, we re-synthesized these four oligodeoxyribonucleotides. For data set 2, we replaced the anomalous target oligodeoxyribonucleotides with new oligodeoxyribonucleotides and labeled 150 target oligodeoxyribonucleotides as described above. For data set 3, 150 synthesized oligodeoxyribonucleotide targets were mixed before terminal labeling. The total of 100 pmol of 150 mixed targets (0.67 pmol each) was labeled as described above.
Use of the Terminal deoxynucleotidyl Transferase (TdT) end labeling method canceled out fluctuations in labeling efficiency depending on the target sequences caused by in vitro transcription using biotinylated UTP and/or CTP, because the activity of TdT does not depend on the sequence of the target [41, 42]. Therefore, the efficiency of labeling was the same among the target oligodeoxyribonucleotides.
For all experiments that included background cDNA as a source of cross-hybridization, aliquots of 10 μg of E. coli total RNA were used. Briefly, E. coli K-12 strain W3110 was grown overnight with shaking at 37°C in 5 ml of liquid Luria-Bertani medium. To maintain logarithmic growth, the overnight cultures were diluted to an optical density at 600 nm of 0.05 in 5 ml of fresh liquid Luria-Bertani medium. Then, cultures were grown with shaking at 37°C to an optical density at 600 nm of 0.8. Cells were harvested by centrifugation and stored at -80°C prior to RNA extraction. Total RNA was isolated and purified from cells using an RNeasy mini kit with on-column DNA digestion (Qiagen, Hilden, Germany) in accordance with the manufacturer's instructions. For preparation of cDNA background samples, standard methods for cDNA synthesis, fragmentation, and end-terminus biotin labeling were carried out in accordance with the Affymetrix protocols. Titration experiments with cDNA background were performed in duplicate using different biotin-labeled target oligodeoxyribonucleotides and cDNA background prepared separately for each sample.
Hybridization, washing, staining, and scanning were carried out according to the Expression Analysis Technical Manual (Affymetrix). Briefly, the 150 labeled target oligodeoxyribonucleotides were diluted in hybridization cocktail containing 1× manufacturer's recommended buffer (100 mM MES, 1 M NaCl, 20 mM EDTA, and 0.01% Tween-20), 50 pM B2 Control Oligo, 0.1 mg/mL herring sperm DNA, and 0.5 mg/mL BSA, such that each labeled target oligodeoxyribonucleotide would yield final concentrations ranging from 1.4 fM to 1.4 nM in tenfold dilutions. In the experiment that included cDNA background, 3 μg of labeled cDNA was added to the hybridization cocktail. The labeled and diluted target oligodeoxyribonucleotide samples with or without background cDNA were hybridized to our custom microarrays at 45°C for 16 h in a Hybridization Oven 640 (Affymetrix) set at 60 rpm under standard conditions. After hybridization, a Fluidics Station 450 (Affymetrix) was used for the washing and staining procedures with ProkGE_WS2_450 fluidics script (Affymetrix) under standard conditions. Following washing and staining, the arrays were scanned using a GeneChip Scanner 3000 (Affymetrix). Absolute signal intensities of every probe in every sample were generated using GCOS 1.0 software (Affymetrix). The raw signal intensities of all probes for each experiment are given in additional file 8.
The extracted GeneChip data were analyzed using R software. The signal intensities were highly reproducible among the three replicates (correlations were 0.98–0.99). The signal levels were computed by taking the arithmetic mean of the three replicates on a log scale.
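Taking the arithmetic mean on a log scale is equivalent to taking the logarithm of the geometric mean of the raw intensities. The short sketch below illustrates this in Python rather than R; the log base is not stated in the text, so base 2 is an assumption, and the replicate values are invented.

```python
import math

def log_scale_mean(intensities, base=2.0):
    """Arithmetic mean of log-transformed intensities (base is an assumption)."""
    logs = [math.log(x, base) for x in intensities]
    return sum(logs) / len(logs)

# Invented raw intensities of one probe in three replicates.
replicates = [100.0, 200.0, 400.0]

level = log_scale_mean(replicates)
geometric_mean = 2.0 ** level  # back-transform to the raw intensity scale
print(level, geometric_mean)
```

Averaging on the log scale damps the influence of a single unusually bright replicate compared with averaging the raw intensities.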
This work was supported in part by Grants-in-Aid (nos. 15657035, 15207020, and 15013235) from the Japan Society for the Promotion of Science, "The 21st Century Center of Excellence Program", "Special Coordination Funds for Promoting Science and Technology: Yuragi Project" and "Global COE (Centers of Excellence) Program" of the Ministry of Education, Culture, Sports, Science, and Technology, Japan.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.