Sequence polymorphism can produce serious artefacts in real-time PCR assays: hard lessons from Pacific oysters

Background Since it was first described in the mid-1990s, quantitative real time PCR (Q-PCR) has been widely used in many fields of biomedical research and molecular diagnostics. This method is routinely used to validate whole transcriptome analyses such as DNA microarrays, suppressive subtractive hybridization (SSH) or differential display techniques such as cDNA-AFLP (Amplified Fragment Length Polymorphism). Despite efforts to optimize the methodology, misleading results are still possible, even when standard optimization approaches are followed.
Results As part of a larger project aimed at elucidating transcriptome-level responses of Pacific oysters (Crassostrea gigas) to various environmental stressors, we used microarrays and cDNA-AFLP to identify Expressed Sequence Tag (EST) fragments that are differentially expressed in response to bacterial challenge in two heat shock tolerant and two heat shock sensitive full-sib oyster families. We then designed primers for these differentially expressed ESTs in order to validate the results using Q-PCR. For two of these ESTs we tested fourteen primer pairs each and, using standard optimization methods (i.e. melt-curve analysis to ensure amplification of a single product), determined that six and nine pairs, respectively, amplified a single product and were thus acceptable for further testing. However, when we used these primers, we obtained different statistical outcomes among primer pairs, raising unexpected but serious questions about their reliability. We hypothesize that, as a consequence of high levels of sequence polymorphism in Pacific oysters, Q-PCR amplification is sub-optimal in some individuals because sequence variants in priming sites result in poor primer binding and amplification. This issue is similar to the high frequency of null alleles observed for microsatellite markers in Pacific oysters.
Conclusion This study highlights potential difficulties in using Q-PCR as a validation tool for transcriptome analysis in the presence of sequence polymorphism and emphasizes the need for extreme caution and thorough primer testing when assaying genetically diverse biological materials such as Pacific oysters. Our findings suggest that melt-curve analysis alone may not be sufficient as a means of identifying acceptable Q-PCR primers. Minimally, testing numerous primer pairs seems to be necessary to avoid false conclusions from flawed Q-PCR assays for which sequence variation among individuals produces artefactual and unreliable quantitative results.


Background
During the last decade, quantitative real time PCR (Q-PCR) has been widely employed in many fields of biological research (medicine, biotechnology, microbiology) and is considered to be the most sensitive and reliable method of quantifying mRNA transcripts [1]. In contrast to more traditional methods using image analysis to measure band intensity on gels and thus quantify PCR products at the final phase of the reaction, real time PCR exploits the kinetics of the PCR reaction [2], specifically the exponential phase of amplification during which the amount of the PCR product is theoretically proportional to the initial quantity of template [3]. Fluorescent reporter dyes and/or gene-specific probes allow for the detection and quantification of cDNA amplicons produced during each Q-PCR cycle. By either assuming perfect amplification efficiency in the reaction, or alternatively estimating amplification efficiency empirically from the data, it is possible to estimate with accuracy the concentration of the targeted nucleic acid sequence in the initial sample.
As Q-PCR technology has evolved and its use expanded, diverse protocols using chemistry ranging from non-specific reporter dyes to sequence specific probes and diverse instrumentation have been developed [4,5]. The specific chemistry and quality of the reaction components play an important role in optimizing Q-PCR reactions, underlining the requirement for critical evaluation in order to overcome subjectivity inherent to the Q-PCR assay [6]. As a consequence, Q-PCR can be a somewhat "fragile" assay because its accuracy depends on numerous factors such as template preparation [7], reagents [8,9], operator influence [8] and the mathematical/statistical validation procedure(s) used [10,11]. Furthermore, because the signal is exponential and its kinetics are typically reduced to a single number (Ct, the cycle number at which sample fluorescence exceeds a chosen threshold above background fluorescence) that is used as an exponent in the estimation procedure, rigorous optimization of Q-PCR assays is especially critical. Even seemingly minor errors and artefacts are greatly magnified by exponentiation.
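To make the effect of exponentiation concrete, the following minimal sketch back-calculates a relative starting quantity from a Ct value under the usual exponential model, threshold = N0 × (1 + E)^Ct. The function name, the threshold of 1.0 and all numbers are illustrative assumptions and are not taken from this study.

```python
def relative_start_quantity(ct, efficiency, threshold=1.0):
    """Back-calculate starting template (arbitrary units) from a Ct value.

    Assumes pure exponential amplification up to the threshold:
    threshold = N0 * (1 + E) ** Ct, where E is the per-cycle efficiency
    (E = 1.0 means perfect doubling at every cycle).
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return threshold / (1.0 + efficiency) ** ct


# Illustrative values only (not data from this study).
n0_reference = relative_start_quantity(ct=28.0, efficiency=1.0)

# Effect of a 0.5-cycle error in Ct at perfect efficiency.
n0_ct_shift = relative_start_quantity(ct=28.5, efficiency=1.0)
ct_error_fold = n0_reference / n0_ct_shift

# Effect of assuming E = 1.0 when the true efficiency is 0.9 at Ct = 28.
n0_true = relative_start_quantity(ct=28.0, efficiency=0.9)
efficiency_error_fold = n0_true / n0_reference

print(f"0.5-cycle Ct error: {ct_error_fold:.2f}-fold change in estimate")
print(f"E assumed 1.0, true 0.9: {efficiency_error_fold:.1f}-fold change")
```

In this hypothetical case, half a cycle changes the estimate by a factor of about 1.4, and a modest error in the assumed efficiency changes it several-fold, because the efficiency term is raised to the power of Ct.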
Numerous studies have examined the potential problems and pitfalls of Q-PCR assays [6,8,12]; however, the influence of the primer (or probe) design on the accuracy of the assay has been directly addressed only rarely. While it is known that regions of low-complexity sequence can create problems for designing primer and probe sequences specific to the target sequence [13], the influence of polymorphism within the targeted sequence has received little attention, even though this is particularly important when Q-PCR is used to complement and validate whole transcriptome analyses, such as differential display, suppressive subtractive hybridization (SSH) or cDNA-AFLP (complementary DNA Amplified Fragment Length Polymorphism). In these applications, Q-PCR assays generally target relatively short sequences, ranging from approximately 100 to 800 bp. In some cases, template sequence discrepancies or inaccuracies can lead to failed assays caused by poor or no binding of primers and probes and/or non-specific binding resulting in multiple PCR products. It is therefore critically important to verify the targeted sequence and to check for the presence of polymorphisms in the biological material under study. Unfortunately, one of the attractions of whole transcriptome analyses such as SSH or cDNA-AFLP is that they are designed for genome-wide expression analysis with no prior sequence information required, making this step difficult or even impossible in non-model organisms. Furthermore, even though DNA microarrays normally use known EST sequences, typically only in model organisms is sufficient sequence information available to examine levels of polymorphism, although this is rapidly improving as more sequence information becomes available for non-model organisms.
In this study, we report on how sequence polymorphism impacts Q-PCR assays based on cDNA-AFLP analyses of mRNA transcription in Crassostrea gigas, a marine bivalve known for its high level of genetic variability [14,15]. Unlike SSH, cDNA-AFLP can be used directly for quantitative detection because the intensity of each fragment on a gel theoretically reflects the expression level of the gene [16]. However, Q-PCR is a valuable method to support the trends observed with cDNA-AFLP, especially since false positives are likely to occur using cDNA-AFLP.
We evaluated the expression of one EST [GenBank: EX956386] taken from a cDNA-AFLP library (Taris, unpublished data), and one EST taken directly from GenBank [GenBank: AJ565694]. We used Q-PCR to quantify the expression levels of these two ESTs. We designed and evaluated 14 primer pairs for each EST sequence and then used the 6 and 9 primer pairs, respectively, that melt-curve analysis indicated were suitable for Q-PCR analysis. Results are discussed in light of the impacts of sequence polymorphism on the results of Q-PCR quantification assays.

Biological material
We exposed fifty individuals from each full-sib family from a 50-family cohort of full-sib Pacific oysters to heat shock (43°C, 1 h) and subsequent starvation at ambient temperature and monitored their survival for 8 days post heat shock during November 2005. Based upon the percentage that survived following this stress challenge, we classified the families as either high surviving (H) or low surviving (L). We then chose four of the most extreme families (two with high and two with low survival) for further study. Sibs of the tested animals from these extreme families were over-wintered in flow-through seawater troughs to minimize the effect of estuarine environment on stress responses, and transcriptome analyses were conducted in summer 2006.

Experimental design
Heat shock consisted of immersing twelve two-year-old oysters from each of the four families in sea water at 40°C for 1 h. Oysters were then returned to 17°C sea water in flow-through tanks. We collected gill tissue 6 h after the shock from six randomly chosen oysters per family.

RNA extraction
We extracted total RNA from gill tissue using the RNeasy Mini Kit (QIAGEN) according to the manufacturer's instructions. Pieces of gill (~30 mg) were excised and disrupted in 700 μl of RLT buffer (QIAGEN). Samples were treated with DNase I (QIAGEN, RNase-Free DNase Set). We quantified RNA by measuring absorbance using a NanoDrop® ND-1000 UV-vis spectrophotometer (NanoDrop Technologies). First-strand cDNA was synthesized from 1 μg of total RNA template using random hexamers according to the high capacity cDNA archive kit (Applied Biosystems).
For primer testing, we pooled equal cDNA sub-samples from individual oysters from each family (6 individuals/family) and used 10 ng of this pooled cDNA in each Q-PCR reaction. For each pool, Q-PCR assays were performed in triplicate using SYBR® Green PCR Master Mix (Applied Biosystems) in 25 μl reactions containing cDNA (diluted in 5 μl) and 50 nM (final concentration) of each primer. Each Q-PCR reaction plate included a non-template negative control to ensure the absence of contamination, and the data were normalized using Elongation factor 1 α as the reference gene. In the normalization, E represents the empirically determined efficiency estimated for each reaction using LinRegPCR software [18]. Options selected to fit the window-of-linearity were a number of data points between five and six and the best correlation coefficient.
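The sketch below shows one widely used efficiency-corrected way of expressing a target level relative to a reference gene. It is an assumed form chosen for illustration and may differ from the exact expression used in this study. The function names, Ct values and efficiencies are hypothetical.

```python
def normalized_expression(ct_target, e_target, ct_reference, e_reference):
    """Target cDNA level relative to a reference gene.

    Assumed form (not confirmed as the authors' formula):
        (1 + E_ref) ** Ct_ref / (1 + E_target) ** Ct_target
    where each E is a per-reaction efficiency with 0 < E <= 1.
    """
    for e in (e_target, e_reference):
        if not 0.0 < e <= 1.0:
            raise ValueError("efficiencies must be in (0, 1]")
    return (1.0 + e_reference) ** ct_reference / (1.0 + e_target) ** ct_target


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Hypothetical triplicate for one pooled family sample.
target_cts = [28.1, 28.0, 27.9]
reference_cts = [20.0, 20.1, 19.9]
levels = [
    normalized_expression(t, 0.92, r, 0.95)
    for t, r in zip(target_cts, reference_cts)
]
print(f"mean relative level: {mean(levels):.3e}")
```

Using a separate efficiency for each reaction, rather than assuming perfect doubling, keeps differences in amplification efficiency between the target and the reference from being silently absorbed into the expression estimate.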

Statistical analyses
The level of cDNA (relative to the reference gene) was analyzed for significant differences between families using Proc GLM [17]. The model was as follows:

$Y_{ij} = \mu + fam_i + rep_j(fam_i) + \varepsilon_{ij}$

where $Y_{ij}$ is the dependent variable (Ct values), $\mu$ is the overall mean, $fam_i$ is the family effect, $rep_j(fam_i)$ is the replicate effect nested within family and $\varepsilon_{ij}$ is the residual error. The analysis of variance was followed by Tukey's multiple comparison procedure whenever a family effect was significant. Significance was assumed for P < 0.05.
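As a simplified illustration of the family comparison, the following sketch computes the F statistic of a one-way fixed-effect analysis of variance. It deliberately omits the nested replicate term and the Tukey follow-up, both of which would be handled by a statistics package in practice. The function name and the family values are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way fixed-effect ANOVA.

    `groups` is a list of lists, one list of observations per family.
    Assumes at least two groups and non-zero within-group variation.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of family means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own family mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical relative cDNA levels for four families (triplicates).
families = {
    "family_a": [1.10, 1.05, 1.15],
    "family_b": [0.40, 0.45, 0.38],
    "family_c": [0.42, 0.39, 0.44],
    "family_d": [0.41, 0.43, 0.40],
}
f_stat, df_b, df_w = one_way_anova_f(list(families.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

The resulting F value would then be compared with the F distribution on the reported degrees of freedom to obtain a P value.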

Results
Out of the 14 primer pairs tested per EST, 6 and 9 for [EX956386] and [AJ565694] respectively showed a single product in the melt curve analysis and were thus considered to be worth further consideration and testing. All primer pairs that produced multiple products were eliminated from further consideration. Melt curve analyses, raw data, and statistical outcomes are summarized in figures 2, 3 and 4. We found statistically significant family effects for all primer pairs used, but no significant variation among technical replicates. To more closely examine these significant family effects, we used Tukey's range test to perform multiple comparisons of the four families studied (Table 2).
For EST [EX956386], three different statistical outcomes were observed (named A, A' and B, respectively). For primer pairs 2, 3, 10 and 11, the level of cDNA (relative to Elongation factor 1 α mRNA) was significantly higher for Family 65 than for the three other families, which belong to the same statistical group (pattern A; group b in figure 2). In contrast, primer pair 1 distinguishes Family 65 from families 25 and 34, but not from the remaining family.

Primer | Sequence (5'-3') | Position | Length (nt) | Tm (°C) | GC (%) | Amplicon (bp)
f13 | TATGGGTCCCAAATCAGGTCA | 177 | 21 | 59 | 48 | 51
r13 | TGGAAGCAACTCTGGAAACGAT | 227 | 22 | 60 | 45 |
f14 | CGTTTCCAGAGTTGCTTCCAC | 208 | 21 | 58 | 52 | 51
r14 | AATGTGTTCGAATCATGGTCGTT | 258 | 23 | 59 | 39 |

Figure caption: Outcomes of level of cDNA expression across families for primers (3, 5, 6, 7, 8, 9) technically validated through the analysis of the melting curve of the EST [AJ565694].
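The GC content and amplicon sizes listed for primers f13/r13 and f14/r14 can be checked with a few lines of code. This is a minimal sketch. It assumes that the position given for a forward primer is its first base on the template and that the position given for a reverse primer is the last base of the amplicon, an interpretation that reproduces the listed 51 bp products. The function names are illustrative.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a primer sequence."""
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def amplicon_size(forward_position, reverse_position):
    """Amplicon length when both positions bound the product inclusively."""
    return reverse_position - forward_position + 1


# Sequences and positions as listed for these four primers.
primers = {
    "f13": ("TATGGGTCCCAAATCAGGTCA", 177),
    "r13": ("TGGAAGCAACTCTGGAAACGAT", 227),
    "f14": ("CGTTTCCAGAGTTGCTTCCAC", 208),
    "r14": ("AATGTGTTCGAATCATGGTCGTT", 258),
}

for name, (seq, position) in primers.items():
    print(f"{name}: length {len(seq)} nt, GC {gc_percent(seq):.0f}%")

for fwd, rev in (("f13", "r13"), ("f14", "r14")):
    size = amplicon_size(primers[fwd][1], primers[rev][1])
    print(f"{fwd}/{rev}: amplicon {size} bp")
```

Amplicons this short leave little room to move a primer away from a polymorphic site, which is one reason several alternative primer pairs per target are worth testing.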

Discussion
The variation in expression patterns among families that we observed for the same EST fragment using different primer pairs highlights the complexity of interpreting Q-PCR results and raises serious questions regarding the use of Q-PCR to validate the results of whole-transcriptome screening procedures such as cDNA-AFLP. For both ESTs, depending on the primer pairs used, statistical comparisons of the estimated levels of gene transcription across the four families lead to three different statistical outcomes with different biological implications. Using standard criteria, all of the primer pairs selected would be acceptable insofar as they all produce a single product according to the melting curve analysis. However, different statistical results are obtained with different primers, and it is impossible with these data alone to determine which of these outcomes, if any, is correct.
To address this question more rigorously, it is necessary to look more closely at the plot of PCR cycle number against PCR product amount (figure 2) and the resulting values of Ct (table 2).
Focusing first on EST [EX956386], it is interesting to observe the similarity of Ct values (28 ± 0.5) across families for primer pairs 1 and 3 (table 2). The profiles generated by these two primer pairs are distinguishable from those generated by primer pairs 2, 10 and 11, but even so the final outcomes show a significantly higher level of cDNA expression for Family 65 compared to the other families. For primer pairs 2, 10 and 11, the mean Ct values of Family 65 are respectively 29.82, 35.93, and 30.56, but the mean Ct values of the three other families are at least 4 cycles greater. We hypothesize that the presence of null alleles (i.e. poor primer binding) in Families 4, 25 and 34, but only for primer pairs 2, 10 and 11, explains these results.

Figure caption: Outcomes of level of cDNA expression across families for primers (12, 13, 14) technically validated through the analysis of the melting curve of the EST [AJ565694].

To test this hypothesis, we sequenced 16 clones of the original fragment from the original cDNA-AFLP library. This cDNA is the result of a normalized pool of cDNA collected from 16 individual oysters. An examination of the 16 sequences reveals the presence of polymorphism (figure 5). We observed five of the eight SNPs in more than one clone, making it unlikely, although not entirely impossible, that they are polymerase errors introduced during amplification. The observed polymorphisms are notably located in the priming sites of primer pairs 2, 10 and 11 (figure 1) but potentially affect the priming sites of primer pairs 1, 3 and 6 as well. The case of primer pair 6 is more difficult to interpret. Ct values are close across families. However, the level of cDNA appears to be higher in Family 34. As mentioned before, variation in PCR efficiencies must be accounted for and the raw Ct values cannot be compared directly unless it can be assumed that all PCR reactions had equal efficiencies.
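For orientation only, and under exactly that equal-efficiency assumption, a gap of 4 cycles between families can be translated into an apparent fold difference in starting template. The efficiency values below are illustrative and are not estimates from this study, and normalization to the reference gene is ignored.

```latex
\[
\frac{N_{0,\mathrm{Family\ 65}}}{N_{0,\mathrm{other}}} = (1+E)^{\Delta C_t},
\qquad
\Delta C_t = 4:\quad
(1+1)^{4} = 16 \;\; (E = 1),
\qquad
(1+0.8)^{4} \approx 10.5 \;\; (E = 0.8).
\]
```

A shift of this size would be read as a more than ten-fold difference in expression, even if it arose entirely from poor primer binding in the affected families.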
This underscores the importance of directly estimating PCR efficiencies because this correction can have substantial impacts on the estimates obtained. In this regard, the use of the Log (fluorescence) versus cycle number plot in the linear regression approach [18] can be viewed as a reliable measure of PCR efficiency. In contrast to the method of serial dilutions based solely on Ct estimates, LinRegPCR analyzes the kinetics of individual Q-PCR reactions and includes a number of data points belonging to the log-linear phase of the PCR reaction (i.e. the exponential phase). Moreover, the method of dilution series results in only one value of efficiency for all dilutions, even though efficiency varies as the input concentration changes [19].
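The idea of estimating efficiency from the log-linear phase of a single reaction can be sketched as follows. This is a simplified illustration of the window-of-linearity principle and is not the LinRegPCR implementation. It assumes baseline-corrected, strictly positive fluorescence readings, and the function names and the synthetic curve are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares slope, intercept and Pearson r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # A perfectly flat window carries no information about the slope.
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, mean_y - slope * mean_x, r


def estimate_efficiency(fluorescence, window_sizes=(5, 6)):
    """Return (E, r, first_cycle) for the best log-linear window.

    Fits log10(fluorescence) against cycle number in every window of
    5 or 6 consecutive cycles and keeps the one with the highest r.
    The slope of that fit gives E = 10 ** slope - 1.
    """
    logs = [math.log10(f) for f in fluorescence]
    best = None
    for size in window_sizes:
        for start in range(len(logs) - size + 1):
            cycles = list(range(start + 1, start + size + 1))
            slope, _, r = linear_fit(cycles, logs[start:start + size])
            if best is None or r > best[1]:
                best = (10 ** slope - 1.0, r, start + 1)
    return best


# Synthetic logistic amplification curve with a true efficiency of 0.9.
curve = [1.0 / (1.0 + (1e6 - 1.0) * 1.9 ** (-c)) for c in range(1, 41)]
efficiency, r_value, first_cycle = estimate_efficiency(curve)
print(f"E = {efficiency:.3f} (r = {r_value:.5f}, window starts at cycle {first_cycle})")
```

With real data, baseline noise in early cycles and the plateau in late cycles both degrade the fit, so dedicated software applies additional criteria when choosing the window.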
Overall, primer pairs 1 and 3 seem to be unaffected by the observed polymorphisms, while primer pairs 2, 10 and 11 under-estimate the level of expression of Families 4, 25 and 34 relative to Family 65 due to null alleles caused by sequence variation in the priming regions, even though all of these primer pairs produce a single product in the melt-curve analysis and are thus acceptable by standard criteria.
Once again, it seems reasonable to conclude that null allele issues in families 34 and 65 that depend on the primer pair used have profound impacts on the estimates and result in an underestimation of the level of expression in the affected families. Finally, pattern C (primer pair 14) is intermediate, presumably affected to a lesser extent by the null allele issue. Overall, patterns B and C seem to be driven by artefacts rather than biology. Pattern A is not only the most frequent (5/9) but also the one corresponding to the most logical explanation.
There are few examples of how sequence polymorphism affects Q-PCR results in the published literature, but, in a recent study, Stevenson et al. [20] demonstrated how SNPs within a probe-binding region can adversely influence the sensitivity of real time PCR assays. The idea is that the presence of mismatches (SNPs) between a probe and a sequence target will lower the melting temperature. This conclusion was drawn by using probes for detection of herpes simplex virus. In the present case, such a statement might be applicable as well, even though SYBR Green chemistry is known to be sequence-independent. Sequence polymorphism among alleles in the different families influences the efficiency of primer binding and therefore the overall efficiency of the assays.
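When clone or allele sequences are available, a first screen for primers at risk can be automated. The sketch below finds polymorphic columns in a set of aligned sequences and reports which priming sites contain them. The sequences, primer names and coordinates are hypothetical toy values, and gap characters, if present, are simply treated as another state.

```python
def polymorphic_positions(aligned_sequences):
    """1-based alignment columns where more than one state is observed."""
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("sequences must be aligned to equal length")
    return [
        i + 1
        for i in range(length)
        if len({s[i].upper() for s in aligned_sequences}) > 1
    ]


def primers_overlapping_snps(primer_sites, snp_positions):
    """Map primer name -> polymorphic positions inside its priming site.

    `primer_sites` maps a name to an inclusive, 1-based (start, end) pair.
    """
    hits = {}
    for name, (start, end) in primer_sites.items():
        inside = [p for p in snp_positions if start <= p <= end]
        if inside:
            hits[name] = inside
    return hits


# Toy alignment of three clones (hypothetical sequences).
clones = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGAACGTACGTACGT",
    "ACGTACGTACGTACGTTCGT",
]
# Hypothetical priming sites on the same coordinates.
primer_sites = {"fwd_a": (1, 6), "fwd_b": (5, 10), "rev_a": (15, 20)}

snps = polymorphic_positions(clones)
at_risk = primers_overlapping_snps(primer_sites, snps)
print("polymorphic positions:", snps)
print("primers overlapping polymorphisms:", at_risk)
```

A screen of this kind only flags candidate problems; whether a given mismatch actually impairs binding depends on its position within the primer and on the reaction conditions.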

Conclusion
Our study demonstrates that careful and rigorous primer optimization and an examination of sequence variation among families or individuals are critical steps before real time PCR assays are used to complement whole transcriptome analyses, especially when dealing with short fragments such as those generated by differential display techniques. Statistical outcomes can be profoundly influenced by polymorphisms in the sequence under study if they cause poor binding of primers or poor amplification. These artefacts cannot be detected using standard melt-curve analyses because they have purely quantitative rather than qualitative effects. For this reason, when working with genetically diverse biological material, it is strongly recommended to test multiple primer pairs and, if at all possible, to examine the sequences investigated for polymorphisms in priming regions to avoid erroneous conclusions.