Comparison of unamplified and amplified targets
A typical sample of amplified Super SMART™ PCR product yields a size distribution of 500–6000 bp with a peak centered at 900 bp (Clontech, Super Smart PCR cDNA Synthesis Kit User manual). A typical sample of amplified Message Amp™ aRNA yields a size distribution of 250–5500 nt with a peak centered at 1000–1500 nt (Ambion, Catalog #1752). The distributions of our amplified material agree well with the manufacturers' data (see Additional files 1 and 2). It has been reported that PCR amplification requires less RNA, is more reproducible and generates better target transcripts than linear amplification [5, 13], at least if the sequences are limited to the 3' end. Linear T7 amplification has nevertheless been widely used when starting material is limiting. Recently, some researchers have reported amplification bias in their data. In some studies the bias is described as minor, systematic and reproducible, affecting all samples in the same way and therefore potentially controllable in the normalization (e.g. when calculating fold change) [10, 14]. In other studies the bias introduced by different amplification protocols affects the overall ratios of gene expression [5, 12]. Part of the bias may arise from the intrinsic nucleolytic activity of T7 RNA polymerase, which appears during extended incubation [15]. Further bias may be introduced by the characteristics of the individual transcripts.
In earlier membrane array experiments, in which targets were prepared from samples of lignified planings and nonlignified xylem scrapings, we found preferential amplification of certain nucleotide sequences by Super SMART™ PCR relative to an unamplified target (data not shown). The correlation (R2) between transcript abundance in unamplified and Super SMART™ PCR amplified targets was 0.77 for scrapings and 0.68 for planings.
In a comparison of five lines of Picea abies shoots, in which the first biological replicate consisted of unamplified targets and the second of targets amplified by T7 transcription, we obtained a correlation of R2 = 0.74 (data not shown). Ambion has reported R2 = 0.87 [16] between technical repeats.
A plot of individual gene transcript abundance for unamplified versus amplified targets should give a straight line with a slope of 1 if the overall expression pattern is preserved. However, both comparisons show some nonlinear behavior. For unamplified versus PCR amplified targets the curve is generally nonlinear: low-abundance transcripts are under-represented, whereas highly expressed transcripts are amplified better than average. For unamplified versus T7 amplified targets, only a very small minority of highly expressed transcripts deviate from the line with a slope of around 1.
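The expectation of a straight line with slope 1 can be checked numerically. The following is a minimal sketch, not the analysis used in this study: it fits a least-squares line to log2-transformed abundances and reports the slope and R2. The function name `slope_and_r2` and the intensity values are hypothetical and chosen only for illustration; working on a log2 scale is also an assumption.

```python
from math import log2


def slope_and_r2(x, y):
    """Return the least-squares slope of y on x and the squared Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


# Hypothetical signal intensities for five genes (NOT data from this study).
unamplified = [120.0, 340.0, 900.0, 2500.0, 8000.0]
amplified = [60.0, 250.0, 900.0, 3400.0, 15000.0]

# Compare on a log2 scale so that proportional changes become additive.
log_unamplified = [log2(v) for v in unamplified]
log_amplified = [log2(v) for v in amplified]

slope, r2 = slope_and_r2(log_unamplified, log_amplified)
print(f"slope = {slope:.2f}, R2 = {r2:.2f}")
```

On a log scale, a slope above 1 would correspond to the pattern described for PCR amplified targets, in which low-abundance transcripts are under-represented and abundant transcripts are over-represented.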
For the comparison between unamplified and PCR amplified targets, the 95% confidence intervals for the fold changes were as follows. For unamplified material: downregulation, 2.3–8.0; upregulation, 1.1–1.3. For PCR amplified material: downregulation, 1.3–1.7; upregulation, 1.0–3.0. The differences between unamplified and PCR amplified targets were statistically significant.
For the highly significant (p < 0.0001) differentially expressed genes between lines in each of the ten comparisons of unamplified and T7 amplified targets, the 95% confidence intervals for the fold changes were as follows. For unamplified material: downregulation, 1.5–3.2 (all) and 2.7–7.5 (top); upregulation, 1.3–2.8 (all) and 2.6–4.4 (top). For T7 amplified material: downregulation, 1.0–2.7 (all) and 2.5–4.2 (top); upregulation, 1.2–3.0 (all) and 2.3–4.3 (top). The differences between unamplified and T7 amplified targets were generally not statistically significant. The fold changes for the unamplified targets were nevertheless greater than those for the T7 amplified targets, indicating that some small bias may still exist when T7 amplified rather than unamplified targets are used, especially for highly expressed transcripts.
However, in many situations unamplified targets cannot be used and amplification is required. Therefore, starting with small amounts of secondary xylem tissue, we compared the PCR and T7 RNA polymerase amplification methods directly to investigate whether, and how, their biases differ.
Expression characteristics of transcripts amplified by PCR or T7 transcription
The two amplification methods were compared with each other four times and with themselves twice in a fully balanced flip-dye experimental design including technical repeats (Figure 1A). Only a few spots were flagged as bad and excluded from further analysis. The percentage of detectable spots (above background) on each array and in each channel was 88% using T7 amplification and 71% using PCR amplification. The percentage of saturated spots was around 1% in all cases.
After normalization, the correlation of transcript abundance for each gene between technical repeats was very high (R2 = 0.98) after both PCR (Figure 1B) and T7 amplification (Figure 1C). In contrast, the correlation between the two amplification methods was considerably lower for both technical repeats (R2 = 0.52; Figure 1D), indicating bias in one or both amplification techniques. As mentioned above, the correlation between unamplified and amplified transcript abundance was intermediate, indicating that both amplification methods introduce bias and that these biases differ from each other.
The genes present on the microarray were divided into two groups according to whether the PCR amplified transcripts (S') or the T7 amplified transcripts (M') were more abundant. The S' group was 9% larger than the M' group.
A relative frequency distribution plot of expression levels revealed a narrower peak for S' than for M' transcripts (Figure 2A). The arithmetic expression values showed a significantly greater mean for M' (1.76) than for S' (1.64) and a higher variance although the coefficient of variation was lower for M' (81.6%) than for S' (86.6%). The distribution of the data implies a broader population of transcript species present in the T7 amplified target.
Using the criteria for statistical significance described in Methods, 309 ESTs (14%) showed different expression levels between the two amplification methods, with 131 ESTs in the S' group and 178 ESTs in the M' group. The arithmetic mean of the S' group (3.40) was significantly higher than that of the M' group (2.95), and the S' group had higher variance (Figure 2B). The coefficient of variation was lower for M' (33.4%) than for S' (36.3%). The opposite trend observed for this subset of genes may reflect the differences in detectable spots and in amplification kinetics between PCR and T7 transcription.
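The statements about variance and coefficient of variation can be cross-checked from the reported means and coefficients of variation alone. The sketch below assumes that the coefficient of variation is the standard deviation divided by the mean of the arithmetic expression values, which is the usual definition but is not stated explicitly here; the helper name `sd_from_cv` is hypothetical.

```python
def sd_from_cv(mean, cv_percent):
    """Recover the standard deviation from a mean and a CV given in percent.

    Assumes CV = SD / mean (an assumption; the text does not define CV).
    """
    return mean * cv_percent / 100.0


# (mean, CV in percent) as reported in the text for each group.
reported = {
    "all genes": {"M'": (1.76, 81.6), "S'": (1.64, 86.6)},
    "309 ESTs": {"M'": (2.95, 33.4), "S'": (3.40, 36.3)},
}

variances = {}
for subset, groups in reported.items():
    for group, (mean, cv) in groups.items():
        sd = sd_from_cv(mean, cv)
        variances[(subset, group)] = sd ** 2
        print(f"{subset:9s} {group}: SD = {sd:.2f}, variance = {sd ** 2:.2f}")
```

Under this assumption the implied variance is larger for M' among all genes and larger for S' among the 309 ESTs, in line with the text, even though M' has the lower coefficient of variation in both cases.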
Characteristics of transcripts amplified by PCR or T7 transcription
As shown above, of the 309 ESTs showing statistically significant abundance differences between the amplification methods, 36% more were found in the M' group than in the S' group. One possible explanation is that the complexity of the T7 amplified transcripts is greater. To assess this, we analyzed the length of the sequences on the array. Previous analyses of protein sequences showed that about half of the Pinus taeda ESTs on the array have an apparent homolog in Arabidopsis thaliana (increasing with length up to 90%). For these ESTs the sequence similarity is typically distributed over the full length of the contig, indicating substantial conservation of genes between the two species and suggesting a common functional genome [17]. From the BLASTn™ (nucleotide level) and BLASTx™ (amino acid level) searches relating the contig data to Arabidopsis thaliana homologs, the lengths of the corresponding Pinus taeda full-length cDNAs were estimated. The contig lengths constitute on average about 45% of the total cDNA lengths spotted on the array. At both the nucleotide and the amino acid levels, the variance in length was 60% greater in the M' group than in the S' group, a highly significant difference. At the amino acid level the mean length of the M' group (1580 bp) was a significant 26.9% greater than that of the S' group (1245 bp). The maximum transcript length was also considerably greater in the M' group than in the S' group (Figure 2C). In contrast to the contigs, the singleton ESTs in the S' group (482 bp) had a significantly greater mean sequence length than those in the M' group (428 bp). The reason for this discrepancy is unclear, but it could reflect a difference in the efficiency of the sequencing polymerase resulting from a difference in the amount of secondary structure in the sequences of the two sets.
The M' group contained 60% of the ESTs and contigs with nucleotide and amino acid homology to Arabidopsis thaliana, reflecting both an initially greater transcript population and differences in transcript lengths. In conclusion, T7 amplification is considerably more likely than PCR amplification to yield transcripts of greater length and larger variability.
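The relative differences quoted in this section follow from simple ratios of the reported counts and lengths. As a worked check (the helper `percent_greater` is a hypothetical name introduced only for illustration):

```python
def percent_greater(larger, smaller):
    """Return by how much `larger` exceeds `smaller`, in percent of `smaller`."""
    return (larger - smaller) / smaller * 100.0


# Counts and mean lengths as reported in the text.
est_excess = percent_greater(178, 131)        # significant ESTs, M' vs S'
length_excess = percent_greater(1580, 1245)   # mean length, M' vs S'
singleton_excess = percent_greater(482, 428)  # singleton EST length, S' vs M'

print(f"ESTs in M' vs S':           {est_excess:.1f}%")
print(f"Mean length M' vs S':       {length_excess:.1f}%")
print(f"Singleton length S' vs M':  {singleton_excess:.1f}%")
```

The first two values reproduce the 36% and 26.9% quoted above. The singleton difference is not expressed as a percentage in the text; under the same calculation it is about 12.6%.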
Importance of GC content for amplification
For the selected genes (309 ESTs) differentially represented between the two amplification methods, the mean GC content of the ESTs, contigs and Arabidopsis thaliana cDNAs (at the nucleotide level) was significantly greater for the sequences of the S' group than for those of the M' group. The difference was 2.7 percentage units for ESTs and 1.4 percentage units for the corresponding contigs (Figure 2D). There was a similar difference for the cDNAs, although only about 10% of the contigs had a BLASTn™ score above 100 bits. Interestingly, for a smaller group of 80 contigs (40 from S' and 40 from M') showing the greatest fold changes between methods, the difference in GC content increased from 1.4 to 2.2 percentage units, owing to an increase in GC content for the S' group. Additionally, the mean length of the 40 ESTs from the S' group (1428 bp) was significantly greater than that of the 40 ESTs from the M' group (1275 bp). It appears that transcripts with a high GC content are amplified faster by PCR than by T7, often overriding the effect of length. If the GC content is nearer the average, long transcripts are favored by T7 amplification. The GC effect is presumably explained by the temperature of extension, which is 68–72°C for Taq polymerase and 37°C for T7 polymerase; high temperature favors polymerization through GC-rich regions. Evolution has in general tuned the cellular machinery, including polymerases, to fit the temperature environment of an organism. This may be reflected in the relationship between GC content and the temperature environment of the organism from which each polymerase originates. The GC content of a Pinus species genome is about 40%, which is considerably closer to the 48% GC content of T7 phage (or the 50% GC content of Escherichia coli, the typical host of T7 phage) than to the 67% GC content of Thermus aquaticus [18–20].
This implies that, in most cases, T7 transcription is a better choice than PCR-based techniques for amplifying the Pinus taeda transcriptome and, consequently, other transcriptomes with similar GC content.
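The GC comparison in this section rests on two simple quantities: the GC percentage of each sequence and the difference between group means, expressed in percentage units. A minimal sketch follows. The function names and the short sequences are hypothetical and are not taken from the array, and ignoring ambiguous bases such as N is an assumption.

```python
def gc_percent(seq):
    """GC content of a nucleotide sequence in percent.

    Symbols other than A, C, G and T (for example N) are ignored,
    which is an assumption made for this sketch.
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def mean_gc(seqs):
    """Arithmetic mean of the per-sequence GC percentages."""
    return sum(gc_percent(s) for s in seqs) / len(seqs)


# Hypothetical example sequences (NOT sequences from the array).
s_group = ["ATGGCGCCGTTAGC", "GGCATGCCAGTCGA"]
m_group = ["ATGATTAACGTTAG", "TTGCAATAGCATTA"]

# Difference between group means, in percentage units (not percent of a value).
diff_units = mean_gc(s_group) - mean_gc(m_group)
print(f"mean GC S' = {mean_gc(s_group):.1f}%, mean GC M' = {mean_gc(m_group):.1f}%")
print(f"difference = {diff_units:.1f} percentage units")
```

Averaging per-sequence percentages weights every sequence equally regardless of length; pooling all bases before computing the percentage would weight long sequences more heavily, and the text does not say which approach was used.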