Comparing gene discovery from Affymetrix GeneChip microarrays and Clontech PCR-select cDNA subtraction: a case study

Background Several high throughput technologies have been employed to identify differentially regulated genes that may be molecular targets for drug discovery. Here we compared the sets of differentially regulated genes discovered using two experimental approaches: a suppression subtractive hybridization (SSH) cDNA library methodology and Affymetrix GeneChip® technology. In this "case study" we explored the transcriptional pattern changes during the in vitro differentiation of human monocytes to myeloid dendritic cells (DC), and evaluated the potential for novel gene discovery using the SSH methodology.
Results The same RNA samples isolated from peripheral blood monocyte precursors and immature DC (iDC) were used for GeneChip microarray probing and SSH cDNA library construction. From each of the two-way SSH libraries (iDC-monocytes and monocytes-iDC), 10,000 clones were picked for sequencing. About 2000 transcripts were identified for each library from 8000 successful sequences. Only 70% to 75% of these transcripts were represented on the U95 series GeneChip microarrays, implying that 25% to 30% of these transcripts might not have been identified in a study based only on GeneChip microarrays. In addition, about 10% of these transcripts appeared to be "novel", although these have not yet been closely examined. Among the transcripts that are also represented on the chips, about a third were concordantly discovered as differentially regulated between iDC and monocytes by GeneChip microarray transcript profiling. The remaining two thirds were either not inferred as differentially regulated from GeneChip microarray data, or were called differentially regulated but in the opposite direction. This underscores the importance both of generating reciprocal pairs of SSH libraries and of confirming the results by real-time RT-PCR.
Conclusions This study suggests that SSH could be used as an alternative and complementary transcript profiling tool to GeneChip microarrays, especially in identifying novel genes and transcripts of low abundance.


Background
Gene expression profiling has become an invaluable tool in functional genomics. Since the mid-1990s, DNA microarrays [1][2][3], cDNA subtraction [4][5][6][7] and Serial Analysis of Gene Expression (SAGE) [8] have emerged as the leading transcript profiling technologies in the global analysis of biological systems. One of these high throughput technologies, the high-density oligonucleotide GeneChip® microarray manufactured by Affymetrix [1,3,9], makes it possible to simultaneously measure the relative abundance of thousands of mRNAs in a cell. However, DNA microarray technology is limited by its insensitivity to transcripts of low abundance [10], and a similarly low sensitivity has been seen with SAGE [10]. More recently, a PCR-select cDNA subtraction method called suppression subtractive hybridization (SSH) was developed by Clontech. Because it includes a normalization step that equalizes the abundance of cDNAs within the target population, SSH makes it possible to detect some low abundance transcripts [4,7]. Although custom DNA microarrays have been used in combination with cDNA subtraction to identify differentially expressed genes [11][12][13][14][15], no direct comparison of the sensitivity and bias of the SSH and GeneChip technologies has been reported so far.
To evaluate the SSH and GeneChip technologies side by side, we explored the similarities and differences in the regulated genes discovered by each. We compared the regulated genes identified through SSH with the genes found to be differentially regulated using GeneChip microarrays in a human dendritic cell (DC) differentiation paradigm. We regard this as a "case study" of the potential for novel gene discovery using SSH methodology that would not be accessible using Affymetrix profiling alone. The same RNA samples, isolated from immature DC (iDC) and from monocytes, were used for both GeneChip microarray probing and SSH library construction. Overall, about two thirds of the transcripts identified using SSH methodology were not identified using GeneChip microarrays alone. These results suggest that SSH could be used as an alternative and complementary transcript profiling tool to Affymetrix GeneChip microarrays, especially in identifying novel genes or transcripts of low abundance.

Genes not represented on Affymetrix GeneChip microarrays can be identified through SSH
Reciprocally subtracted cDNA libraries between immature dendritic cells (iDC) and monocytes were generated using the SSH technology developed by Clontech (H56 stands for iDC minus monocytes, and H57 stands for monocytes minus iDC). Deep sequencing of the SSH cDNA libraries was carried out by randomly picking 10000 clones from each library for DNA sequence analysis. After accumulating more than 8000 lanes of successful sequences for each library, a total of 1940 transcripts were identified for H56. These transcripts were further extended in silico as described in Methods (Figure 1), and the extended transcripts were mapped to Affymetrix GeneChip microarray qualifiers. About 73% of the transcripts identified in H56 were represented on the U95 series GeneChip microarrays (Figure 2). However, about 17% of the transcripts identified through deep sequencing of the cDNA subtractive library were not represented on those chips, and hence could not have been identified in a study based purely on GeneChip microarray analysis. This number did not change significantly when the newer U133 series GeneChip microarrays were used (data not shown). In addition, about 10% of the transcripts identified appeared to be novel sequences without any match in the three cDNA databases we searched, and it is unlikely that these transcripts were represented on GeneChip microarrays.
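The coverage breakdown for H56 can be restated as a simple tally. The sketch below is illustrative only: the category names are hypothetical labels, and the counts are back-calculated from the rounded percentages quoted in the text, so they are approximations rather than values reported in the study.

```python
# Approximate transcript counts implied by the rounded percentages for H56.
# Category names are hypothetical labels chosen for this illustration.
TOTAL_H56_TRANSCRIPTS = 1940

reported_percentages = {
    "on_chip": 73,            # represented on U95 series GeneChip microarrays
    "known_not_on_chip": 17,  # database match, but no qualifier on the chips
    "novel": 10,              # no match in the cDNA databases searched
}

def approximate_counts(total, percentages):
    """Convert rounded percentages into approximate transcript counts."""
    return {name: round(total * pct / 100) for name, pct in percentages.items()}

counts = approximate_counts(TOTAL_H56_TRANSCRIPTS, reported_percentages)

# Transcripts that a chip-only study could not have found, by design.
inaccessible_to_chip = counts["known_not_on_chip"] + counts["novel"]
```

Under these rounded figures, roughly a quarter of the H56 transcripts fall outside what the U95 chips could report, whatever the quality of the hybridization.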

Certain genes present in the SSH libraries and represented on GeneChip microarrays were not detected through GeneChip microarray analysis
To find out how many of the SSH-detected genes with probes on the GeneChip microarrays can actually be detected by the Affymetrix technology, we used the subtracted cDNA to synthesize labeled complementary RNA (cRNA) targets for the GeneChip microarrays. A T7 promoter in the SSH PCR primers allowed us to perform in vitro transcription with the SSH cDNA. Since the SSH library is derived from cDNA primed from the 3' end, we assume that any transcripts detected by sequencing the library are potentially represented by 3' fragments, whether or not the sequenced fragment is localized to the 3' end of the transcript. This is critical because Affymetrix probes are designed to interact with the 3' regions of the targeted transcripts. When the GeneChip microarrays were screened with targets made from the SSH cDNA, 571 out of the 1409 transcripts were given "absent" calls, suggesting that no positive signals could be detected on the GeneChip microarrays for these transcripts, even though the presence of these 571 transcripts had been confirmed by sequencing (Figure 3a). Next we asked whether the transcripts undetectable by the GeneChip microarrays were limited to transcripts of low abundance. Although the abundance of genes had been normalized in the SSH, the frequency of each gene in the SSH cDNA library can still be used as a relative indicator of its abundance, because the same SSH cDNA samples were used to label the cRNA targets for the GeneChip microarrays. Here we used the number of sequenced cDNA clones belonging to each transcript as a measurement of the copy number of this transcript in the SSH cDNA library. As shown in Figure 3b, the transcripts scored as "absent" using GeneChip microarrays include genes with high and low copy numbers. The distribution pattern of the copy numbers of this group was very similar to that of the group of transcripts scored as "present" by the GeneChip microarrays.
This suggests that there are some inefficiencies in the GeneChip microarray technology that are independent of transcript abundance.
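The comparison behind Figure 3b amounts to grouping clone copy numbers by detection call and comparing the two distributions. The following is a minimal sketch under stated assumptions: the record layout and the toy values are invented for illustration and are not data from the study; only the 571 and 1409 figures come from the text.

```python
from statistics import median

# Fraction of on-chip H56 transcripts given "absent" calls (figures from the text).
ABSENT_FRACTION = 571 / 1409

# Hypothetical toy records: (transcript_id, clone_copy_number, detection_call).
toy_transcripts = [
    ("t1", 1, "present"), ("t2", 3, "present"), ("t3", 12, "present"),
    ("t4", 1, "absent"), ("t5", 4, "absent"), ("t6", 11, "absent"),
]

def copy_numbers_by_call(records):
    """Group clone copy numbers by their GeneChip detection call."""
    groups = {"present": [], "absent": []}
    for _transcript_id, copy_number, call in records:
        groups[call].append(copy_number)
    return groups

def summarize(values):
    """Return (minimum, median, maximum) of a list of copy numbers."""
    return min(values), median(values), max(values)

groups = copy_numbers_by_call(toy_transcripts)
summary = {call: summarize(values) for call, values in groups.items()}
```

If "absent" calls were driven by low abundance alone, the "absent" group would be expected to cluster at low copy numbers; similar ranges and medians in both groups, as in the toy data, are what an abundance-independent effect would look like.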

Discrepancy between genes identified through SSH and genes identified through GeneChip microarray analysis
Among the transcripts that were found in the H56 library (iDC minus monocytes) and were also detectable on the GeneChip microarrays, about a third were concordantly discovered as up-regulated in iDC based on GeneChip microarray profiling using non-subtracted RNA samples (labeled as "GeneChip Concordant" in Figure 4a). The remaining two thirds were either not inferred as differentially regulated (labeled as "GeneChip no Change" in Figure 4a), or were called down-regulated in iDC according to GeneChip microarray profiling data, contradictory to their presence in the iDC minus monocytes SSH cDNA library (labeled as "GeneChip Contradictory" in Figure 4a). As expected, the fraction appearing to contradict the results of GeneChip microarray profiling is diminished somewhat by subtracting out the 408 genes that appeared in both of the two reciprocally subtracted libraries H56 and H57 (Figure 4b). This underscores the importance of generating reciprocal pairs of SSH libraries.
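The three Figure 4 categories, and the effect of excluding transcripts shared with the reciprocal library, can be sketched as a small tally. This is an illustrative sketch only: the function name, the input call names ("up", "none", "down") and the toy identifiers are hypothetical, and only the category labels are taken from the figure.

```python
# Category labels follow Figure 4; the keys are hypothetical call names
# describing the GeneChip result for iDC relative to monocytes.
FIGURE4_LABELS = {
    "up": "GeneChip Concordant",
    "none": "GeneChip no Change",
    "down": "GeneChip Contradictory",
}

def tally_h56(chip_calls, h56_ids, h57_ids, unique_only=False):
    """Count H56 transcripts per Figure 4 category.

    chip_calls maps transcript id -> "up", "none" or "down".
    With unique_only=True, transcripts also found in H57 are excluded first.
    """
    ids = set(h56_ids)
    if unique_only:
        ids -= set(h57_ids)
    tally = {label: 0 for label in FIGURE4_LABELS.values()}
    for transcript_id in ids:
        tally[FIGURE4_LABELS[chip_calls[transcript_id]]] += 1
    return tally

# Toy example (invented identifiers, not data from the study).
toy_calls = {"a": "up", "b": "none", "c": "down", "d": "down", "e": "up"}
all_h56 = tally_h56(toy_calls, ["a", "b", "c", "d", "e"], ["d", "x"])
unique_h56 = tally_h56(toy_calls, ["a", "b", "c", "d", "e"], ["d", "x"],
                       unique_only=True)
```

In the toy data, removing the transcript shared with H57 lowers the contradictory count from 2 to 1, which mirrors the direction of the change described for Figure 4b.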

Real-time RT-PCR analysis of selected genes identified through SSH
To find out how genes with conflicting SSH data and GeneChip microarray data are differentially expressed, we used the more sensitive real-time RT-PCR (TaqMan® analysis) to quantitate the RNA levels of selected genes in the iDC and monocyte samples used for both the SSH and GeneChip microarray analyses. As shown in Table 2, among the 4 genes that appeared only in the H56 library (iDC minus monocytes) but were suggested by GeneChip microarray profiling to be up-regulated in monocytes, three had higher levels of expression in iDC, while one had a higher level of expression in monocytes. These data suggest that there are false positives in both the SSH data and the GeneChip® profiling data. By using SSH in addition to GeneChip microarray profiling, we can identify some differentially expressed genes with false GeneChip microarray profiling results. However, more sensitive RNA quantitation methods, such as real-time RT-PCR analysis, are needed for more reliable verification of these differentially expressed genes. Since all RNA quantitation methods, including real-time RT-PCR, have their limitations, further validation of the differential gene expression pattern might need to be carried out. For example, Northern blot may not be as sensitive as real-time RT-PCR, but the size of the bands on the blot may be used as an indication of the specificity of the signals. If antibodies are available for the gene products under study, Western blot, flow cytometry and other protein analysis tools may also be used to verify the differential gene expression pattern.

Figure 1
Annotation and mapping of SSH sequences to GeneChip® microarray qualifiers (see Methods).
[Flowchart boxes: public database; novel sequences; complete set of extended sequences with annotation and copy numbers; map to GeneChip® qualifiers.]

Discussion
In this study, we evaluated the similarities and differences in genes discovered using SSH and GeneChip microarrays by comparing the genes found to be differentially expressed during DC differentiation from monocytes using these two technical approaches. Our results showed that among the genes identified in the SSH libraries, more than half would not have been identified as differentially expressed by using GeneChip microarrays alone. Some of these genes were either novel or not represented on the GeneChip microarrays. However, a significant number of genes were missed by GeneChip microarray analysis despite the presence of probe sets for these genes on the microarrays; whether this number would be lower with the newer U133 series GeneChip microarrays remains untested.

Figure 3
Comparing the SSH data with GeneChip microarray data using subtracted samples as targets. The GeneChip microarrays were screened with cRNA targets made from the same subtracted cDNA used for SSH. (a) Number of transcripts in the H56 SSH library identified as "present" or "absent" on the GeneChip microarrays. (b) The copy number of each transcript in the SSH library plotted against its detectability on the GeneChip microarrays. Each dot represents a distinct transcript identified in the H56 SSH cDNA library. The transcripts that can be detected by the GeneChip microarrays were given "present" calls, while the transcripts that cannot be detected by the GeneChip microarrays were given "absent" calls.

Figure 4
Comparing the SSH data with GeneChip® data using non-subtracted RNA as targets. The GeneChip microarrays were screened with cRNA targets made from unmodified iDC and monocyte RNA samples. The concordance of the SSH data with the GeneChip data is shown (a) when all the transcripts in the H56 library were considered, or (b) when only the transcripts unique to H56 were considered, after the transcripts that appeared in both H56 and the reciprocally subtracted H57 library were excluded.

DNA microarrays are powerful tools that enable the global analysis of a variety of complex biological systems. The expression levels of thousands of genes can be monitored simultaneously using this high throughput, cost-effective technology. However, the technology is limited by its insensitivity to transcripts of low abundance, i.e. genes expressed at low levels or in a small fraction of the cells studied. Even some transcripts of high abundance can be missed by DNA microarrays because of poor hybridization between the probes and the labeled cRNA targets. One factor that could affect the hybridization step is the sequence targeted by the GeneChip probes. Since the GeneChip probes are 3'-biased to match the target generation characteristics of the sample amplification method, the sensitivity of some probes could be compromised either by their positioning toward the 5' region or by poor in vitro transcription efficiency caused by the complexity of their sequences. The complexity of these targeted sequences may also affect the hybridization efficiency between the labeled cRNA targets and the GeneChip probes. In contrast, the normalization step in the SSH protocol equalizes the abundance of cDNAs within the target population, and the subtraction step excludes the sequences common to the target and driver populations. A comprehensive analysis of at least 5000 to 10000 clones isolated from the SSH cDNA libraries may therefore enable the detection of some transcripts of low abundance that would not be revealed by other transcript profiling protocols. Genes not represented on the DNA microarrays, including some genes with novel identities, may also be identified by sequencing the SSH cDNA libraries. However, the construction and sequencing of subtractive cDNA libraries is time consuming and labor intensive, which limits the number of samples that can be surveyed with this technology in each study.

Conclusions
In practice, we suggest DNA microarrays as the preferred approach for transcript profiling of a large number of samples. This is especially true when the RNA is derived from homogeneous cell populations [10]. However, in a number of cases, such as clinical tissues, the relevant cell type may be difficult to purify or present in low abundance. In these cases, normalized subtractive cDNA libraries are preferable. Our results indicate that even though DNA microarrays and SSH may each be preferred in distinct situations, neither technique alone can adequately identify all regulated genes. Even when homogeneous cell populations were examined, as in this study, more than half of the genes discovered through sequencing the SSH libraries would not have been identified using GeneChip® technology alone. In conclusion, using normalized cDNA subtraction as an alternative and complementary transcript profiling tool to DNA microarrays will help identify novel genes and low abundance transcripts, thereby achieving a more comprehensive global view of the transcriptome in the biological system studied.

Affymetrix GeneChip ® Microarray studies
The cRNA labeling and hybridizations were performed according to protocols from Affymetrix Inc. (Santa Clara, CA). Briefly, the mRNA in 5 µg of total cellular RNA was converted to double-stranded cDNA using Superscript (Gibco-Invitrogen) with a T7-(dT)24 primer containing the T7 RNA polymerase promoter. The cDNA was in vitro transcribed to biotinylated complementary RNA (cRNA) by incorporating biotin-CTP and biotin-UTP using the Enzo Bio-Array High Yield RNA labeling kit (Enzo Diagnostics, New York, NY). Biotinylated cRNA from each sample was fragmented to approximately 40-100 bases, and 10 µg of the fragmented cRNA was hybridized to the Affymetrix human U95 probe array series (A, B, C, D, and E) for 16 h at 45°C with constant rotation at 60 rpm. Following washes, the hybridized chips were sequentially stained with streptavidin-phycoerythrin (Molecular Probes, Eugene, OR), biotinylated goat anti-streptavidin (Vector Laboratories, Burlingame, CA) and a second round of streptavidin-phycoerythrin for signal amplification. After a series of washes, chips were scanned with an argon-ion laser confocal microscope (Hewlett-Packard, Palo Alto, CA) for fluorescence signal detection. All washes and staining procedures were performed on an Affymetrix Fluidics station. The raw expression data derived from Affymetrix Microarray Suite 4.0.1 software gave each transcript an absolute expression level (signal intensity) and a "present" or "absent" call based on the signal/noise ratio. The data were analyzed on two levels. At the detection level, a call of "present" indicates that a positive signal was detected for a probe, while a call of "absent" indicates that no positive signal was detected for a probe. The gene expression ratio between samples for each donor was inferred using the PFOLD algorithm [19], which employs a Bayesian scheme for estimating the fold-change of gene expression and the significance of the change (P-value).
At the comparison level, a gene was defined as up-regulated in iDC if the signal log ratio between the iDC and monocyte samples was larger than 1 (equivalent to a more than 2-fold increase) and the transcript was called "present" in the target sample. RNA samples from 3 individuals were analyzed.
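The comparison-level rule can be written as a short predicate. The sketch below is a simplified illustration, not the PFOLD algorithm used in the study: it assumes the signal log ratio is a plain base-2 logarithm of the ratio of two positive signal intensities, and it assumes that the "present" requirement applies to the iDC sample. The function names are hypothetical.

```python
import math

def signal_log_ratio(idc_signal, monocyte_signal):
    """Base-2 log ratio of iDC to monocyte signal (assumes positive signals)."""
    return math.log2(idc_signal / monocyte_signal)

def is_upregulated_in_idc(idc_signal, monocyte_signal, idc_call):
    """Apply the rule: log ratio larger than 1 and a "present" call in iDC."""
    return (idc_call == "present"
            and signal_log_ratio(idc_signal, monocyte_signal) > 1)

# A 4-fold increase with a "present" call passes the rule.
example_pass = is_upregulated_in_idc(400.0, 100.0, "present")
# An exactly 2-fold increase gives a log ratio of 1, which is not larger than 1.
example_borderline = is_upregulated_in_idc(200.0, 100.0, "present")
# A large ratio with an "absent" call is rejected.
example_absent = is_upregulated_in_idc(400.0, 100.0, "absent")
```

Requiring the detection call alongside the ratio guards against large ratios that arise from two noise-level signals.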

Construction and sequencing of suppression subtractive hybridization (SSH) cDNA libraries
SSH libraries were generated using the reagents and protocols provided by Clontech (Clontech, Palo Alto, CA). In one SSH library (H56), the RNA from iDC was used as "tester" and the RNA from monocytes was used as "driver". In the other SSH library (H57), the RNA from iDC was used as "driver" and the RNA from monocytes was used as "tester". In both cases, the starting RNA material was a pool of the RNA samples from the 3 individuals used in the microarray experiment. RT-PCR analysis of the SSH products showed that the level of the housekeeping gene GAPDH decreased more than 1000-fold in both H56 and H57 cDNA when compared with unsubtracted cDNA (data not shown), suggesting that the subtraction procedure was very effective. A total of 10000 clones from each SSH library were sequenced with M13 primers using the ABI BigDye Terminator v2.0 Cycle Sequencing Kit (Applied Biosystems, Foster City, CA) and ABI 3700 DNA Analyzers (Applied Biosystems), according to the manufacturers' protocols and manuals. The SSH cDNA was also used to prepare the cRNA for GeneChip microarrays. In vitro transcription was carried out from the T7 promoter in the PCR primers for SSH. The cRNA generated was used for GeneChip microarray hybridization as described above.