Integrating multiple genome annotation databases improves the interpretation of microarray gene expression data
© Yin et al. 2010
Received: 8 September 2009
Accepted: 20 January 2010
Published: 20 January 2010
Skip to main content
© Yin et al. 2010
Received: 8 September 2009
Accepted: 20 January 2010
Published: 20 January 2010
The Affymetrix GeneChip is a widely used gene expression profiling platform. Since the chips were originally designed, the genome databases and gene definitions have been considerably updated. Thus, more accurate interpretation of microarray data requires parallel updating of the specificity of GeneChip probes. We propose a new probe remapping protocol, using the zebrafish GeneChips as an example, by removing nonspecific probes, and grouping the probes into transcript level probe sets using an integrated zebrafish genome annotation. This genome annotation is based on combining transcript information from multiple databases. This new remapping protocol, especially the new genome annotation, is shown here to be an important factor in improving the interpretation of gene expression microarray data.
Transcript data from the RefSeq, GenBank and Ensembl databases were downloaded from the UCSC genome browser, and integrated to generate a combined zebrafish genome annotation. Affymetrix probes were filtered and remapped according to the new annotation. The influence of transcript collection and gene definition methods was tested using two microarray data sets. Compared to remapping using a single database, this new remapping protocol results in up to 20% more probes being retained in the remapping, leading to approximately 1,000 more genes being detected. The differentially expressed gene lists are consequently increased by up to 30%. We are also able to detect up to three times more alternative splicing events. A small number of the bioinformatics predictions were confirmed using real-time PCR validation.
By combining gene definitions from multiple databases, it is possible to greatly increase the numbers of genes and splice variants that can be detected in microarray gene expression experiments.
Microarrays are widely used to profile gene expression patterns in samples of biological material. Affymetrix GeneChips are a popular oligonucleotide microarray platform, using probe sets formed by 11-20 pairs of 25 mer probes. The probe pairs include a perfect match probe (PM) and a single base mismatch (MM) probe targeting the gene transcripts. These probes were originally selected from the consensus sequence alignments of expressed sequence tag (EST) sequences. Over the past 10 years, the sequences and annotations of the main genomes have changed significantly. One consequence has been that some of the EST sequences were re-annotated or even removed from the databases. Thus, probes targeting these ESTs are no longer accurate . This problem has created a need to remap the probes using information from the most up to date genome sequence databases. Since the GeneChips were introduced, the importance of alternative splicing, especially in vertebrates, has become more and more apparent. The annotation of these transcripts has changed considerably over recent years and this has also increased the importance of using the latest and most comprehensive genome annotation databases to map probes to specific transcripts.
Several probe-remapping protocols have been developed, generally by regrouping the probes to the target genes or transcripts according to the current version of genome annotation [1–5]. A crucial consideration in probe remapping is the annotation database usage. Dai et al. provided several probe remappings, each using a different database, e.g. Unigene, RefSeq and Ensembl . However, this may lead to difficulty and confusion when interpreting microarray results. The problems come from i) genes annotated in one database but not in the other databases; ii) genes having longer transcripts or more transcripts in one database, but shorter or smaller in another. This usually means that the expression level or alternative splicing events of the gene can only be detected using one database but not using the other. The remapping results from Lu et al., Lee et al., and Moll et al. either used the Refseq and Aceview databases, or the Ensembl database [1–3, 5]. The results may be very different if the database is changed.
RefSeq and Ensembl are two widely used genome annotation databases [6, 7]. Though the data are regularly transferred between these databases, incompatible transcripts or genes are discarded during the transfer thus leading to discrepancies. Aceview provides comprehensive genome annotation by integrating data from RefSeq, dbEST and GenBank . However, only five species were annotated in Aceview, which means that Aceview annotation cannot be easily used in other species. ZFIN is a highly accurate, manually corrected zebrafish genome annotation database, integrating RefSeq and GenBank transcripts . Its strict criteria, however, may result in the loss of a certain amount of transcript data. Furthermore, ZFIN does not provide cross-referenced transcript information to Ensembl transcript data. These variations in the genome annotation may lead to difficulties in interpreting gene expression results. Thus, there is a need for a comprehensive and unbiased genome annotation. The UCSC genome browser provides a comprehensive genome annotation for more than 40 species . The data from the UCSC genome browser can be easily accessed and used to provide customized genome annotations.
A well established remapping method is used in AffyProbeMiner and several other protocols [2–5], which regroup the probes into a probe set if they all match the same set of transcripts. This transcript level probe remapping provides the possibility to detect alternatively spliced transcripts. However, it does not provide an appropriate method to measure the levels of alternative splicing events. Mainly due to the recent development of exon arrays, algorithms predicting alternative splicing have been developed (see  for a recent review). The exon array algorithms should be carefully used for 3' gene expression microarray data, however. The oligo-dT based amplification method used in 3' gene expression microarray has a strong position effect rendering a signal bias towards the probes targeting the 3' ends of genes. A normalized intensity based method, such as Splicing Index [12, 13], is more appropriate to avoid this signal bias.
Here we report a new probe remapping protocol and demonstrate its use with the zebrafish genome. It is based on a combined zebrafish genome annotation by integrating transcripts from the Ensembl, RefSeq and GenBank databases using information downloaded from the UCSC genome browser. A transcript level probe remapping is applied by aligning the probes to the genome, removing the nonspecific probes, and grouping the probes according to the set of transcripts they map to. We explore the impact of using different databases for gene and transcript annotations. We also used the Splicing Index [12, 13] as an indicator of alternative splicing events. The advantage of using a comprehensive database in the probe remapping was demonstrated as more genes and more alternative splicing events are detected. Using two different zebrafish gene expression experiments, we show the benefits of using the more comprehensive remapping and confirm the improvement using real-time PCR validation.
To identify probes that have genome location issues, we aligned all 249,752 probes from the Affymetrix Zebrafish Genome Array to the genome of the zebrafish (genome version 7 (Zv7)) using Exonerate . 19,585 of the probes were identified with multiple matches and about 40,000 probes were identified with no match to the current genome. Further details and analysis of probes matching multiple genome locations is provided in Additional File 1. Thus, about 24% of probes were nonspecific for the genome. By removing these problematic probes, we are left with about 190,000 genome specific probes. The number of probes having genome location issues is strikingly high. The reason is mainly because Affymetrix originally designed the probes based on Expressed Sequence Tags (ESTs) from a number of databases e.g. Unigene, GenBank, dbEST . Some of these EST sequences were erroneous and their removal from updated databases results in the loss of the probes. We regard the removal of probes having genome location issues as an acceptable loss of signal in order to avoid erroneous mapping of the probes to unannotated exons or genes.
Number of transcripts and genes from each database and number of alignable genes and transcripts in the UCSC genome browser
No. of transcripts
No. of genes
No. of alignable transcripts
No. of alignable genes
Single data source
Multiple data sources 1
Summary of probe remapping results using different databases
Probes with multiple alignments to the genome
Probes with no alignment to the genome
Probes matching multiple genes
Probes matching intergenic region
Probes matching intron region
< 3 probes per probe set
Percentage of good probes
Probe remapping using the UCSC database allowed the highest number of probes to be retained, mainly because it integrates the GenBank database. The Affymetrix Zebrafish Genome Array was originally designed from GenBank . This shows the necessity of integrating GenBank in the probe remappings, which has been neglected by some probe remapping protocols [1, 16].
Summary of genes and transcripts matched by the probes using different databases
Number of transcripts matched by the probes
Number of genes matched by the probes
Number of genes matched by ≥ 2 probe sets
Average number of transcripts per probe set
Average number of probe sets per gene
Differences in the probe remapping results will affect the interpretation of results from microarray data analysis. This can be seen in lists of differentially expressed genes. For genes to have the same probe annotation using different databases we require the probes targeting this gene to be the same and to be clustered into identical probe sets. Genes have different probe annotations using different databases, either because genes in a subset of databases have more probes in the probe set, or the original probe set is split into separate probe sets to represent alternative splicing transcripts.
Comparison of genes represented by the remapped probe sets using different databases
Comparison of genes represented by the remapped probe sets using different gene definitions
Summary of gene definition and probe remapping results using different gene definition methods
Number of transcripts in the database
Number of genes defined by the method
Probe remapping result
Probes having genome location issues
Probes matching multiple genes
Probes matching no gene
< 3 probes per probe set
Percentage of Good Probes
The differentially expressed genes that were identified using Ensembl were mostly included in the gene list generated using UCSC. This shows that probe remapping using UCSC gives more extensive gene lists when searching for differentially expressed genes. A further benefit of using the multiple source database, UCSC, is the ability to predict more alternative splicing events. Exclusively more genes were identified showing alternative splicing patterns using UCSC than using Ensembl.
Several protocols have been published to improve probe remapping of Affymetrix microarrays. The general protocol is to remove the problematic probes and group the remainder by the gene or transcript that they target. The protocol that we implement here has been optimised to use a combination of features from existing protocols [1, 3–5]. Casneuf et al. reported that expression of nonspecific probes was highly correlated with off-target genes . Thus, we removed the genome-nonspecific probes in order to minimize the off-target probe pairing. This is more stringent than previous published methods. AffyProbeMiner has provided transcript consistent and gene consistent probe sets by clustering probes matching the same set of transcripts and genes . However, probes matching multiple genes should be avoided. Moll et al. aligned the probes to the transcriptome [1, 2]. Yet this may include probes that match unannotated transcripts, or transcripts which are non-alignable to the current genome.
How to group the probes into probe sets is a further concern in probe remapping. The transcript level probe remapping should reveal differences in gene splice isoforms. Dai et al. provided transcript targeted probe sets by grouping probes targeting individual transcripts  but this generated redundant probes in the remapping. The transcript level probe remapping used in this study is more appropriate as it clusters probes when they match the same set of transcripts [2–4]. Moll et al. applied a similar method and validated the splice variants by real time PCR . Lu et al. also reported that this transcript level remapping reduced the platform variance between microarrays . None of them, however, provided any method to measure the expression variation of gene splice isoforms.
Li et al. developed an ANOVA model based method to calculate the variance of the sibling Affymetrix probe sets . This model is widely used with Affymetrix exon microarrays to detect splice isoform variation . However, 3' gene expression microarrays have a strong signal bias towards probes targeting the 3' ends of genes. The probe hybridization efficiency with targeting genes may also affect the signal strength. These signal biases may affect the ANOVA model by giving false positive p-values. The Splicing Index calculates the tissue specific expression by pair-wise comparison of the normalized intensities [12, 13] and therefore, is more appropriate for 3' gene expression microarrays. Thus, in this work, we used the Splicing Index to measure the alternative splicing patterns in 3' gene expression microarrays.
Database usage is a major concern in microarray data analysis. As shown by Dai et al., the difference in probe set content using different databases caused a 30-50% difference in differentially expressed gene lists . Moll et al. compared their remapping results for HG-U133A with the AffyProbeMiner mapping . AffyProbeMiner defined 10,226 probe sets using RefSeq, whereas Moll et al. yielded 7,941 probe sets using Ensembl. Only 3,412 probe sets were identical between the two approaches. Our study reported a similar result. The number of genes having the same probe sets across databases ranged from 3,551 to 5,442 when compared between single source databases and the multiple source database UCSC (Table 4). However, none of the previous protocols provided any appropriate method to reconcile the large differences between the databases. Dai et al. provided downloadable remapping results using each individual database . AffyProbeMiner reported that using RefSeq and GenBank together may improve the mapping of the probes . Unfortunately, the database integration method used by AffyProbeMiner was computationally intensive using BLAT alignments of GenBank transcripts to the genome sequences. Furthermore, they did not integrate transcript information from Ensembl.
Here, we provide a more practical genome annotation method by downloading transcript information from UCSC, and clustering the transcripts by overlapping coding exons (exlink). This protocol was implemented in a Perl script and the genome reannotation can be finished within minutes. This protocol can be applied to any of the more than 40 species deposited in the UCSC genome browser to rebuild the genome annotation . Exlink provides a biologically meaningful annotation, and can easily be applied to all species with published genome sequences. It should be pointed out, however, that the gene definition methods compared in this manuscript, exbd, itbd and exlink, from RefSeq, Aceview and Ensembl respectively, all involve manual correction. Thus it is impossible to fully repeat their work in our study.
The issues described here become even more serious during the analysis of data from next-generation RNA sequencing . In these analyses, 20-25% of the good quality reads with unique matches to the genome cannot be mapped to annotated genes in Ensembl or Eldorado. It suggests our knowledge of the genomes is still limited and much work still needs to be done to improve the genome annotation. The annotation provided in this study is a combination of the information from three databases. This is easily applied and is essential in fully interpreting such large-scale data sets.
We developed an improved probe remapping protocol based on mapping probes to the genome sequence, removing nonspecific probes and grouping the probes into transcript level probe sets. The protocol is based on a combined zebrafish genome annotation by integrating the Ensembl, RefSeq and GenBank databases together. This integrated genome annotation will reduce database variation bias in large scale gene expression studies. The data analysis protocol used in this study improves the interpretation of gene expression data. This approach could easily be applied to other species and gene expression measurement platforms such as exon microarrays or RNA seq.
Zebrafish transcriptome cross reference files were downloaded from Ensembl (Zv7, July 2007), RefSeq (Zv7 Build3, July 2008), GenBank (October 23, 2008), Biomart (Zv7, Ensembl 52 genes) and ZFIN (October 23, 2008) (Table 1). The cross-reference files link the transcript IDs with gene IDs. Zebrafish genome sequences, transcriptome alignment coordinates and coding sequence (CDS) coordinates were downloaded from the UCSC genome browser. Only transcripts with 96% base identity with the genomic sequence were kept. If the transcripts had multiple alignments to the genome, only alignments having a base identity level within 0.1% of the best for RefSeq transcripts, and 0.5% of the best for GenBank transcripts were kept . The Affymetrix Zebrafish Genome Array probe sequences and Chip Description File (CDF) were downloaded from NetAffx (October 23, 2008) .
The remapping protocol was adapted from a probe remapping protocol described by Dai in 2005 . The remapping was performed as follows. The Affymetrix probe sequences were aligned to the zebrafish genome using Exonerate . Only probes that had perfect sequence identity with the genome were used in the study. Probes with no match to the genome or which matched multiple times, were removed, because these probes may match unannotated genes. These two filters ensure the probes will hybridize to a specific location in the genome. An exception is that probes having no match to the genome, but which match transcript sequences, were considered as probes which cross exon boundaries, and were included in this analysis. This was performed by assembling transcript sequences from the GenBank, RefSeq and Ensembl databases from the transcribed regions of the genome. The Affymetrix probe sequences were aligned to the transcript sequences using Exonerate ..
Probes which matched multiple genes were removed because these probes may generate nonspecific signals. This was due to probes mapping to the overlapping untranslated regions (UTRs) of pairs of genes. Probes matched to the intergenic regions or introns of genes were removed. Reverse complementary probes were organized into a different probe set. Because these probes targeted the opposite strand of the transcript, they usually generate a much weaker signal than the probes targeting the positive strand of the transcript (further analysis of this is given in Additional File 1).
The major change from Dai's protocol was that probes were reorganized in transcript-level probe sets by clustering probes matching the same set of transcripts, in order to measure transcript level expression. Apart from the above filters, we also required that each new probe set should include more than 3 probes (see Additional File 1 for further details). The remapped probe set definition was transformed into new probe sequence and CDF packages using the Bioconductor packages, matchprobes  and makecdfenv . R libraries of the probe and CDF packages, and annotation of the remapped probe sets are provided in Additional File 3. The probe remapping protocol is implemented in a Perl script, and provided in Additional File 4.
Exlink, the gene definition proposed by Ensembl was used in the study in order to organize transcripts from multiple databases . Transcripts overlapping in coding exons were clustered in the same gene. Gene annotation using exlink is provided as Additional File 5. Several other gene definitions were also used to demonstrate the impact of gene definition on the probe remapping. Itbd, the gene definition proposed by Aceview , clusters all transcripts which share at least one intron boundary. Exbd, the gene definition proposed by RefSeq, clusters all transcripts sharing both boundaries of at least one exon . Overlap_0, the old gene definition used by Ensembl, clusters all transcripts which overlap in the exon sequences . The gene definition protocols are implemented in Perl scripts, provided in Additional File 4.
Eyes were dissected from 3 and 5 days post fertilization (dpf) zebrafish larvae, and total RNA extracted with Qiashredder columns and the RNeasy Minikit (Qiagen, Hilden, Germany) in an RNase-free environment. RNA was quantified using the Nanodrop ND-1000 (ThermoScientific) and quality was determined using RNA 6000 Pico chips with the Bioanalyzer 2100 (Agilent). Three biological replicates per timepoint with equal amounts of RNA were amplified and labelled using a two-cycle target labelling protocol (Affymetrix) and hybridised with Affymetrix Zebrafish Genome Arrays. The 3 and 5 dpf eyes microarray data set was deposited in GEO with series accession ID of GSE19320. A published microarray data set studying 36 and 52 hours post fertilization (hpf) whole zebrafish embryos was downloaded from GEO, with sample accession IDs from GSM224790 to GSM224796 . All experimental research on animals followed internationally recognized guidelines and approval from the UCD Animal Research Ethics Committee.
The signal intensity of the microarray was normalized and summarized using the Bioconductor package, gcrma . Differentially expressed genes were selected by using Bioconductor package, limma . The eBayes p-value was adjusted by using Benjamini & Hochberg's method . The threshold for differentially expressed genes was set as adjusted p-value < 0.05 and fold change ≥2 or ≤0.5.
Where NIi,1 is the normalized intensity of the probe set i in the first tissue, and NIi,2 in the second tissue.
If the Splicing Index for any probe set of a gene is ≥0.5 or ≤-0.5, this gene is predicted to be alternatively spliced . The probe set expressions are used separately to indicate transcript level expression. If the Splicing Indexes for all probe sets in this gene are below this threshold, the probe set expressions can be averaged to indicate the gene level expression.
To validate the microarray results, real-time PCR was used. Total RNA was extracted from zebrafish eyes as described above. Three biological replicates per time-point with equal amounts of RNA were reverse transcribed to cDNA with random hexamers using the SuperScript III First-Strand Synthesis System (Invitrogen, UK). Negative controls were synthesized using the same reaction without SuperScript III enzyme. Real-time PCR was performed on three biological replicates per timepoint using the ABI 7900HT Sequence Detection System with SYBR Green as the reporter. The initial cycle was 2 minutes at 50°C and 10 minutes at 95°C. Then the samples were cycled at 95°C for 15 seconds and 60°C for 1 minute. 18s rRNA primers were used as control. The primers were designed using Primer3  and synthesised by Eurofins MWG Operon (Germany). Primer sequences are listed in Additional File 2. All primers showed specific amplification in real-time PCR. Water and negative controls were not detected in the real-time PCR (Ct>40). Real-time data were normalized according to 18s rRNA, and standardized to the lowest abundance value. The algorithm was illustrated using Microsoft Excel in Additional File 2.
We would like to thank Conway Transcriptomics Core Facility for technical assistance in real-time PCR, Beata Sapetto-Rebow for assistance with zebrafish facility, Karen Power for help on alternative splicing, and Bronwen Aken for helpful discussion in gene definition methods. We thank the Irish Research Council for Science Engineering and Technology (IRCSET) Graduate Education Programme (GREP), and Science Foundation Ireland (SFI) SFI 04/IN3/B559 (BK), and SFI 06/RFP/BIM052 (BK) for funding support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.