Evaluating annotations of an Agilent expression chip suggests that many features cannot be interpreted
© Gertz et al; licensee BioMed Central Ltd. 2009
Received: 5 August 2009
Accepted: 30 November 2009
Published: 30 November 2009
While attempting to reanalyze published data from Agilent 4 × 44 human expression chips, we found that some of the 60-mer oligonucleotide features could not be interpreted as representing single human genes. For example, some of the oligonucleotides align with the transcripts of more than one gene. We decided to check the annotations for all autosomes and the X chromosome systematically using bioinformatics methods.
Out of 42683 reporters, we found that 25505 (60%) passed all our tests and are considered "fully valid". A further 9964 (23%) reporters did not have a meaningful identifier, mapped to the wrong chromosome, or did not pass basic alignment tests, which prevented us from correlating the expression values of these reporters with a unique annotated human gene. The remaining 7214 (17%) reporters could be associated with either a unique gene or a unique intergenic location, but could not be mapped to a transcript in RefSeq. The 7214 reporters are further partitioned into three different levels of validity.
Expression array studies should evaluate the annotations of reporters and remove those reporters that have suspect annotations. This evaluation can be done systematically and semi-automatically, but data sources are updated frequently, so validation results change slightly over time.
Agilent-014850 Whole Human Genome Microarray 4 × 44K G4112F consists of 43,376 oligonucleotides 60 nucleotides in length, most of which are annotated as corresponding to the sequence of a known or predicted human gene, along with a number of probes that function as controls. Agilent supplies an annotation file for the array. The file provides the sequence of each oligonucleotide, its position on the NCBI human reference assembly, and the gene and transcript putatively associated with it. The annotation file states "This multi-pack (4 × 44K) formatted microarray represents a compiled view of the human genome as it is understood today." From here on, we refer to each oligonucleotide as a "reporter", rather than the more commonly used terms oligo, probe, or 60-mer, to indicate that each oligonucleotide is supposed to "report" the expression of a single gene unambiguously. Some reporters unambiguously distinguish between transcripts of a single gene, but others hybridize with more than one transcript of the same gene.
We sought to validate the annotations against the current information for the human genome. Preliminary analysis showed that a substantial number of reporter annotations are questionable, and that other reporters may not be a specific representation of a unique gene. Moreover, some reporter annotations are incomplete and fail to provide a discernible link between an oligonucleotide and a human RNA. We therefore validated the reporter annotations against the data present in the NCBI Entrez database (http://www.ncbi.nlm.nih.gov). Because numerous reporters were annotated with an identifier from the Gene Index Project but not recognized by Entrez, we used the Gene Index Project as an additional source of transcript data.
Entrez is a collection of related databases of biological data, among which are the Entrez Nucleotide database of biological sequences, the RefSeq database providing reference chromosome assemblies and transcribed sequences for several species, and the Entrez Gene database cataloging known genes. Agilent includes the RefSeq RNA identifier and Entrez Gene identifier as standard fields in the annotation file. The RefSeq identifier is, however, supplied for only approximately two thirds of the oligonucleotides on the chip.
The Entrez databases are frequently updated. As our knowledge of the human genome increases, RNA and gene records are added and edited. Unreliable records are deleted or suppressed. Major changes to the data of a gene merit a change in the identifier of the RefSeq RNAs, or even of the Entrez Gene record. We therefore sought to associate the oligonucleotides on the array with human genes using the BLAST [5–7] and Splign alignment algorithms, and cross references between current database records, rather than relying on the given identifiers. Since Entrez databases have been updated multiple times after the Agilent annotation file was prepared, some discrepancies were expected. In concept, this study is similar to a study performed by Gaj et al., who evaluated the annotations against protein databases, which do not include the increasing number of non-protein coding genes. The method and programs (which are freely available for download; see Methods) used to perform this validation should be applicable to other gene chips.
Reporters divided into the five categories discussed in Results
RefSeq RNA Valid
Other Gene Valid
Results of aligning all eligible reporters to the database of human RefSeq RNAs
Results of placing the reporters that align with the RefSeq RNA transcripts of a single gene on the chromosome
We do not require that the GeneID associated with a reporter during the validation process matches the GeneID supplied by Agilent, nor that the chromosomal positions match exactly. GeneIDs may differ because the Entrez and RefSeq databases are not static, and sometimes updates to records merit changes in an identifier. The positions found by Splign sometimes disagree with the positions supplied by Agilent because a splice site occurs within the alignment, and Splign resolves the splice differently. There are also 87 reporters that are fully valid, but align to more than one position within their identified gene.
Counts of reporters associated with a putative transcript not in RefSeq
Results of placing the reporters that do not align with a RefSeq RNA transcript
Other Gene Valid
We initiated a study to re-evaluate published expression data that were suspicious because they suggested that some mouse genes hybridized better to the Agilent 4 × 44 chip than the orthologous human gene, even though the chip is designed to hybridize to human genes. Preliminary sequence analysis showed that some of the reporters did not align to any known human RNA, and others aligned only to reverse complements. Therefore, we decided to check the annotation file reporter by reporter. To our surprise, approximately 23% of the reporters failed basic tests and could not be assigned any meaning. Thus, the signal intensities of these reporters, which we call "invalid", do not provide useful information as to the expression level of a specific gene. We had expected some reporters, perhaps 5-10%, to be problematic because of updates to information about the human genome. We did not expect that as many as 40% of the reporters on a 44K gene expression array would, for one reason or another, yield uninterpretable data. The difference between this 40% and the 23% that are invalid consists of the 17% of reporters that map uniquely to the human genome but not to a region covered by a RefSeq transcript; for example, this includes reporters that map to an intron.
A search with PubMed and Google found that numerous studies have used Agilent 4 × 44K human expression chips; we cite 10 examples from different research groups [11–20]. Each paper has a small subsection in Methods explaining how the 4 × 44 arrays were hybridized and how the data were analyzed. None of these 10 studies seems to have considered the possibility that the annotations provided do not correspond unambiguously to human genes. The Methods in two studies [12, 13] describe some steps to restrict analysis to those reporters for which the Agilent annotation includes a human gene symbol. Another study used a pre-established list of genes, but does not explain how gene identifiers were matched to the annotation file. The lack of a gene symbol in the annotation file eliminates 6089 of the 9964 reporters we consider invalid, and also eliminates 799 of the 25505 fully valid reporters, 148 of the 1859 RefSeq RNA valid reporters, 1273 of the 2187 other gene valid reporters, and 2470 of the 3168 possibly valid reporters. Of the 31904 reporters with a gene symbol, we consider 24706 (77%) fully valid. Our validation of the annotated transcript identifier invalidates fewer reporters than a test for the presence of a gene symbol, and would not, by itself, be an effective filter. An example of a fully valid reporter that is not annotated with a gene symbol or the identifier of a transcript is reporter ID 27386, which aligns perfectly to transcripts of the gene RPP14 (GeneID:11102).
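The counts quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers stated in the text; the dictionary and function names are ours and are not part of the published software.

```python
# Category sizes reported in the text (reporters on the autosomes and X).
CATEGORY_TOTALS = {
    "fully valid": 25505,
    "RefSeq RNA valid": 1859,
    "other gene valid": 2187,
    "possibly valid": 3168,
    "invalid": 9964,
}

# Reporters in each category whose annotation lacks a gene symbol.
NO_SYMBOL = {
    "fully valid": 799,
    "RefSeq RNA valid": 148,
    "other gene valid": 1273,
    "possibly valid": 2470,
    "invalid": 6089,
}


def symbol_filter_summary(totals, no_symbol):
    """Return (reporters kept by a gene-symbol filter,
    fraction of the kept reporters that are fully valid)."""
    kept = {name: totals[name] - no_symbol[name] for name in totals}
    kept_total = sum(kept.values())
    return kept_total, kept["fully valid"] / kept_total


kept_total, fully_valid_fraction = symbol_filter_summary(CATEGORY_TOTALS, NO_SYMBOL)
print(sum(CATEGORY_TOTALS.values()))       # 42683 reporters considered
print(kept_total)                          # 31904 reporters with a gene symbol
print(round(100 * fully_valid_fraction))   # 77 (percent fully valid)
```

A gene-symbol filter therefore still leaves roughly a quarter of the retained reporters outside the fully valid category.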
Standards for microarray experiments, such as MIAME, are evolving. Generally, these standards seek to ensure that the experiments and data analysis are sufficiently detailed so as to be reproduced at another site. The MIAME paper does recognize the potential for annotation problems, stating, for example, that "Because references to an external gene index may not be stable, it is essential to physically identify each element's composition. Disclosing the nature of the relationship between an array element and its cognate gene's transcript allows informed assessment..." However, there seems to be no easy-to-enforce mechanism to ensure that the annotations that come with an expression array are internally consistent, let alone consistent with changing external genomic databases. So long as researchers follow MIAME and deposit data in detail, it is possible to apply reporter validation post-facto.
It is safe to use the 25505 fully valid reporters.
It is unsafe to use the 9964 invalid reporters.
The 4046 reporters that are either RefSeq RNA valid or other gene valid might be used for gene-based studies.
The 7214 reporters that are in any of the three intermediate categories (RefSeq RNA valid, other gene valid, possibly valid) might be used for position-based studies.
In any study that uses reporters in the three intermediate categories and reports either genome-wide or chromosome-wide results, the analysis should be repeated with the fully valid reporters only to verify that this more reliable subset provides qualitatively identical results.
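The recommendations above can be expressed as a simple selection rule. The sketch below is an illustration only: the enumeration, the set names, and the reporter IDs in the example are hypothetical, and the intermediate categories are included on the "might be used" basis stated above.

```python
from enum import Enum


class Validity(Enum):
    FULLY_VALID = "fully valid"
    REFSEQ_RNA_VALID = "RefSeq RNA valid"
    OTHER_GENE_VALID = "other gene valid"
    POSSIBLY_VALID = "possibly valid"
    INVALID = "invalid"


# Categories that may enter each kind of analysis. Invalid reporters never do.
STRICT = frozenset({Validity.FULLY_VALID})
GENE_BASED = STRICT | {Validity.REFSEQ_RNA_VALID, Validity.OTHER_GENE_VALID}
POSITION_BASED = GENE_BASED | {Validity.POSSIBLY_VALID}


def select_reporters(validity_by_reporter, allowed):
    """Return the sorted reporter IDs whose category is in `allowed`."""
    return sorted(rid for rid, category in validity_by_reporter.items()
                  if category in allowed)


# Hypothetical reporter IDs, for illustration only.
example = {
    "r1": Validity.FULLY_VALID,
    "r2": Validity.REFSEQ_RNA_VALID,
    "r3": Validity.POSSIBLY_VALID,
    "r4": Validity.INVALID,
}

primary = select_reporters(example, POSITION_BASED)  # ['r1', 'r2', 'r3']
control = select_reporters(example, STRICT)          # ['r1']
```

Running the same analysis on `primary` and on `control` corresponds to the repeat analysis recommended for genome-wide or chromosome-wide results.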
The software we developed for this study is available at ftp://ftp.ncbi.nlm.nih.gov/pub/microarray_pipeline. Because some of the data sources change over time, readers may wish to rerun the analysis. Alternatively, readers may wish to modify the software to analyze other microarray annotation files.
We validated reporter annotations using the data current in Entrez on September 29, 2009. On this date, the current reference assembly was build 37.1, and the version number of RefSeq was 37. Each transcript in RefSeq is linked with an Entrez Gene database identifier (GeneID). Thus, given a RefSeq transcript, it is possible to find its GeneID. Conversely, for any human GeneID, one may determine the list of all transcripts for that gene that appear in RefSeq. The Entrez Gene database is the direct successor of the LOCUSLINK database, which is mentioned in the Agilent annotation file. The two databases use the same identifiers.
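The two-way relationship between RefSeq transcripts and GeneIDs can be sketched as a pair of lookups. This is a simplified illustration that assumes the transcript-to-GeneID links have already been downloaded into a dictionary; the accessions and GeneIDs shown are placeholders, not real records.

```python
from collections import defaultdict


def build_gene_index(transcript_to_gene):
    """Invert a transcript -> GeneID mapping into GeneID -> set of transcripts."""
    gene_to_transcripts = defaultdict(set)
    for transcript, gene_id in transcript_to_gene.items():
        gene_to_transcripts[gene_id].add(transcript)
    return dict(gene_to_transcripts)


def unique_gene(aligned_transcripts, transcript_to_gene):
    """Return the single GeneID shared by all aligned transcripts,
    or None if the transcripts belong to zero or several genes."""
    gene_ids = {transcript_to_gene[t] for t in aligned_transcripts
                if t in transcript_to_gene}
    return gene_ids.pop() if len(gene_ids) == 1 else None


# Placeholder identifiers, for illustration only.
transcript_to_gene = {"TX_A.1": 1001, "TX_A.2": 1001, "TX_B.1": 1002}
gene_index = build_gene_index(transcript_to_gene)

print(gene_index[1001] == {"TX_A.1", "TX_A.2"})                   # True
print(unique_gene(["TX_A.1", "TX_A.2"], transcript_to_gene))      # 1001
print(unique_gene(["TX_A.1", "TX_B.1"], transcript_to_gene))      # None
```

A reporter whose alignments all fall on transcripts of one gene yields a GeneID; a reporter that aligns with transcripts of more than one gene yields `None`.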
We retrieved transcript data from the Gene Index Database on May 29, 2009. On this date, the release number for the human genome was 17.0. We performed a batch retrieval using the web interface http://compbio.dfci.harvard.edu/tgi.
Microarray Annotation File
We analyzed the oligonucleotides on the Agilent-014850 Whole Human Genome Microarray 4 × 44K G4112F. Agilent provides an annotation file, putatively associating reporters with expressed human RNAs, and these RNAs with their respective genes. Agilent sometimes releases updated versions of the annotation file. We used a version from late 2007 that was used in a previous study. The latest version of the annotation file, released April 16, 2009, gives nearly identical results for fully valid and RefSeq RNA valid reporters; the only differences arise for a few reporters that have been assigned a different chromosome in the newer annotation file. Using the newest annotation file does not qualitatively change the results for the other gene valid, possibly valid, or invalid reporters.
Of the 43,376 non-control oligonucleotides on the array, 414 had no chromosome labeled, 20 were associated with the mitochondrial genome, 133 were associated with the Y chromosome, and 126 were labeled with the qualifier "random". We did not consider any of these reporters, choosing instead to focus on the 42,683 reporters associated with the autosomes and the X chromosome.
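The exclusion step can be sketched as a filter on the chromosome label. The label spellings used here ("chr1" ... "chr22", "chrX") are an assumption made for illustration; the actual annotation file may encode chromosomes differently.

```python
# Assumed label format; the real annotation file may differ.
AUTOSOMES_AND_X = {f"chr{n}" for n in range(1, 23)} | {"chrX"}


def keep_reporter(chromosome_label):
    """True if the reporter is annotated to an autosome or the X chromosome."""
    return chromosome_label in AUTOSOMES_AND_X


# Counts of excluded reporters, as given in the text.
excluded = {
    "no chromosome": 414,
    "mitochondrial": 20,
    "Y chromosome": 133,
    "random": 126,
}
remaining = 43376 - sum(excluded.values())
print(remaining)  # 42683
```

The arithmetic confirms that the four excluded groups account for the difference between 43,376 oligonucleotides and the 42,683 reporters analyzed.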
Aligning Nucleotide Sequences
Except where otherwise stated, we aligned nucleotide sequences using NCBI's MegaBLAST program, version 2.2.21, with default settings. MegaBLAST is optimized for nearly identical sequences. By default, MegaBLAST filters the query sequence to exclude low-complexity regions. Thus, it may find no alignments for some reporters that contain repetitive DNA. The use of this option is appropriate, as we have low confidence that repetitive DNA may be used to identify a unique gene.
MegaBLAST has an option "-S" that causes it to search a database of nucleotide sequences using the query only in the forward sense, or only in the reverse complement sense. We use this option as appropriate without further comment.
Aligning Reporters with RefSeq Human Transcripts
We say that an oligonucleotide aligns with a human RNA transcript if MegaBLAST with default parameters reports such an alignment. By default, MegaBLAST finds alignments between two sequences only if, after the query has been filtered to mask low-complexity sequences, the sequences still have a gapless alignment of length at least 28, sometimes relaxed to length 16. The alignment ultimately reported by MegaBLAST may be a higher-scoring alignment with gaps.
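The seeding condition can be illustrated with a deliberately simplified check for a shared exact word. This sketch is not MegaBLAST's implementation: it treats the gapless seed as an identical substring, ignores low-complexity filtering and gapped extension, and handles strand by an explicit reverse complement. The function names and the short example sequences are ours.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def shares_exact_word(query, subject, word_size=28):
    """True if query and subject share an identical substring of length word_size."""
    if len(query) < word_size or len(subject) < word_size:
        return False
    subject_words = {subject[i:i + word_size]
                     for i in range(len(subject) - word_size + 1)}
    return any(query[i:i + word_size] in subject_words
               for i in range(len(query) - word_size + 1))


# Toy sequences with a small word size so the result is easy to verify by eye.
oligo = "AAAACAAAGAAA"
transcript = "GG" + oligo + "GG"

print(shares_exact_word(oligo, transcript, word_size=8))                      # True
print(shares_exact_word(reverse_complement(oligo), transcript, word_size=8))  # False
```

The second call shows why strand matters: an oligonucleotide that matches a transcript in the forward sense need not match it as a reverse complement, and vice versa.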
Comparing Reporters to their Annotated Sequence
We collected the set of reporters that were neither validated nor eliminated by alignment to RefSeq human gene transcripts. For each of these reporters, we searched field 15 of the annotation file for a sequence identifier. We preferred, in order, a RefSeq identifier, a GenBank accession number, or an identifier from the Gene Index Project. GenBank accession numbers can be used to retrieve records from Entrez Nucleotide. The type of identifier was determined by the format. RefSeq identifiers start with the string "ref"; GenBank identifiers start with "gb"; and identifiers from the Gene Index Project start with "thc". We retrieved the sequences putatively associated with each reporter by the annotation file from either Entrez Nucleotide or from the Gene Index Project. At this stage, we eliminated reporters for which no transcript identifier could be found in the annotation file, or for which the identified sequence could not be found in either database. If the sequence was annotated within Entrez Nucleotide as replaced or suppressed, we excluded the corresponding reporter from further consideration.
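The preference order for identifiers can be sketched as follows. The sketch assumes that field 15 holds pipe-delimited pairs such as `ref|ACCESSION|gb|ACCESSION`; that layout, the function name, and every accession in the example other than NM_182578 are assumptions made for illustration.

```python
# Preference order stated in the text: RefSeq, then GenBank, then Gene Index.
PREFERENCE = ("ref", "gb", "thc")


def preferred_identifier(field):
    """Return (kind, accession) for the most preferred identifier in the
    field, or None if no RefSeq, GenBank, or Gene Index identifier is present.

    Assumes pipe-delimited kind|accession pairs (an assumed layout)."""
    tokens = field.split("|")
    pairs = {}
    for kind, accession in zip(tokens[0::2], tokens[1::2]):
        # Keep the first accession seen for each kind.
        pairs.setdefault(kind.strip().lower(), accession.strip())
    for kind in PREFERENCE:
        if pairs.get(kind):
            return kind, pairs[kind]
    return None


# The gb and thc accessions below are made-up placeholders.
print(preferred_identifier("gb|AK000000|ref|NM_182578|thc|THC0000000"))
print(preferred_identifier("thc|THC0000000"))
print(preferred_identifier(""))
```

A reporter for which the function returns `None` corresponds to the case in which no usable transcript identifier is found and the reporter is eliminated.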
We used MegaBLAST to align each candidate reporter with its annotated sequence. MegaBLAST's parameters were relaxed to use a word size of 16 and to omit low-complexity filtering, but an alignment was still required to have a score of at least 100 to be considered a significant match. Those reporters that had no significant match to their annotated sequence were eliminated from further consideration. For example, reporter 5467 is supposed to match NM_182578; the oligo does align to an obsolete version NM_182578.1, but not to the current version NM_182578.2. Therefore, 5467 was excluded.
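The score threshold and the version check can be sketched as a filter over alignment hits. The record layout, the function name, and the numeric scores below are hypothetical; only the threshold of 100 and the accessions for reporter 5467 come from the text, and the sketch assumes that hits have already been parsed from the alignment output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    reporter_id: str
    subject_accession: str  # versioned accession of the aligned sequence
    score: float


MIN_SCORE = 100  # threshold stated in the text


def reporters_matching_annotation(hits, current_annotation):
    """Return the reporters with a significant hit to the current version
    of their annotated sequence.

    current_annotation maps reporter_id -> current versioned accession."""
    matched = set()
    for hit in hits:
        if (hit.score >= MIN_SCORE
                and current_annotation.get(hit.reporter_id) == hit.subject_accession):
            matched.add(hit.reporter_id)
    return matched


# Scores are invented for illustration; "r_ok" and "TX_OK.1" are placeholders.
hits = [
    Hit("5467", "NM_182578.1", 110.0),  # aligns only to the obsolete version
    Hit("r_ok", "TX_OK.1", 110.0),
    Hit("r_weak", "TX_WEAK.1", 40.0),   # below the threshold
]
current_annotation = {"5467": "NM_182578.2", "r_ok": "TX_OK.1", "r_weak": "TX_WEAK.1"}

print(reporters_matching_annotation(hits, current_annotation))  # {'r_ok'}
```

Reporter 5467 is dropped because its only hit is to the superseded sequence version, which mirrors the example given above.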
Validating Reporters by Placement on the Reference Assembly
Appendix - Oligonucleotide Sequences
This research was supported by the Intramural Research Program of the NIH, NLM (EMG, AAS) and by the Intramural Research Program of the NIH, NCI (KS, MJD, TR).
- Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Mizrachi I, Ostell J, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Yaschenko E, Ye J: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2009, 37: D5-D15. 10.1093/nar/gkn741.
- Quackenbush J, Liang F, Holt I, Pertea G, Upton J: The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res. 2000, 28: 141-145. 10.1093/nar/28.1.141.
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35: D61-65. 10.1093/nar/gkl842.
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2007, 35 (Database issue): D26-31. 10.1093/nar/gkl993.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7: 203-214. 10.1089/10665270050081478.
- Kapustin Y, Souvorov A, Tatusova T, Lipman DJ: Splign: algorithms for computing spliced alignments with identification of paralogs. Biology Direct. 2008, 3: 20. 10.1186/1745-6150-3-20.
- Gaj S, van Erk A, van Haaften RIM, Evelo CTA: Linking microarray reporters with protein functions. BMC Bioinformatics. 2007, 8: 360. 10.1186/1471-2105-8-360.
- Sengupta K, Camps J, Mathews P, Barenboim-Stapleton L, Nguyen QT, Difilippantonio MJ, Ried T: Position of human chromosomes is conserved in mouse nuclei indicating a species-independent mechanism for maintaining genome organization. Chromosoma. 2008, 117: 499-509. 10.1007/s00412-008-0171-7.
- Kalie E, Jaitin DA, Podoplelova Y, Piehler J, Schreiber G: The stability of the ternary interferon-receptor complex rather than the affinity to the individual subunits dictates differential biological activities. J Biol Chem. 2008, 283: 32925-32936. 10.1074/jbc.M806019200.
- Bhatia B, Jiang M, Suranemi M, Patrawala L, Badeaux M, Schneider-Broussard R, Multani AS, Jeter CR, Calhoun-Davis T, Hu L, Hu J, Tsavachidis S, Zhang W, Chang S, Hayward SW, Tang DG: Critical and distinct roles of p16 and telomerase in regulating the proliferative life span of normal human prostate epithelial progenitor cells. J Biol Chem. 2008, 283: 27957-27972. 10.1074/jbc.M803467200.
- Mattsson JM, Laakkonen P, Kilpinen S, Stenman U-H, Koistinen H: Gene expression changes associated with the anti-angiogenic activity of kallikrein-related peptidase (KLK3) on human umbilical vein endothelial cells. Biol Chem. 2008, 389: 765-771. 10.1515/BC.2008.088.
- Cao F, Wagner RA, Wilson KD, Xie X, Fu J-D, Drukker M, Lee A, Li RA, Gambhir SS, Weissman IL, Robbins RC, Wu JC: Transcriptional and functional profiling of human embryonic stem cell-derived cardiomyocytes. PLoS One. 2008, 3: e3474. 10.1371/journal.pone.0003474.
- Verstraelen S, Nelissen I, Hooyberghs J, Witters H, Schoeters G, Van Cauwenberge P, Heuvel Van Den R: Gene profiles of a human alveolar epithelial cell line after in vitro exposure to respiratory (non-)sensitizing chemicals: Identification of discriminating genetic markers and pathway analysis. Toxicol Lett. 2009, 185: 16-22. 10.1016/j.toxlet.2008.11.017.
- Hao Y, Chun A, Cheung K, Rashidi B, Yang X: Tumor suppressor LATS1 is a negative regulator of oncogene YAP. J Biol Chem. 2008, 283: 5496-5509. 10.1074/jbc.M709037200.
- Tiwari VK, Cope L, McGarvey KM, Ohm JE, Baylin SB: A novel 6C assay uncovers Polycomb-mediated higher order chromatin conformations. Genome Res. 2008, 18: 1171-1179. 10.1101/gr.073452.107.
- Galliher-Beckley AJ, Williams JG, Collins JB, Cidlowski JA: GSK-3β-mediated serine phosphorylation of the human glucocorticoid receptor re-directs gene expression profiles. Mol Cell Biol. 2008, 28: 7309-7322. 10.1128/MCB.00808-08.
- Vuillaume M-L, Uhrhammer N, Vidal V, Vidal VS, Chabaud V, Jesson B, Kwiatkowski F, Bignon Y-J: Use of gene expression profiles of peripheral blood lymphocytes to distinguish BRCA1 Mutation carriers in high risk breast cancer families. Cancer Inform. 2009, 7: 41-56.
- Konishi K, Gibson KF, Lindell KO, Richards TJ, Zhang Y, Dhir R, Bisceglia M, Gilbert S, Yousem SA, Song JW, Kim DS, Kaminski N: Gene expression profiles of acute exacerbations of idiopathic pulmonary fibrosis. Am J Respir Crit Care Med. 2009, 180 (2): 167-75. 10.1164/rccm.200810-1596OC.
- Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FCP, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M: Minimum information about a micro-array experiment (MIAME)--towards standards for microarray data. Nat Genet. 2001, 29: 365-371. 10.1038/ng1201-365.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2009, 37: D26-D31. 10.1093/nar/gkn723.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.