Gene definition
The Entrez Gene database at NCBI was used as the reference to assign consistent gene annotations across platforms. The Entrez Gene database groups GenBank and RefSeq accession numbers and gene symbols by unique, non-redundant and traceable gene ids [8]. These ids can also be cross-referenced to ENSEMBL transcripts and genes, and to any id that can be related to a GenBank accession number. Tables associating gene ids to GenBank and RefSeq sequences were obtained from the NCBI ftp server (files downloaded on December 15th 2005). A third table associating UCSC Known Genes to ENSEMBL transcripts was obtained from the UCSC ftp server (the files used and their source URLs are listed in Table 1). A single table called genexref was created in the ArrayGene MySQL database, with three columns: gene id (from Entrez), sequence id, and type of sequence (i.e. Ensembl, genomic, mRNA, protein, RefSeq, symbol, synonym, NIA transcript, or Unigene). This table is the core of the gene annotation process of probes in microarrays. Only current Entrez Gene ids were used. The file gene_history from NCBI was used to delete obsolete gene ids from the database.
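The structure of the genexref table and the removal of obsolete gene ids can be illustrated with a minimal sketch. It is not the ArrayGene code: it uses Python with an in-memory SQLite database as a stand-in for MySQL, and the column names, function names, gene ids and accession numbers are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical example rows: (gene id, sequence id, type of sequence).
# All ids below are invented for illustration, not real Entrez Gene records.
EXAMPLE_ROWS = [
    (101, "NM_000001", "RefSeq"),
    (101, "AK000001", "mRNA"),
    (202, "NM_000002", "RefSeq"),
    (999, "NM_000009", "RefSeq"),  # gene 999 plays the role of an obsolete id
]


def build_genexref(rows, obsolete_gene_ids):
    """Create a three-column genexref table and drop obsolete gene ids.

    obsolete_gene_ids stands in for the ids listed in a gene_history file.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genexref ("
        "gene_id INTEGER NOT NULL, "
        "sequence_id TEXT NOT NULL, "
        "seq_type TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO genexref VALUES (?, ?, ?)", rows)
    # Keep only current gene ids by deleting every obsolete one.
    conn.executemany(
        "DELETE FROM genexref WHERE gene_id = ?",
        [(gene_id,) for gene_id in obsolete_gene_ids],
    )
    conn.commit()
    return conn


def genes_for_sequence(conn, sequence_id):
    """Return the sorted list of gene ids linked to one sequence id."""
    cur = conn.execute(
        "SELECT DISTINCT gene_id FROM genexref "
        "WHERE sequence_id = ? ORDER BY gene_id",
        (sequence_id,),
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    conn = build_genexref(EXAMPLE_ROWS, obsolete_gene_ids={999})
    print(genes_for_sequence(conn, "AK000001"))
```

In this sketch a lookup by any sequence id returns the gene ids it is linked to, which is the operation the probe annotation step relies on.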
Genomic location for genes
The physical location of genes was defined as the position where any sequence associated with the gene could be aligned with confidence by sequence comparison. Genomic alignment results were obtained from the UCSC Genome Browser Database [14] for the August 2005 mouse genome assembly (Build 35.1). Text files from this server contain the results from aligning all mouse mRNA sequences deposited at GenBank against the mouse genome sequence using BLAT. All sequence alignments in the UCSC database have at least 98% identity. The track of "Known Genes" in the UCSC Genome Browser provides genomic coordinates only for mRNA that could be associated with a protein in SWISS-PROT, TrEMBL, or TrEMBL-NEW. Similarly, the track called RefSeq Gene contains codon and intron positions for RefSeq sequences. UCSC tracks were used in a hierarchical order: 1) Known Genes, 2) RefSeq Genes, and 3) mRNA sequences. Genes mapping to unordered scaffolds or to multiple positions from BLAT alignments were not considered. A table labeled genemap was created in MySQL to store the best, if any, mapping information for every gene following the above criteria. The table stores the genomic coordinates from a single sequence per gene. As a result of importing these tracks in hierarchical order, coordinates from Known Genes are preferred over those from RefSeq Genes, which in turn are preferred over those from mRNA sequences. The genemap table is updated with every new release of the mouse alignment from NCBI. The table is stored in an alignment-specific database herein called Aligndb. The actual name of this database is provided by the user when a new genome alignment is available, and the current database is called mm7, in reference to the name given by the UCSC Genome Browser to the annotated Build 35.1 assembly. The proportional contribution from each UCSC track to the genemap table is shown in Table 2.
This table shows the number of genes in each track that could not be used because they mapped to multiple positions in the genome or to unordered scaffolds. The last column of the table shows the number of genes for each track that could be mapped to a unique known position in the genome and that were not already present in the database (i.e. were not included in any track previously imported).
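The hierarchical import described above can be sketched as follows. This is an illustrative Python outline, not the ArrayGene implementation. It makes several assumptions that are not stated in the text: each track is reduced to (gene id, chromosome, start, end) records; a gene with more than one distinct position within a track counts as mapping to multiple positions; unordered scaffolds are recognized by the "_random" and "chrUn" naming convention; and a gene rejected in one track may still be mapped from a lower-priority track. The track labels, function names and example records are hypothetical.

```python
def is_unordered(chrom):
    """Assumed naming convention for unordered scaffolds."""
    return chrom.endswith("_random") or chrom.startswith("chrUn")


def build_genemap(tracks):
    """Import tracks in priority order; the first usable mapping per gene wins.

    tracks: list of (track_name, records), highest priority first, where
    records are (gene_id, chrom, start, end) tuples.
    Returns (genemap, stats). stats mirrors the two counts described for
    Table 2: genes rejected per track and genes newly added per track.
    """
    genemap = {}
    stats = {}
    for track_name, records in tracks:
        # Collect the distinct positions reported for each gene in this track.
        positions_by_gene = {}
        for gene_id, chrom, start, end in records:
            positions_by_gene.setdefault(gene_id, set()).add((chrom, start, end))

        rejected = 0
        added = 0
        for gene_id, positions in positions_by_gene.items():
            ambiguous = len(positions) > 1 or any(
                is_unordered(chrom) for chrom, _, _ in positions
            )
            if ambiguous:
                rejected += 1
                continue
            if gene_id in genemap:
                # Already mapped from a higher-priority track.
                continue
            ((chrom, start, end),) = positions
            genemap[gene_id] = (chrom, start, end, track_name)
            added += 1
        stats[track_name] = {"rejected": rejected, "added": added}
    return genemap, stats


# Hypothetical example records, invented for illustration only.
EXAMPLE_TRACKS = [
    ("knownGene", [(1, "chr1", 100, 200), (2, "chr2", 10, 50), (2, "chr5", 10, 50)]),
    ("refGene", [(1, "chr1", 90, 210), (2, "chr2", 10, 50), (3, "chr3_random", 5, 9)]),
    ("mrna", [(3, "chr3", 5, 9), (4, "chrX", 1, 2)]),
]

if __name__ == "__main__":
    genemap, stats = build_genemap(EXAMPLE_TRACKS)
    for gene_id in sorted(genemap):
        print(gene_id, genemap[gene_id])
    print(stats)
```

In the example, gene 1 keeps its Known Genes coordinates even though a RefSeq alignment exists, while gene 2, ambiguous in Known Genes, is filled in from the RefSeq track under the assumption stated above.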
Database architecture and maintenance
Two MySQL databases were created and are maintained separately. The ArrayGene database contains the genexref table, which provides cross references between sequence identifiers and Entrez Genes, and the arrays_table table, which stores all the information about microarray annotations (Figure 1). A second database is created every time a new mouse genome build is released. This database contains the genemap table, which provides mapping positions for genes probed in microarrays. The ArrayGene package was created to build and maintain both databases using command line and web forms and is included as [Additional file 4] in this report. Future versions of the software will be available from the authors' website [24]. This system is a generic tool for the comparison of gene coverage, with a user interface to generate reports. It can be used for any species that has a sequenced genome and a database of genes associated with the sequences used to produce microarray probes. The input files for both gene information and gene mapping are format independent, and the programs can be customized through command line options to use any data file that is in tabular form. The software is distributed along with a user manual under the General Public License V.2 [25] and is freely available to the research community. No genomic or microarray data are included in the package, and the user must follow the instructions included with the software to install and populate the database. The software depends on MySQL v. 3.23.50 or later, Perl (only tested on v. 5.8.2) and on some Perl modules described in the installation instructions of the User Guide included as [Additional file 3].
Gene annotation of microarray probes
The import_array Perl script is designed to extract the probe annotations provided by vendors for their platforms in files commonly called gene lists. The script can parse a text file, extract probe and sequence ids (even from fasta-style description lines), connect to the ArrayGene database and find the Entrez Gene id for each probe. Finally, import_array can either write an output file with genomic annotations for probes or create a table in the database with this information. When multiple sequences are associated with a given probe in the gene list, the Perl script checks whether they all match the same gene. The program can look for sequence ids in up to three columns, and it detects inconsistencies between them if they point to different Entrez Gene ids in the database. It also detects single sequences that are associated with more than one gene in the Entrez Gene database. In either case, since a unique gene cannot be associated with the probe, it is annotated as a cross-hybridizing probe. The ids of all genes associated with that probe are stored but are not used in genome coverage comparisons. Reports about gene coverage in annotated microarrays are generated by a series of CGI scripts that provide an intuitive web interface to the database.
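The decision rule for a single probe can be summarized in a short sketch. It is written in Python for illustration and does not reproduce import_array, which is a Perl script that queries the ArrayGene database. Here the database lookup is replaced by a plain dictionary from sequence id to gene ids, and the function name, the status labels (including the "unannotated" case for probes with no match) and the example ids are assumptions introduced for this example.

```python
MAX_ID_COLUMNS = 3  # sequence ids are read from at most three columns


def annotate_probe(sequence_ids, xref):
    """Classify one probe from the sequence ids listed for it.

    sequence_ids: ids taken from the gene list columns for this probe.
    xref: dict mapping a sequence id to the set of gene ids linked to it
          (a stand-in for a lookup in the genexref table).
    """
    gene_ids = set()
    for sequence_id in sequence_ids[:MAX_ID_COLUMNS]:
        gene_ids |= xref.get(sequence_id, set())

    if not gene_ids:
        status = "unannotated"  # assumed label: no sequence id matched a gene
    elif len(gene_ids) == 1:
        status = "unique"
    else:
        # Either the columns disagree, or one sequence maps to several genes.
        status = "cross_hybridizing"
    # All gene ids are kept, even when the probe is not used for coverage.
    return {"status": status, "gene_ids": sorted(gene_ids)}


# Hypothetical cross-reference data, invented for illustration only.
EXAMPLE_XREF = {
    "NM_000001": {101},
    "AK000001": {101},
    "NM_000002": {202},
    "BC000003": {303, 404},  # one sequence linked to two genes
}

if __name__ == "__main__":
    print(annotate_probe(["NM_000001", "AK000001"], EXAMPLE_XREF))
    print(annotate_probe(["NM_000001", "NM_000002"], EXAMPLE_XREF))
    print(annotate_probe(["BC000003"], EXAMPLE_XREF))
```

Taking the union of gene ids over all columns covers both situations described above with a single test: a probe is unique only when every matched sequence id points to the same gene.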
Microarray platforms and oligonucleotide sets
This study compared the gene coverage of mouse whole genome microarray platforms that are currently available to investigators. One older oligoset was also included to evaluate coverage improvements with time of release ([Additional file 1] in supplementary material provides time of release for some platforms, and the URL and filename for the gene lists used here). We compared four commercial one-color arrays: Affymetrix [3] Mouse Genome 430 2.0 Array, Amersham [26] Codelink Mouse Whole Genome, Illumina [27] Sentrix Mouse-6 Expression BeadChip, and Applied Biosystems [28] Mouse Genome Survey. We also compared the two-color Agilent [29] Mouse Oligo Microarray Kit; the commercial oligoset Operon [30] Array-Ready Oligo Set V. 4 and one previous version (Operon V. 3); the Sigma-Genosys Mouse Oligonucleotide library (available through Lab on Web [31]); and the Mouse Exonic Evidence Based Oligonucleotide (MEEBO) set [32] produced by a group of investigators at UCSF, Stanford, Rockefeller, Basel, and the Stowers Institute. Table 3 lists all these platforms, indicating the full and short names used throughout this paper. Special mention should be made of the ABI microarray, which contains almost 4,000 proprietary sequences that are not in the public domain. The probes in this platform were designed based on the Celera Mouse Genome Alignment (Celera, Rockville, MD), which contains gene annotations based on proprietary methods. The present study compared gene coverage by using sequence annotations equivalent to public accession numbers, which were available for 29,195 probes. However, these probes may target genes without an exact counterpart in the public domain, given the large methodological differences in how a gene is defined between the Celera and the public domain approaches.
Gene lists for every platform included in this study, except for ABI, were downloaded from the vendors' websites. ABI's gene list was obtained directly from the vendor.