Using a pipeline for high-throughput sequencing and bioinformatics, we generated ~72,000 reads of sequences expressed in the zebra finch brain. We incorporated these and 27,000 more reads from other projects in the community into a single master database available to the public via the internet. Our approach has been iterative, progressing through three sequential builds, each one incorporating additional data. We used our evolving sequence assemblies in the design of two different microarrays for gene expression studies, and we have deployed these resources in a large set of collaborations that should continue to bear fruit in the future.
The 86,784 filtered high-quality sequences in the current (third) build of ESTIMA:Songbird describe approximately 31,658 non-redundant transcript segments. Although this is a dramatic increase in the amount of sequence information available (there were only 72 records for songbirds in Genbank when our project began), this collection is still only a partial representation of the sequence diversity in zebra finch brain. Most of the sequences are represented by only a single read, and even after three library subtractions, one of every three new sequences generated was still novel. An early analysis of RNA complexity in the canary forebrain suggested a population of as many as 100,000 different transcripts, with total nucleotide diversity approaching the size of the genome itself. This complexity seemed implausible when that analysis was first done, but recent studies in species from flies to humans [39–41] have now demonstrated that transcription is much more widespread in the genome than previously anticipated.
Some of the 31,658 non-redundant sequences undoubtedly represent discontinuous portions of common transcripts, including alternative splicing products as well as non-overlapping ESTs. Indeed, using the chicken IPI annotation, we detected significant matches for 13,219 sequences, but only 8,127 (61%) of these IPI identifiers were unique. Presumably this indicates that a substantial fraction of the 31,658 "unique" sequences in the SB3 assembly are in fact derived from a smaller set of genes; by extrapolation, this suggests an upper bound of 19,311 (31,658 × 61%) unique transcription units represented by the SB3 assembly. Hence we have no reason at this juncture to expect that the complement of protein coding genes is any larger for the zebra finch than it is for the chicken, i.e., 20,000–23,000.
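The extrapolation above can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the counts quoted in the text; the variable names are illustrative, and the rounding of the unique fraction to 61% before multiplying is assumed from the figures given.

```python
# Counts quoted in the text.
total_unique_sequences = 31_658   # non-redundant sequences in the SB3 assembly
sequences_with_ipi_hit = 13_219   # sequences with a significant chicken IPI match
distinct_ipi_ids = 8_127          # distinct IPI identifiers among those matches

# Fraction of matched sequences that correspond to a distinct IPI identifier.
unique_fraction = distinct_ipi_ids / sequences_with_ipi_hit

# Assumption: the fraction is rounded to 61% before extrapolating,
# as the quoted product (31,658 x 61%) implies.
rounded_fraction = round(unique_fraction, 2)

# Extrapolated upper bound on unique transcription units (truncated).
upper_bound = int(total_unique_sequences * rounded_fraction)

print(f"unique fraction: {unique_fraction:.3f}")
print(f"upper bound: {upper_bound}")  # 19311
```

The estimate is an upper bound only under the assumption that sequences without an IPI match are as redundant as those with one.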
Approximately 23% of ESTIMA:Songbird 3 sequences do not align significantly against the current build of the chicken genome (WASHUC2, May 2006). Some of these could represent orthologs of sequences that are present in the chicken genome but not yet incorporated into the chicken genome build. The current chicken genome assembly is believed to lack approximately 8% of the genes in the full genome complement, including sequences on the female-specific W chromosome in particular. Our cDNA libraries were prepared from a mix of both sexes.
As is true for all gene indices, annotation of function is a major ongoing challenge. Very few genes in songbirds have received any sort of direct experimental analysis; thus we must rely on inferences based on similarities to sequences in other species. The closest species with significant genome annotation is the chicken. However, annotation of the chicken genome itself has relied heavily on putative orthologies to mammalian sequences. During our three sequential builds we probed a number of databases to assess their utility for generating putative functional annotations according to sequence similarities. As of Build 3, we focused on the chicken International Protein Index as a primary annotation reference. This is a relatively conservative reference and reflects alignments only with known or predicted protein coding sequences. Only 42% of our Build 3 assembled sequences align against the IPI index, and these represent only a third of the total number of sequence identifiers in the complete chicken IPI index. We anticipate that many of the currently unannotated Build 3 sequences represent non-coding segments of IPI-annotated genes. Identification of these relationships should follow from the annotation of the zebra finch genome assembly, expected in 2008.
Our database probably also contains some non-coding RNAs. A preliminary BLAST analysis of the 86,784 ESTs in our current database against miRBASE, a compendium of ~5000 known microRNA sequences, revealed significant alignments for 789 zebra finch ESTs against 489 mature microRNA sequences (data not shown). Additional work will be required to establish whether these sequences are indeed processed into mature microRNAs in the zebra finch.
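Counts of this kind (distinct ESTs with a hit, distinct microRNA sequences hit) can be tallied directly from BLAST tabular output. The sketch below assumes the standard 12-column tabular format, with query ID in column 1, subject ID in column 2 and e-value in column 11. The sample rows, the identifiers, the `summarize_hits` helper and the e-value cutoff are all hypothetical; the text does not state the significance threshold that was actually used.

```python
import csv
import io

# Hypothetical BLAST tabular output (standard 12 columns).
# Identifiers and scores are invented for illustration only.
SAMPLE_HITS = """\
EST_0001\tmir-A\t100.0\t22\t0\t0\t101\t122\t1\t22\t1e-05\t44.1
EST_0001\tmir-B\t95.5\t22\t1\t0\t101\t122\t1\t22\t2e-03\t36.2
EST_0002\tmir-A\t100.0\t21\t0\t0\t55\t75\t1\t21\t4e-05\t42.1
EST_0003\tmir-C\t90.9\t22\t2\t0\t10\t31\t1\t22\t5e-01\t28.3
"""


def summarize_hits(handle, max_evalue=1e-3):
    """Count distinct queries and distinct subjects among hits whose
    e-value is at or below max_evalue (an assumed cutoff)."""
    queries, subjects = set(), set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query_id, subject_id, evalue = row[0], row[1], float(row[10])
        if evalue <= max_evalue:
            queries.add(query_id)
            subjects.add(subject_id)
    return len(queries), len(subjects)


n_ests, n_mirnas = summarize_hits(io.StringIO(SAMPLE_HITS))
print(n_ests, n_mirnas)  # 2 1
```

Because several ESTs can hit the same microRNA and one EST can hit several microRNAs, the two counts are tallied as separate sets rather than as hit pairs.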
To stimulate broad application of genomic approaches to songbird research, we generated a cDNA microarray from our first EST build and organized a Community Collaboration system for design and execution of experiments using this array. Part of our group also developed an Affymetrix oligonucleotide array based on our second EST build. To validate these tools and to refine the general methods for the Community Collaborations, we did several analyses. These included assessment of optimal methods for RNA purification, amplification and labeling from dissected brain samples; MA plots of hybridization data; and methods for microarray statistical analysis. Some of these studies are summarized in our Call for Community Collaborations (Additional File 2). Validation of the Affymetrix array has been done by the group at Lund University and is not included in this publication. In our main report here, we described several analyses that may have more general implications for transcriptome analysis in songbirds.
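For readers unfamiliar with MA plots, the quantities plotted for a two-channel array are conventionally M, the log2 ratio of the two channel intensities, and A, their mean log2 intensity. The following is a minimal sketch of that standard definition; the function name and the "red"/"green" channel labels are illustrative assumptions and do not describe the specific processing used for the arrays in this study.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from two positive channel intensities.

    M is the log2 ratio of the channels; A is the mean log2 intensity.
    """
    log_red, log_green = math.log2(red), math.log2(green)
    m = log_red - log_green
    a = 0.5 * (log_red + log_green)
    return m, a


# Example with invented intensities: a four-fold difference between channels.
m, a = ma_values(1024.0, 256.0)
print(m, a)  # 2.0 9.0
```

Plotting M against A for every spot makes intensity-dependent dye bias visible as a systematic departure of the point cloud from M = 0.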
The SoNG 20 K cDNA microarray includes 5 different probes for transcripts derived from the zenk gene, including one from the canary sequence. All probes gave signals well above background with labeled cDNA from the dissected auditory lobules of individual birds hearing either song or silence, and all reported an increase in signal in the song-stimulated birds relative to the silence controls. The near-full-length positive control probes from zebra finch and canary gave equivalent signal intensities and fold-change measurements. In the direct-comparison design used for this experiment, a group size of 6 birds per condition hybridized on 6 arrays was sufficient to reach fairly robust statistical significance for three of the single-spotted probes. The fourth (SB02047B2F12.f1, the 5'-most EST) would not have passed the threshold for identification using even generous criteria (e.g., raw p < 0.05, FDR < 20%). This EST reported a mean intensity that was almost three-fold higher than for the zebra finch and canary positive control probes. We suspect this may indicate cross-hybridization to some other sequences for this particular probe. Alternatively, the variations in mean intensities and fold-changes for the five different probes could indicate differential transcription or RNA processing across the transcript, phenomena that are increasingly observed in high-resolution analyses of transcription units. It is also worth noting that in this preliminary experiment we detected changes in 220 other probes at the significance threshold of the single-spotted zebra finch cDNA positive control (FDR 16%). Analysis of song-regulated sequences is a topic of several of the proposed Community Collaborations and will be described in detail elsewhere.
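A combined criterion such as "raw p < 0.05, FDR < 20%" can be illustrated as follows. The text does not say which FDR procedure was used; this sketch assumes the Benjamini-Hochberg step-up adjustment, and the p-values and the `benjamini_hochberg` helper are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is
    # no larger than the adjusted value of any larger p-value.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


raw_p = [0.001, 0.02, 0.04, 0.30, 0.75]  # invented p-values, one per probe
fdr = benjamini_hochberg(raw_p)

# Keep probes that pass both criteria named in the text.
selected = [i for i, (p, q) in enumerate(zip(raw_p, fdr))
            if p < 0.05 and q < 0.20]
print(selected)  # [0, 1, 2]
```

With thousands of probes on an array, the adjustment is far more stringent than in this five-value example, which is why a probe with a modest raw p-value can still fail an FDR threshold.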
With the production of the Lund-zfa array after ESTIMA Build 2, we were able to compare hybridization results using the same tissue samples on the two array platforms. Almost all of the transcripts represented on each platform were detected in adult male and female brains, and there was a general concordance between the platforms on the estimated log2-fold changes for the transcripts they had in common. While the exact "significant" gene lists varied slightly, much of this difference is probably due to differences in sample treatment and analyses between the two array types, which determine the probability of detecting differences. For example, the SoNG-20 K arrays used a common reference design, which helps to normalize differences between arrays, whereas the individual samples had additional shipping and handling before being hybridized to the Lund-zfa arrays, which could have added to their variability. Despite the differences between the platforms, a principal components analysis (Fig. 2B) shows that the biological differences between males and females dominate over any technical differences between the arrays. In sum, both array platforms describe a similar overall picture of differential gene expression and should be useful for further studies.
When we surveyed the songbird research community for interest in using the 20 K array, we recognized that there was considerable interest in experiments involving species other than the zebra finch (Table 7). We initiated comparative genomic hybridizations at this point primarily to evaluate the feasibility of cross-hybridizations with other species. The use of single-species microarrays for species cross-hybridizations is a controversial point in the literature (e.g., pro: [44–50]; con: ; cautionary: [49, 52, 53]). Direct comparison of gene expression in different species on a single-species array is especially problematic, as variations in gene copy number, sequence divergence and RNA expression levels are all confounded. However, for most of the experiments initially proposed for Community Collaborations, the goals were to analyze particular phenomena within a single species, focusing on phenomena that are not well represented in the zebra finch (e.g., effects of photoperiod). We believe that our cross-hybridization studies clearly validate our 20 K cDNA microarray for within-species comparisons of other oscine songbirds (Fig. 3), with the caveat that a minority of the probes will give reduced signals compared to hybridizations with zebra finch material. One must also interpret any annotations with particular care, and cross-hybridization results based on particular zebra finch array probes may need to be verified independently, e.g., by resequencing in the target species and RT-PCR to confirm regulation.
The Lund-zfa array was developed with our second build of ESTIMA:Songbird (zebra finch), and it was specifically designed with the intent of supporting research in other songbird species. Quantitative analysis of CGH to DNA from the common whitethroat (Sylvia communis; family Sylviidae) supports the efficacy of cross-hybridization, as 96% of the ESTs are called as present. Taking these studies together with our own, zebra finch probes have now been shown to be adequate for detecting signals in all three major superfamilies of the Passerida parvorder of the oscines (divergence time < 50 MYA) [1, 54, 55]. However, our data for the kingbird (Fig. 3) suggest that use of zebra finch arrays for analysis of sub-oscines (divergence time ~70 MYA) may be more problematic. Direct empirical tests are needed to establish viability for use with Corvid species, oscines that diverged from the Passerida ~50 MYA.