A salmonid EST genomic study: genes, duplications, phylogeny and microarrays

Background Salmonids are of interest because of their relatively recent genome duplication and their extensive use in wild fisheries and aquaculture. A comprehensive gene list and a comparison of genes in some of the different species provide valuable genomic information for one of the most widely studied groups of fish. Results 298,304 expressed sequence tags (ESTs) from Atlantic salmon (69% of the total), 11,664 chinook, 10,813 sockeye, 10,051 brook trout, 10,975 grayling, 8,630 lake whitefish, and 3,624 northern pike ESTs were obtained in this study and have been deposited into the public databases. Contigs were built and putative full-length Atlantic salmon clones have been identified. A database containing ESTs, assemblies, consensus sequences, open reading frames, gene predictions and putative annotation is available. The overall similarity between Atlantic salmon ESTs and those of rainbow trout, chinook, sockeye, brook trout, grayling, lake whitefish, northern pike and rainbow smelt is 93.4, 94.2, 94.6, 94.4, 92.5, 91.7, 89.6, and 86.2% respectively. An analysis of 78 transcript sets shows Salmo as a sister group to Oncorhynchus and Salvelinus within Salmoninae, and Thymallinae as a sister group to Salmoninae and Coregoninae within Salmonidae. Extensive gene duplication is consistent with a genome duplication in the common ancestor of salmonids. Using all of the available EST data, a new expanded salmonid cDNA microarray of 32,000 features was created. Cross-species hybridizations to this cDNA microarray indicate that this resource will be useful for studies of all 68 salmonid species. Conclusion An extensive collection and analysis of salmonid RNA putative transcripts indicate that Pacific salmon, Atlantic salmon and charr are 94–96% similar while the more distant whitefish, grayling, pike and smelt are 93, 92, 89 and 86% similar to salmon. The salmonid transcriptome reveals a complex history of gene duplication that is consistent with an ancestral salmonid genome duplication hypothesis. Genome resources, including a new 32 K microarray, provide valuable new tools to study salmonids.


Background
Extensive knowledge of trout and salmon is a result of their widespread use in scientific research, as an environmental sentinel species and as a food and sport fish. Perhaps more is known about the physiology, ecology, genetics, behavior and biology of salmonids than any other fish group [1]. This background provides a wealth of data from an economically important and phylogenetically distinct group of fish that can help guide, and benefit from, new genomic studies.
The common ancestor of salmonids is purported to have experienced a whole genome duplication event between 25 and 100 MYA [6,7]. Extant salmonids are considered pseudo-tetraploid as they are in the later stages of reverting to a stable diploid state. Evidence for the ancestral salmonid autotetraploid genome duplication includes: multivalent chromosome formation during male meiosis and evidence for tetrasomic segregation at some loci [6]; one of the larger euteleost genome sizes (3-4.5 pg), about double that of the sister groups Esociformes (0.8-1.8 pg, pike) and Osmeriformes (0.7 pg, smelt) [8]; homeologous chromosomal segments based on recent genetic maps and comparative studies using microsatellite markers; and duplicated gene family studies such as Hox, Major Histocompatibility complex (MH), growth hormone, and nineteen allozymes [6,[9][10][11][12]].
The genome duplication in salmonids is the most recent genome duplication in this lineage. There are now a number of studies and good evidence, primarily from sequenced zebrafish and pufferfish genome sequences, for tetraploidization/rediploidization early in the ray-finned fish lineage (350-400 MYA) [13][14][15][16]. Several of these studies have suggested that the ancestral fish duplication, in addition to the two ancestral vertebrate genome duplications, is part of the reason why ray-finned fishes make up nearly half of all extant vertebrate species and exhibit tremendous biodiversity affecting their morphology, ecology, behavior and evolution.
Vertebrate species diversity and body plan diversity have commonly been linked to genome duplications, although there is some debate on how well we can draw these conclusions based on the very old genome duplications commonly studied. Mechanistically, how a genome reorganizes itself to cope with duplicated chromosomes, gene dosage effects, and the role of gene duplications for evolution and adaptation are long-standing issues in biology that remain unresolved [6,[13][14][15][16][17]]. The number and diversity of salmonid species, and their relatively recent genome duplication, make salmonids ideal for examining recent events that could have played such a pivotal role in generating gene diversity and species diversity found in modern vertebrates.
The genomics resources of salmonids are being rapidly expanded through a few large-scale genomics programs [18][19][20][21][22][23]. Here we identify 354,061 new ESTs from Atlantic salmon and several other salmonid and related species in order to obtain a comprehensive view of the salmonid transcriptome, identify species relationships, identify gene duplications and introduce a new 32 K microarray tool for transcriptome analysis.

Results and discussion
cDNA libraries
New, directionally cloned, mixed tissue (brain, kidney and spleen), normalized cDNA libraries were constructed for Atlantic salmon (Salmo salar; European McConnell, and Canadian, Saint John River strains), chinook salmon (Oncorhynchus tshawytscha), sockeye salmon (Oncorhynchus nerka), brook trout (Salvelinus fontinalis), lake whitefish (Coregonus clupeaformis), grayling (Thymallus thymallus), and northern pike (Esox lucius). Separate normalized libraries were constructed from Salmo salar thymus, thyroid, and head kidney tissues. In addition, one full-length, mixed tissue, large insert (> 2 kb), non-normalized library was constructed to identify longer gene transcripts. cDNA clones were isolated, purified and sequenced from the 5' and 3' ends. Clone numbers and insert sizes for the libraries and species included in this study are listed in Table 1.

Transcript analysis: sequence and assembly
To obtain a comprehensive list of genes in salmonids, we used a strategy of deep 5' and 3' EST sequencing from a few high quality libraries. This approach complements previous studies, which examined more limited EST surveys of cDNA libraries from a large number of different tissues and developmental stages [18][19][20][21][22]. For Atlantic salmon, over 30,000 clones were sequenced from each of the thymus, thyroid, and head kidney tissue libraries. From previously described normalized libraries, [19] 9,584 additional clones were sequenced from the Atlantic salmon pyloric caecum tissue library and 60,288 additional clones were sequenced from a mixed tissue library (rgb2; Table 1). The total number of clones examined from the rgb2 library was 84,176 which yielded 127,660 sequence reads or 30% of the total Atlantic salmon EST database. Even with this deep sequencing, nearly 13% of the last 637 reads were novel (< 99% over 100 bp) and the maximum redundancy for a single transcript from the rgb2 library was 58 (Table 1).
The results of the assembly of 298,304 Atlantic salmon ESTs obtained in this study along with 138,325 ESTs from previous studies [[19,22], GenBank] are shown in Table 2. Due to the complexities of the salmonid genome duplication and because it provides a stable, conservative starting point for all subsequent analyses, our analysis began with a first stage assembly using stringent parameters (PHRAP: 0.99 repeat stringency and 100 minscore). A second stage assembly (0.96 repeat stringency and 300 minscore) was implemented to combine some of the contigs that may be alleles or possibly very recent gene duplications (distinguishing among alleles, minor assembly errors, miscalls and very recent gene duplications, particularly in lower quality sequence regions, is very difficult in the absence of genomic sequence data). In Atlantic salmon, 81,398 potential transcripts (2 stage assembly) were identified, of which 29,844 (37%) were similar (BLASTX, 1e-10) to annotated sequences in CDD or SwissProt protein databases. For comparison, an assembly of 246,704 ESTs from rainbow trout [[18,21], GenBank] resulted in 51,199 transcripts, of which 19,266 (38%) had BLASTX hits. Assembled contigs are available [24].
Transcript surveys of additional salmonid species included 4,800-7,500 clones sequenced from each of chinook salmon, sockeye salmon, brook trout, grayling and lake whitefish. 11,664 sequences were obtained from chinook salmon, 10,813 sequences from sockeye salmon, 10,051 sequences from brook trout, 10,975 sequences from grayling and 8,630 sequences from lake whitefish. Sequence, assembly and summary statistics are shown for those data obtained in this study (Table 1) and when combined with data from public databases (Table 2). In addition, to provide non-genome-duplicated sister group comparisons, 2,304 clones were sequenced from northern pike (3,624 sequences) (Tables 1 and 2). For many of these species, the ESTs provided in this study represent nearly all or most of the known transcripts. Recently published data from rainbow smelt (Osmerus mordax) [23] were also included in Table 2.
To examine the relationships among the contig consensus sequences of Atlantic salmon, we compared all contigs (including singletons) against each other by BLAST and plotted the number of top pair-wise alignments (E-value < 1e-50; length > 200 bp) against the identity score (Figure 1). 36,775 contigs showed greater than 80% identity over 200 bp to at least one other contig. Of these, 12,883 were 97-99.9% similar to at least one other contig. These contigs may represent alleles, recent duplicates or errors in sequence data. 23,892 contigs show between 80 and 96.9% identity with at least one other contig. The large number of duplicated transcripts observed in the Atlantic salmon genome is consistent with the hypothesis of an ancestral salmonid genome duplication, though it is surprising that so many of the duplicated contigs are so similar. This observation is being pursued further in a separate study. The analysis of contig similarity shows that the majority of the 81,398 contigs represent distinct transcripts. Note that since the assembly process itself combines sequences with high levels of similarity (> 96% repeat stringency with minscore > 300; see Methods), very recent duplications may not all be identified in this process. Furthermore, since the species used in this study differ by greater than 5% (Table 3), this process would be expected to identify ancestral salmonid duplications occurring at or prior to the rainbow trout and Atlantic salmon speciation.
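The binning behind this contig-versus-contig comparison can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes the all-vs-all BLAST results are already available as tuples sorted best hit first per query (as in BLAST tabular output), and the function names `top_hit_identities` and `identity_classes` are hypothetical. The treatment of 100% identical hits is also an assumption, since the text does not say how they were counted.

```python
from collections import Counter


def top_hit_identities(rows, max_evalue=1e-50, min_length=200):
    """Return {query: percent identity of its top non-self alignment}.

    rows: iterable of (query, subject, percent_identity, alignment_length,
    evalue), assumed to be ordered best hit first for each query.
    Thresholds follow the text: E-value < 1e-50 and length > 200 bp.
    """
    best = {}
    for query, subject, pident, length, evalue in rows:
        # Skip self-hits and queries whose top alignment is already recorded.
        if query == subject or query in best:
            continue
        if evalue < max_evalue and length > min_length:
            best[query] = pident
    return best


def identity_classes(best):
    """Count contigs per identity class used in the text."""
    counts = Counter()
    for pident in best.values():
        if 97.0 <= pident < 100.0:
            counts["97-99.9%"] += 1   # possible alleles, recent duplicates, errors
        elif 80.0 < pident < 97.0:
            counts["80-96.9%"] += 1   # candidate older duplicates
        else:
            counts["other"] += 1      # assumption: 100% or <= 80% identity
    return counts
```

Plotting the values of `top_hit_identities` as a histogram would give a figure of the same kind as Figure 1, under the stated assumptions.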
Determining the number of genes in Atlantic salmon from the number of EST contigs is difficult for several reasons: 1) the partial representation of genes by EST sequences may result in several contigs associated with a single gene transcript, 2) allelic or recently duplicated genes may be represented by similar but unique transcripts (this latter case is particularly problematic in pseudotetraploid salmonids), 3) alternative splicing, alternative polyadenylation and termination sites from the same gene can result in different transcripts, and 4) transcription products can occur from intergenic regions. An estimation of the number of genes in salmonids will require additional information such as full-length cDNA sequences and gene mapping information.

Salmonid comparisons
Similarity among the different salmonid species was assessed using the top BLASTN hit against Atlantic salmon and rainbow trout EST contig databases. The similarity values from chinook salmon, sockeye salmon, rainbow trout, Atlantic salmon (McConnell and Saint John River strains), brook trout, grayling, lake whitefish, northern pike and rainbow smelt are shown in Table 3, along with the contigs matching (E-value < 1e-25) at least one contig in the rainbow trout or Atlantic salmon databases. These comparisons provide only a very general indication of the similarity between transcriptomes of various salmonids, as assemblies contain both 5' (generally genic regions) and 3' (generally 3'-UTR regions) transcript reads. However, these DNA sequence similarity values correspond well to the limited number of values in the literature. Non-coding sequence similarity between rainbow trout and Atlantic salmon is 95% over 120 kb in MH class IA and B loci [12], and 93-97% over 4 kb in growth hormone (GH) genes [11]. Similarity between salmon and whitefish is 90-93% in GH genes [11].
Figure 1 Number of aligned contigs (y-axis) out of 81,398 total contigs is plotted against percent similarity of alignments (x-axis).
Northern pike and rainbow smelt average 89.4 and 86.1% identity to rainbow trout and 89.6 and 86.2% identity to Atlantic salmon, but only 25-39% of these contigs matched anything in the rainbow trout or Atlantic salmon database. These latter comparisons have many fewer significant similarities identified partly because of the much older divergence times [3]. However, the reason for the lower than expected number of matches between northern pike and rainbow trout or Atlantic salmon is not clear. While the more distantly related rainbow smelt contigs show similar numbers of BLASTX hits to protein databases as salmonids, the northern pike contigs showed very few similarities to Atlantic salmon and rainbow trout contigs (25% compared to 39% for rainbow smelt and 87% for lake whitefish) and very few BLASTX hits to protein databases (8% compared to 50% for rainbow smelt and 36% for lake whitefish). One possible explanation is longer 3'-UTRs in northern pike, but this remains to be confirmed.

Transcriptome representation
It is difficult to assess how comprehensive the extensive Atlantic salmon and rainbow trout EST databases are. However, 73% (37,573 of the 51,199) of all rainbow trout contigs are also found in Atlantic salmon. Moreover, only 28% of those transcripts unique to rainbow trout (13,626) have protein hits (E < 1e-25) that support their legitimacy as genic regions, while other single ESTs may be from spurious transcription. 91% of lake whitefish transcripts have a significant similarity (BLASTN comparisons with e-values less than 1e-25) to the Atlantic salmon or rainbow trout databases. Comparative data from chinook salmon, sockeye salmon, brook trout, grayling, lake whitefish and rainbow smelt are provided in Table 3.
Overall, these data provide support for extensive gene coverage in salmonid EST databases.

Full-length analysis
The rapid progress of EST sequencing has enabled an estimation of the number of full-length cDNA clones. Full-length cDNAs (fl-cDNAs) are defined as having a "Start - Open Reading Frame (ORF) - Stop - 3' UTR - polyA signal" with the ORF corresponding to a full-length protein.
Given multiple start and stop sites, alternative splicing and partial homologies to known proteins, it is difficult to give precise numbers of completed fl-cDNAs. However, TargetIdentifier (using BLAST comparisons to full-length genes in databases and Start signals; [25]) identifies 17,399 possible fl-cDNAs (averaging 1,361 bp in length) from the 81,398 possible transcripts in Atlantic salmon and 10,453 fl-cDNAs from the 51,199 rainbow trout transcripts. Thus far, about half of the predicted fl-cDNAs meet all of the criteria above, and many of the fl-cDNAs are already fully characterized on a single clone. These tend to be the shorter (< 1.5 kb) genes. The list of over 10,000 putative fl-cDNA transcripts assembled from ESTs is available at the GRASP website [24] and further identification of clones for complete sequence analysis is underway.

Salmonid EST, assembly, ORF and annotation database
All ESTs have been deposited in GenBank; however, the EST assemblies themselves and the resulting consensus sequences are also very useful in identifying genes. These assemblies, together with the raw data, are available [24]. The assembly consensus sequences are available for download and for searching using BLAST tools. A contig visualization tool was developed to allow users to search for similar consensus sequences using BLAST searches, identifying consensus names and then visualizing the sequences, alignment, open-reading frames (ORFs), TargetIdentifier predictions, and BLASTX hits in a single view (Figure 2: Cluster tools). Until such time as the genomes are completed, this database provides the salmonid community with access to several levels of EST and gene analyses.

Salmonid phylogeny and gene duplication
The relationships among major groups of salmonids have been largely unresolved, particularly with respect to the placement of Salvelinus (represented in this study by brook trout), Oncorhynchus (represented here by rainbow trout, chinook and coho salmon) and Salmo (Atlantic salmon) within Salmoninae, and the placement of Thymallinae (grayling), Coregoninae (whitefish) and Salmoninae (salmon) within Salmonidae [2,11,[26][27][28][29][30]]. From the EST contigs (Table 2), 78 separate gene sets have been identified, each of which contained at least one EST contig sequence from each of five major salmonid genera (Oncorhynchus, Salmo, Salvelinus, Coregonus, and Thymallus), in addition to representation by a non-salmonid (Osmerus). Contig sequences within each gene set were aligned, trimmed to a common length (minimum of 300 bp) and analyzed using phylogenetic methods. 73 of the 78 gene sets could be identified by BLASTX searches to SwissProt databases (Table 4). For each gene set, a 70% neighbour-joining (NJ) consensus tree based on 500 bootstrap replicates was generated and the consensus tree rooted with Osmerus mordax sequences (rainbow smelt).
Figure 2 Screen shot of Atlantic salmon contig viewer.
The single species tree shown in Figure 3 represents a compilation of the phylogenetic results from 78 gene sets. In the summary tree, each branch is annotated with: i) the number of 70% consensus NJ trees supporting the branch, ii) the number of 70% consensus trees providing no resolution to the branch point, and iii) the number of consensus trees that conflict with the shown result. In this summary, the placement of Salmo as a sister group to Oncorhynchus and Salvelinus is supported in 18 of the 27 gene consensus trees for which resolution was found. Eight alternative consensus trees support grouping Salmo and Salvelinus, one consensus tree supports grouping Salmo and Oncorhynchus, and the remaining 51 trees provide no resolution. Thus the overall result is in agreement with some of the more recent studies examining mitochondrial and nine nuclear genes [26], and suggests good support for grouping Oncorhynchus and Salvelinus apart from Salmo within the Salmoninae subfamily.
Consistent with traditional nomenclature, the Salmoninae group, which includes Salvelinus, Oncorhynchus and Salmo is also very well supported with 51 of the 54 resolved trees consistent with this grouping. The three discrepant trees supported a Salmo/Coregonus grouping.
The relationships among the three subfamilies within Salmonidae have not been extensively addressed at the molecular level. However, on the basis of a morphological analysis, Coregoninae (whitefish and ciscos) has been hypothesized as the earliest branch within the salmonids [[29], also see [30]]. In the present analysis, 14 of the 25 informative gene sets are more consistent with the basal position of Thymallinae (Figure 3). Of the discrepant trees, 8 sets support a Thymallinae/Coregoninae grouping and 3 support an ancestral position of Coregoninae. While these data are not definitive, there appears to be some support for an ancestral Thymallinae branching within the Salmonidae with Coregoninae as the sister group to Salmoninae. These data provide the first large-scale molecular view of salmonid subfamily relationships and provide an important perspective on future analyses of duplicated genes, as well as physiological and ecological traits [27] that have evolved subsequent to the ancestral salmonid genome duplication.
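The tallies reported for the three branch points can be checked with a short calculation. The sketch below only restates counts given in the text. The number of unresolved trees for the second and third branch points is derived under the assumption that all 78 gene sets were scored at every branch point, which the text states explicitly only for the first. The branch labels and the `summarize` helper are illustrative names.

```python
# Tallies of 70% consensus NJ trees, copied from the text (78 gene sets).
TOTAL_GENE_SETS = 78

branches = {
    # 18 supporting; 8 (Salmo + Salvelinus) and 1 (Salmo + Oncorhynchus) conflicting
    "Salmo sister to Oncorhynchus + Salvelinus": {"support": 18, "conflict": 8 + 1},
    # 51 of 54 resolved trees; 3 support a Salmo/Coregonus grouping
    "Salmoninae (Salvelinus, Oncorhynchus, Salmo)": {"support": 51, "conflict": 3},
    # 14 of 25 informative sets; 8 + 3 discrepant
    "Thymallinae basal within Salmonidae": {"support": 14, "conflict": 8 + 3},
}


def summarize(tally):
    """Return resolved/unresolved counts and the supporting fraction."""
    resolved = tally["support"] + tally["conflict"]
    return {
        "resolved": resolved,
        # Assumption: every gene set was scored at this branch point.
        "unresolved": TOTAL_GENE_SETS - resolved,
        "support_fraction": tally["support"] / resolved,
    }


for name, tally in branches.items():
    s = summarize(tally)
    print(f"{name}: {tally['support']}/{s['resolved']} resolved trees "
          f"({s['support_fraction']:.0%}), {s['unresolved']} unresolved")
```

For the first branch point this reproduces the 27 resolved and 51 unresolved trees stated above; the supporting fractions are roughly two thirds, 94% and 56% respectively, which is why the Salmoninae grouping is described as very well supported while the Thymallinae placement is described as not definitive.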
The salmonid whole genome duplication hypothesis makes it difficult to separate an analysis of species relationships from gene phylogeny. One expectation arising from a relatively recent genome duplication is evidence for extensive nuclear gene duplicates. Subsequent to the genome duplication, the number of observed duplicated transcribed genes is expected to decrease as, over time, one of the duplicates becomes transcriptionally inactive.
When multiple species are examined, some species may have both duplication products while other species may have only one representative. Evidence of an ancestral duplication is identified in gene trees that contain multiple species trees that may have missing representatives. Of the 78 gene sets examined in this study, 51 show clear evidence of multiple species trees within gene trees that are consistent with a gene duplication in the ancestor of Salmonidae, sometime after the separation of Osmeriformes and Salmoniformes fish. 23 gene sets (Table 4) provided no evidence for any ancestral gene duplication, and 4 sets could not be interpreted. The data from 78 gene sets representing 372 consensus sequences and 11,397 bp of aligned DNA from five salmonid genera, indicate that a large number of salmonid genes show evidence of extensive gene duplication at a phylogenetic position that is consistent with the whole genome duplication in the ancestral Salmonidae hypothesis. Further studies of Esociformes fish will more precisely establish the timing of some of these gene duplications.

Salmonid 32 K microarray
To use the data generated by ESTs and assemblies for examining gene expression, a new 32,000 feature cDNA microarray was developed. This new array is based on the existing 16 K GRASP array [31] plus 14,496 additional Atlantic salmon and 1,491 additional rainbow trout contigs that were identified as unique and were successfully amplified in this study. The 32 K cDNA microarray is composed mainly of 27,917 Atlantic salmon (AS) and 4,065 rainbow trout (RT) cDNA elements or features. 54% of the elements have fairly stringent (1e-10) hits to annotated members in public protein databases. Hybridization performance of this array was evaluated using Atlantic salmon, rainbow trout, coho salmon, brook trout and lake whitefish RNA obtained from liver. The success of hybridization of labeled target to the salmonid elements was judged by the numbers of Atlantic salmon and rainbow trout elements passing background plus 2 SD threshold values (see Methods). No transformations or normalizations were performed on the data. Overall statistics are presented in Table 5. In summary, for RNA isolated from the liver of Atlantic salmon, rainbow trout, coho salmon, brook trout and lake whitefish, an average of 48% of the 32,018 elements showed significant detection levels of expression. Comparing these results to those from the previous 16 K GRASP arrays indicates that doubling the number of elements from 16 K to 32 K resulted in the ability to assess expression patterns of approximately 61% additional transcripts. This represents a substantial increase in our ability to assess gene transcription patterns in salmonids. The hybridization performances of the different salmonid species (assessed from numbers of Atlantic salmon and rainbow trout elements passing threshold) conformed to expectations, given the close evolutionary relationships of the species tested (92-94% identity, Fig. 3) and, with the possible exception of brook trout, all members of the family Salmonidae tested showed similar levels of hybridization to the Atlantic salmon and rainbow trout elements on the 32 K microarray. As the Salmonidae family represents 68 closely related species, the 32 K cDNA array provides an excellent opportunity to evaluate gene expression patterns of a large group of culturally and economically important species.
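The "background plus 2 SD" detection criterion can be illustrated with a short sketch. How the background distribution was measured is described in the Methods and is not reproduced here, so the following assumes a list of background spot intensities for one slide and uses the sample standard deviation; both choices, and the function names, are assumptions made for illustration.

```python
from statistics import mean, stdev


def detection_threshold(background_signals, n_sd=2.0):
    """Threshold = mean background + n_sd * sample standard deviation."""
    return mean(background_signals) + n_sd * stdev(background_signals)


def percent_passing(element_medians, threshold):
    """Percent of array elements whose median signal exceeds the threshold."""
    passing = sum(1 for signal in element_medians if signal > threshold)
    return 100.0 * passing / len(element_medians)


# Toy example with made-up intensities (not data from this study).
background = [100, 110, 90, 105, 95]
elements = [90, 120, 300, 115, 116]
cutoff = detection_threshold(background)
print(f"threshold = {cutoff:.1f}, passing = {percent_passing(elements, cutoff):.0f}%")
```

Applied per slide and averaged over biological replicates, a calculation of this form would yield percentages of the kind summarized in Table 5.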

Conclusion
Atlantic salmon and rainbow trout now rank 19th and 29th in terms of species representation in EST databases with over 730,000 salmonid ESTs in total. Almost half of these data are presented in this study. These data provide an excellent genetic resource for physiological, ecological, biochemical, behavioral, disease and biological studies of salmonids. They also provide key materials for the development of polymorphic markers for genetic and physical genomics maps, for the identification and analysis of proteins and for the development of microarrays and primers for transcriptional analyses.
Figure 3 Summary of 78 gene set consensus (70%) trees depicting the relationships among the major groups of Salmonidae (see Table 4).
Table 5 footnote: * Percent elements on cDNA array with median signal intensity greater than threshold (background signal + 2 SD). %CV is percent coefficient of variation and "n" is the number of biological replicates.

Transcript assemblies and analyses have identified over 81,000 possible transcripts from Atlantic salmon and 51,000 transcripts from rainbow trout. These assemblies and consensus sequences are available from the author or through a database housed on the GRASP website [24]. As many as 17,399 full-length Salmo salar gene assemblies are present in this database.
Comparison of orthologous ESTs from Atlantic salmon, rainbow trout, chinook salmon, sockeye salmon, brook trout, lake whitefish, grayling, northern pike and rainbow smelt shows that Pacific salmon (Oncorhynchus), Atlantic salmon (Salmo salar) and brook trout (Salvelinus fontinalis) average 94-96% similarity. Lake whitefish (Coregonus clupeaformis) and grayling (Thymallus thymallus) are more distant from Pacific and Atlantic salmon (92%), followed by northern pike (Esox; 89%) and rainbow smelt (Osmerus; 86%). A view of salmonid relationships and support for the salmonid genome duplication have been found. With the new EST database, a new, more extensive 32 K cDNA microarray has been developed to help assess gene expression patterns in salmonids.

Tissues, RNA, Aquaculture and Sampling
Salmo salar (McConnell strain), Oncorhynchus tshawytscha and Oncorhynchus nerka tissues were obtained from the Department of Fisheries and Oceans (Robert Devlin, WestVan Lab., West Vancouver, British Columbia). Salvelinus fontinalis and Coregonus clupeaformis tissues were obtained from Louis Bernatchez (Laval University, Quebec). S. salar (Saint John River strain; brain, kidney and spleen) tissues were obtained from Vanya Ewart (NRC Institute for Marine Biosciences, Nova Scotia). Thymallus thymallus brain, kidney and spleen tissues were obtained from Craig Primmer (University of Turku, Finland). Esox lucius were captured by gill net from Charlie Lake, British Columbia. All fish were euthanized, followed by rapid dissection of tissues. Tissues were flash frozen in liquid nitrogen or dry ice and stored at -80°C until RNA extraction.

cDNA libraries
Total RNA or poly(A)+ RNA (FastTrack MAG kit; Invitrogen) was extracted from flash frozen tissues. Salmo salar and Oncorhynchus tshawytscha mixed tissue (spleen, head kidney, brain) libraries were directionally constructed in both pCMV Sport-6.1 (Research Genetics Inc.) and pAL-17.3 (Evrogen Co.). S. salar (normalized head kidney, thymus and thyroid), Coregonus clupeaformis, Thymallus thymallus and Salvelinus fontinalis libraries were constructed in pAL-17.3 (Evrogen). The Oncorhynchus nerka mixed tissue normalized library was also constructed in pCMV Sport-6.1 (ResGen). S. salar (mixed tissue St. John strain) and Esox lucius libraries were constructed in pDNR-Lib using Creator SMART cDNA library construction kits (Clontech). Insert sizes of cDNA libraries were determined by visual comparison of clone restriction fragments with the DNA size markers HindIII (GibcoBRL) and 1 kb ladder (GibcoBRL).
The number of Salmo salar contigs was assessed using the PHRAP assembly program because of its ability to assemble very large numbers of ESTs in a single run, and its integration with PHRED base quality scores on primary reads and subsequent consensus sequences. The CAP3 assembler [39] was also used and similar results were obtained for smaller datasets. For this study, contig assembly employed a two-stage process. The first stage assembly used parameters 100 minscore and 0.99 repeat stringency to build contigs and consensus sequences that appeared to separate alleles of many transcripts. The second stage used the consensus sequences (with quality scores) from the first stage and parameters 0.96 repeat stringency and 300 minscore to build contigs and consensus sequences that appeared to combine some of the contigs that contained some base calling discrepancies, as well as what appeared to be alleles or very recently duplicated genes. Various parameters were tested and final parameters were chosen to minimize the number of contigs, where the number of contigs changed the least with respect to small changes in parameter values, and where distinct contigs appeared to have some biological significance (i.e., 0.99/100 appeared to separate many alleles and 0.96/300 as a second stage appeared to join some alleles and provided values that separated a clear majority of orthologous salmonid gene comparisons). With both sets of parameters, we were able to discriminate between similar sequences from different salmonid species. Sequences in contigs containing more than one polyA site were removed from the assemblies as they may represent chimeric clones.
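The two-stage parameterization can be written down compactly. The sketch below only builds the two command lines and does not run anything; the file names are hypothetical placeholders, and the option spellings `-minscore` and `-repeat_stringency` are assumed to match the local PHRAP installation and should be checked against its documentation before use.

```python
def phrap_command(fasta_path, minscore, repeat_stringency, executable="phrap"):
    """Build an argument list for one PHRAP run (nothing is executed here)."""
    return [
        executable,
        fasta_path,
        "-minscore", str(minscore),
        "-repeat_stringency", str(repeat_stringency),
    ]


# Stage 1: stringent assembly of the primary reads (tends to separate alleles).
# "salmon_ests.fasta" is a hypothetical input file name.
stage1 = phrap_command("salmon_ests.fasta", minscore=100, repeat_stringency=0.99)

# Stage 2: re-assemble the stage 1 consensus sequences with relaxed stringency
# and a higher minimum score. "stage1_consensus.fasta" is also hypothetical.
stage2 = phrap_command("stage1_consensus.fasta", minscore=300, repeat_stringency=0.96)

for command in (stage1, stage2):
    print(" ".join(command))
```

Keeping the two parameter sets side by side makes the design choice explicit: the second stage is looser on identity (0.96 versus 0.99) but demands a longer or better supported overlap (minscore 300 versus 100).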
Assemblies provide rough estimates of transcripts. Several algorithms have been examined and all have strengths and weaknesses. Examples of other assemblies include DFCI gene indexes [40,41] that estimate 83,554 TCs+singletons from 244,984 rainbow trout ESTs and 63,138 contigs from a partial 236,009 EST dataset from Atlantic salmon (these assemblies are periodically updated). INRA [21], using CAP3, estimates 56,392 transcripts (contigs + singlets) from 326,719 rainbow trout ESTs and 45,349 contigs from a partial Atlantic salmon EST database. UniGene [42], from NCBI, does not provide true assemblies and may cluster duplicated genes into single bins, which is problematic in salmonids. UniGene estimates approximately 30,000 and 25,000 UniGene sets in Atlantic salmon and rainbow trout respectively. While differences exist, the general number of estimated transcripts is similar. Problem areas that have been identified in assemblies tend to be associated with long transcripts, so these contigs will have to be treated carefully and perhaps manually edited. Assemblies are freely available from the author and the GRASP website [24]. As a caveat, because of the purported recent duplication of the salmonid genome and potential for misassembly of duplicated transcripts, these contigs have to be treated with caution.
Percent identity measures between contig consensus sequences from the various species were obtained from BLASTN alignments where a minimum length of 200 bp was observed. As in other distance measures, this finds the most similar sequence fragments and is biased high, particularly for more distant comparisons. A partial estimate of the impact on more distantly related sequence comparisons is the increased number of contigs for which no cross-species alignments were found and the reduction in average length of alignments. These values are provided in Table 3. However, the percent identity measure provides an estimate of observed similarity that is useful for evaluating potential cross-species DNA hybridizations in microarray experiments (see below).
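A cross-species summary of this kind (mean identity of top hits, mean alignment length, and the share of contigs with any qualifying alignment) could be computed along the following lines. This is a sketch under stated assumptions: the hits are supplied as tuples that include a bit score used to pick the top hit, and the function name `cross_species_summary` is hypothetical.

```python
def cross_species_summary(query_ids, hits, min_length=200):
    """Summarize top BLASTN hits of one species' contigs against another.

    query_ids: all contig identifiers of the query species.
    hits: iterable of (query, subject, percent_identity, length, bitscore).
    Only alignments of at least min_length bp are considered.
    """
    top = {}
    for query, subject, pident, length, bitscore in hits:
        if length < min_length:
            continue
        # Keep the highest-scoring qualifying alignment per query contig.
        if query not in top or bitscore > top[query][2]:
            top[query] = (pident, length, bitscore)

    matched = len(top)
    return {
        "matched": matched,
        "percent_matched": 100.0 * matched / len(query_ids),
        "mean_identity": sum(v[0] for v in top.values()) / matched if matched else None,
        "mean_length": sum(v[1] for v in top.values()) / matched if matched else None,
    }
```

Because only the best local alignment per contig enters the average, the resulting identity is biased high in the way described above, and the drop in `percent_matched` and `mean_length` for distant species gives a partial indication of that bias.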

Gene phylogenetic analysis
Contig sequences from Salmo salar (Atlantic salmon), Oncorhynchus mykiss (rainbow trout), Osmerus mordax (rainbow smelt), Coregonus clupeaformis (lake whitefish), Salvelinus fontinalis (brook trout), and Thymallus thymallus (grayling) [Table 2] were BLASTed against each other (e-value < 1e-35, hits > 100 bp) and the results used to generate clusters of contigs. Bins of similar sequences, or clusters, were generated containing all contigs, irrespective of species origin, that had alignments covering more than 75% of the length of the shorter sequence and had greater than 70% identity in the overlapping regions (alignments consisted of ends-free alignment with scores of 2/-2/-5/-1 for match/mismatch/open gaps/extend gaps [43]). After the contigs had been grouped into clusters, the individual clusters were further selected to contain only contigs that had mutually overlapping regions, and all contig members were trimmed to the largest common alignment (same alignment parameters as above). A good alignment was considered to be greater than or equal to 300 bp in length with greater than 60% identity in the overlapping region. At this point, clusters that did not contain at least one sequence from each of the six target species were discarded. This resulted in a dataset of 78 clusters or gene sets. All gaps (and their corresponding positions in other sequences of the cluster) were removed, and the data within each gene set were bootstrapped 500 times. The PHYLIP package was used because it offers many different analysis methods, is freely available and is commonly used [44]. Distance matrices were computed for each bootstrapped dataset within each cluster using the F84 model of nucleotide substitution and Gamma-distributed rates of variation across sites with a coefficient of variation of 0.5 [44]. Neighbor-joining trees were then computed from each set of distance matrices and the set of resulting bootstrapped trees was used to derive a 70%-majority consensus tree [44].
The consensus trees were rooted with Osmerus mordax, and simplified by iteratively collapsing all pairs of leaf nodes having the same species and showing ≥ 98% similarity in the aligned portion of their sequences. Independently, maximum likelihood trees were generated for all 78 data sets using the default options of the PHYLIP program dnaml (transition/transversion ratio of 2.0, empirical base frequencies, a constant rate among sites). A general evolutionary model was used for the 78 data sets because each set potentially consisted of a mixture of unidentified coding and noncoding data. All 78 ML trees were consistent with their 70%-consensus bootstrapped neighbor-joining counterparts. EST accession numbers used to make contig consensus sequences, alignments and the 70% consensus trees are available [see Additional file 1] or online at the GRASP website [24].
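The gap removal and bootstrap resampling applied to each gene set before the distance calculations can be illustrated with a short sketch. In the study these steps were carried out within the PHYLIP package; the pure-Python functions below (`strip_gap_columns`, `bootstrap_replicates`) are hypothetical stand-ins written only to show the logic, and they assume an alignment given as equal-length strings with `-` as the gap character.

```python
import random


def strip_gap_columns(alignment):
    """Drop every alignment column in which any sequence has a gap."""
    n_cols = len(alignment[0])
    keep = [i for i in range(n_cols)
            if all(seq[i] != "-" for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]


def bootstrap_replicates(alignment, n_reps=500, seed=0):
    """Yield bootstrap replicates of an alignment.

    Each replicate samples columns with replacement and has the same
    number of columns as the input. The default of 500 replicates matches
    the number stated in the text; the seed is an assumption added for
    reproducibility.
    """
    rng = random.Random(seed)
    n_cols = len(alignment[0])
    for _ in range(n_reps):
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        yield ["".join(seq[i] for i in cols) for seq in alignment]
```

Each replicate would then go through distance estimation and neighbor-joining, and the resulting 500 trees would be summarized as a majority consensus.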

Microarray clone selection
Starting from the existing GRASP 16 K cDNA microarray [24], additional clones were selected for representation on the following basis: a) the contig (Table 2) includes at least one clone that is on hand; b) the contig is of high quality with few conflicting positions, few singleton positions, no interior singleton positions (potential chimeric sites) and there are at least two clones in the contig (from at least 2 plates, and preferably from at least 2 libraries); c) if the contig is a singleton then it must have a good BLASTX hit (e-value < 1e-8) or other indication of orientation (e.g., consistent poly(A) tail information); d) the contig must have ≤ 94% identity to another sequence on the chip (the existing 16 K plus any new contig; not counting rainbow trout orthologs); and e) no tRNA, ribosomal, or mitochondrial sequences. We chose clone representatives within each contig based on: a) the reliability of the cDNA library and sequence; b) high similarity to the consensus of the contig (allowing 20 bp at the ends for poor trimming); c) reliable sequence from the 3'-end of the contig and correct (3' -> 5') orientation; and d) ownership of the clone.
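The contig-level selection criteria (a to e) can be expressed as a simple filter. The sketch below is illustrative only: the `Contig` record and its field names are hypothetical, and the qualitative criteria ("few conflicting positions", "few singleton positions") are omitted because the text gives no numeric cutoffs for them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contig:
    # All field names are assumptions made for this example.
    clone_on_hand: bool                 # criterion a
    n_clones: int                       # criteria b and c
    n_plates: int                       # criterion b
    interior_singleton_positions: int   # criterion b (potential chimeric sites)
    blastx_evalue: Optional[float]      # criterion c (None if no hit)
    has_orientation_evidence: bool      # criterion c (e.g., poly(A) tail)
    max_identity_to_chip: float         # criterion d, in percent
    is_trna_rrna_or_mito: bool          # criterion e


def is_candidate(c: Contig) -> bool:
    """Return True if the contig passes the quantifiable selection criteria."""
    if not c.clone_on_hand:
        return False
    if c.is_trna_rrna_or_mito:
        return False
    if c.max_identity_to_chip > 94.0:
        return False  # too similar to a sequence already on the chip
    if c.n_clones == 1:
        # Singletons need a good BLASTX hit or other orientation evidence.
        good_hit = c.blastx_evalue is not None and c.blastx_evalue < 1e-8
        return good_hit or c.has_orientation_evidence
    # Multi-clone contigs: at least 2 plates and no interior singleton positions.
    return c.n_plates >= 2 and c.interior_singleton_positions == 0
```

Treating criteria b and c as alternative branches (multi-clone versus singleton contigs) is one reading of the list above, since a singleton cannot satisfy the two-clone requirement in b.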

Microarray fabrication
The initial clones were robotically rearrayed from daughter glycerol stock 384-well plates into 96-well plates prefilled with 8% glycerol in 2XYT + ampicillin with a MicroGrid II-610 (Biorobotics, Cambridge, UK), incubated overnight at 37°C, and checked for uniform optical density. Plasmid inserts were PCR-amplified in an MJ Tetrad PTC-205 thermocycler (Bio-Rad, Hercules, CA, USA) by using 1.0 μL overnight culture, 0.3 μM M13/pUC forward primer (5'-CCCAGTCACGACGTTGTAAAACG-3'), 0.3 μM M13/pUC reverse primer (5'-AGCGGATAACAATTTCACACAGG-3'), 2 mM MgCl2, 10 mM Tris-HCl, 50 mM KCl, 200 μM dNTPs, 1U AmpliTaq (Roche Diagnostics, NJ, USA), and nuclease-free H2O (Qiagen, Valencia, CA, USA) to 100 μL. PCR conditions were as follows: 2 min at 95°C denaturation; 35 cycles of 30 sec at 95°C, 45 sec at 59°C, and 4 min at 72°C; and 7 min at 72°C. HotStar Taq (Qiagen) was used to amplify additional inserts (clone set 2) with an initial denaturation of 15 min. Amplicon specificity and yield were analyzed by capillary electrophoresis using the HT DNA SE 30 LabChip on the Caliper AMS 90 system (Zymark-Caliper Life Sciences, MA, USA). PCR products were robotically cleaned (Qiagen) and consolidated into 384-well plates, lyophilized by speed-vac, and resuspended in 20 μL 3× SSC plus 1.0 M betaine. All cDNAs (average printing concentration of 165 ng/μL [original inserts] and 100 ng/μL [new inserts]) were printed as single spots on Erie Aminosilane slides (Erie, Portsmouth, N.H., USA) with a Genetix QArraymax microarray printer (Genetix, New Milton, Hampshire, UK) or MicroGrid II-610 printer (Biorobotics, Cambridge, UK). All clones and controls were distributed randomly on the array. Genetix aQu 65 μm quill pins or Biorobotics 10 k quill pins in a 48-pin tool were used to deposit < 1.0 nL (0.1 ng cDNA) per spot onto the slide. The resulting microarrays have a 4 × 12 subgrid layout with 699 spots per subgrid, each spot having a diameter and pitch of 90-130 and 160-190 μm, respectively.
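As a quick consistency check on the printing figures quoted above, the array capacity and the cDNA mass per spot can be worked out directly. The function names below are hypothetical; the inputs (4 × 12 subgrids, 699 spots per subgrid, a deposition volume of at most 1.0 nL, and printing concentrations of 100 and 165 ng/μL) are taken from the text.

```python
SUBGRID_ROWS = 4
SUBGRID_COLS = 12
SPOTS_PER_SUBGRID = 699


def total_spot_positions(rows=SUBGRID_ROWS, cols=SUBGRID_COLS,
                         per_subgrid=SPOTS_PER_SUBGRID):
    """Total printable positions on the slide (one subgrid per pin)."""
    return rows * cols * per_subgrid


def cdna_mass_ng(conc_ng_per_ul, volume_nl):
    """cDNA mass deposited in one spot, converting nL to uL."""
    return conc_ng_per_ul * volume_nl / 1000.0


positions = total_spot_positions()          # 4 * 12 * 699 = 33552
mass_new = cdna_mass_ng(100.0, 1.0)         # upper bound for new inserts
mass_original = cdna_mass_ng(165.0, 1.0)    # upper bound for original inserts
```

The 48 subgrids correspond to the 48-pin tool, and the 33,552 positions are an upper bound on what was printed: they would cover control and buffer spots as well as the cDNA features. At 100 ng/μL, a 1.0 nL deposit corresponds to the 0.1 ng per spot stated in the text.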
A 280-bp GFP (green fluorescent protein) cDNA was amplified from a GFP clone (BD Biosciences, Mountain View, CA, USA) by using the primers (5'-GAAACATTCTTGGACACAAATTGG-3') and (5'-GCAGCTGTTACAAACTCAAGAAGG-3'), and printed in subgrid corners to assist in grid placement. The slides were crosslinked in a UV Stratalinker 2400 (Stratagene, La Jolla, CA, USA) at 300 mJ. One slide in every 20 to 30 slides was hybridized with labeled random 9-mer oligonucleotides (SpotQC, Integrated DNA Technologies, Coralville, IA, USA) and scanned using a GenePix 4200AL scanner (Molecular Devices, Sunnyvale, CA, USA). Presence/absence, shape, signal intensity vs. background, diameter and DNA binding site capability were measured for each spot on the slide using files generated by Imagene software (BioDiscovery Inc., El Segundo, CA, USA). The position and description of flagged spots (spots absent or thought to be unusable during post-hybridization analysis), sub-grid defects and other noticed irregularities were recorded. Two PCR fragments from each plate were randomly selected and sequenced to ensure correct matches to the original clone sequence in the EST database. For controls, Stratagene SpotReport Alien cDNA Array Validation system PCR products (Cat # 252550), composed of 10 unique PCR products, were spotted five times on the array. Corresponding mRNA for these PCR products can be purchased from Stratagene. The alien mRNA spikes can be used to determine mRNA quality and cDNA synthesis efficiency, to serve as positive and negative hybridization controls, to normalize for dye differences and to determine hybridization consistency.

Microarray hybridizations
The microarray experiments were designed to comply with MIAME guidelines. To minimize technical variability, all targets were synthesized in one round and hybridization experiments were conducted on slides from a single batch. Each hybridization experiment included dye-flips to compensate for cyanine fluor effects. Total RNA samples were quantified and quality-checked by spectrophotometer and agarose gel, respectively. All hybridization experiments were performed using the SuperScript III Indirect cDNA Labeling System kit, following the manufacturer's instructions (Invitrogen). Briefly, total RNA was reverse transcribed using an anchored oligo d(T)20 primer in cDNA synthesis reactions that incorporated aminoallyl- and aminohexyl-modified nucleotides. The modified cDNAs were then labeled with fluorescent Cy5 or Cy3 dye in reactions with the amino-functional groups in coupling buffer.

Microarray analyses
Fluorescent images of hybridized arrays were acquired immediately at 10 μm resolution using a ScanArray Express scanner (PerkinElmer). The Cy3 and Cy5 cyanine fluors were excited at 543 nm and 633 nm, respectively, at the same laser power (90%), with photomultiplier tube settings adjusted between slides to balance the Cy5 and Cy3 channels. Fluorescence intensity data were extracted from TIFF images using Imagene 5.6.1 software (Biodiscovery). Quality statistics were compiled in Excel from raw Imagene fluorescence intensity report files. The hybridization performance of labeled targets to salmonid features was assessed as the percentage of features bound, calculated from the numbers of Atlantic salmon (AS) and rainbow trout (RT) features passing a hybridization signal threshold. The signal threshold was defined as 2 standard deviations above the mean signal of the 3× SSC/betaine buffer spots. Outliers among the buffer spots were removed using the Median Absolute Deviation method [45], whereby elements with a test statistic value greater than 5 were removed. No transformations or normalizations were performed on these data. Only features deemed present by Imagene 5.6.1 (excluding marginal and absent values) were used for analyses.
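The threshold calculation can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure as implemented in the study: the test statistic is taken to be |x - median| / MAD without a scaling constant, the sample standard deviation is used, and all function names are hypothetical. The cutoff of 5 and the 2 standard deviation rule come from the text.

```python
import statistics


def mad_filter(values, cutoff=5.0):
    """Remove outliers whose MAD-based test statistic exceeds `cutoff`.

    Assumed statistic: |x - median| / MAD, with no scaling constant.
    """
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return values  # no spread to scale by; keep everything
    return [v for v in values if abs(v - med) / mad <= cutoff]


def signal_threshold(buffer_signals, n_sd=2.0, cutoff=5.0):
    """Mean of the filtered buffer-spot signals plus n_sd standard deviations."""
    kept = mad_filter(buffer_signals, cutoff)
    return statistics.mean(kept) + n_sd * statistics.stdev(kept)


def percent_bound(feature_signals, threshold):
    """Percentage of features whose signal exceeds the threshold."""
    feature_signals = list(feature_signals)
    passing = sum(1 for s in feature_signals if s > threshold)
    return 100.0 * passing / len(feature_signals)
```

In use, `signal_threshold` would be computed once per slide and channel from the 3× SSC/betaine buffer spots, and `percent_bound` would then be applied to the raw signals of the features called present by the image analysis software.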