Fish and chips: Various methodologies demonstrate utility of a 16,006-gene salmonid microarray

Background We have developed and fabricated a salmonid microarray containing cDNAs representing 16,006 genes. The genes spotted on the array have been stringently selected from Atlantic salmon and rainbow trout expressed sequence tag (EST) databases. The EST databases presently contain over 300,000 sequences from over 175 salmonid cDNA libraries derived from a wide variety of tissues and different developmental stages. In order to evaluate the utility of the microarray, a number of hybridization techniques and screening methods have been developed and tested. Results We have analyzed and evaluated the utility of a microarray containing 16,006 (16K) salmonid cDNAs in a variety of potential experimental settings. We quantified the amount of transcriptome binding that occurred in cross-species, organ complexity and intraspecific variation hybridization studies. We also developed a methodology to rapidly identify and confirm the contents of a bacterial artificial chromosome (BAC) library containing Atlantic salmon genomic DNA. Conclusion We validate and demonstrate the usefulness of the 16K microarray over a wide range of teleosts, even for transcriptome targets from species distantly related to salmonids. We show the potential of the use of the microarray in a variety of experimental settings through hybridization studies that examine the binding of targets derived from different organs and tissues. Intraspecific variation in transcriptome expression is evaluated and discussed. Finally, BAC hybridizations are demonstrated as a rapid and accurate means to identify gene content.


Background
Atlantic salmon are part of the family Salmonidae, which comprises all salmon, trout, whitefish, grayling, and charr. A tremendous amount of basic biology is already known about salmonids from studies carried out on their physiology, population dynamics, behavioural ecology and phylogenetics [1]. Salmon also provide an excellent model system in which to study fundamental genetic mechanisms of growth, development, reproduction and response to infection and disease. For example, salmonids serve as prominent models for studies involving environmental toxicology [2], carcinogenesis [3], comparative immunology [4], the molecular genetics and physiology of the stress response [5], olfaction [6], vision [7], osmoregulation [8], growth [9] and gametogenesis [10].
Answers to fundamental scientific questions can also be gained from the study of salmonid genomes. The ancestor of all extant salmonids underwent a whole genome duplication and, after a series of subsequent genetic events, salmon are now considered to be pseudo-tetraploid. How a genome reorganizes itself to cope with duplication and the importance of gene duplications for evolution and adaptation are long-standing issues that remain unresolved. Questions regarding the origins of genomes have direct implications for our understanding of the roles of gene families, duplication and deletion of segments of genomes, and the mutational process in human health and disease. They also provide a foundation for understanding the genome of Atlantic salmon to benefit conservation and enhancement of wild stocks, aquaculture and environmental assessments. Genomic resources enable us to address fundamental scientific questions concerning the evolution of salmonid genomes, and the expression of genes and proteins in a wide variety of natural and altered environments and conditions. Toward these goals, more than 175 cDNA libraries have been constructed from a wide variety of tissues and different developmental stages and more than 300,000 salmonid cDNA sequence reads have been combined from a consortium comprising groups from Canada . These sequences were assembled into over 40,000 unique contigs. A preliminary microarray of 3,557 cDNAs was constructed and assessed for its ability to provide new data in the study of cellular and tissue responses to pollutants, diseases and stress, as well as for reproduction and development [11][12][13][14][15]. On the basis of these results, a larger array of 16,006 genes has been constructed and initial results have shown sensitivity of gene expression patterns to disease challenge, and to small environmental and physiological changes [16].

Results and discussion
Library construction (directional cloning by 5'EcoRI, 3'XhoI in pBluescript II XR, Stratagene; or TOPO TA cloning of suppression subtractive hybridization PCR products, Invitrogen and Clontech) and subsequent EST sequencing (using M13 forward primer) were designed to generate 3'-end sequences to enable us to distinguish between potential paralogs arising from the recent salmonid genome duplication. We have determined from a weighted average measurement comparing four different directionally-cloned library types (such as non-normalized versus normalized libraries) that approximately 9% of inserts are in the reverse orientation and therefore yield 5' sequence with the M13 forward primer [11]. The GRASP 3'-end reads were used as a framework on which to build the contigs from additional data provided by the NRC, INRA, USDA/ARS and the NSVS. Part of the evaluation process for selecting genes for the microarray required criteria that would guard against chimeras. Simply put, this meant that each gene choice had to be part of a contig with multiple distinct clones covering each region, or that it was sufficiently similar to another sequence across its whole length that it was unlikely to be chimeric. We did select for immune-specific and reproduction-relevant genes for the microarray, but the preponderance of ESTs on the 16K chip were randomly picked based on EST cluster quality and uniqueness and therefore represent a wide variety of different classes of genes.

Application of a 16K cDNA microarray to different species
To explore the validity of using the 16K microarray with other fish species, the 13,421 Atlantic salmon (AS) and 2,576 rainbow trout (RT) cDNA features were interrogated with labeled liver targets from four members of the order Salmoniformes (AS, RT, chinook salmon and lake whitefish) and one member of the order Osmeriformes (rainbow smelt) (Table 1). The average percentage binding of AS, RT, chinook salmon, lake whitefish (LW) and rainbow smelt liver targets to the 16K chip was 54.0%, 63.3%, 51.0%, 50.6% and 30.1%, respectively. The average percentages of targets bound to AS and RT features for each species are also shown (Table 1).
Our study indicates that there are no significant differences in the percent of targets that bound to the 16K microarray for the four salmonids examined (AS, RT, chinook and LW). There is a similar hybridization performance for all salmonids. However, RT targets do consistently show higher overall binding to the microarray; the reason for this efficiency is not yet clear.
The hybridization performance of the rainbow smelt targets was roughly one-half that of the salmonid targets. Of the species contributing targets to our heterologous hybridization experiment, the osmerid targets were the most phylogenetically removed from the salmonid features. Indeed, a recent mitogenomic study places the Osmeroidei in a separate clade from the Salmoniformes [17]. These two clades are separated by at least 200 MY, with the Salmonidae having undergone at least one genome duplication event since their divergence [18,19]. Other factors, such as genome gene content (i.e., numbers of paralogs) and genome size, are also likely to affect the overall degree of hybridization [11].

Application of a 16K cDNA microarray to different tissues
Different tissues and organs exhibit differences in transcriptome complexity, depending on their cellular heterogeneity and differentiated specializations. The mRNAs of a typical somatic cell are divided into three classes based on their sequence complexity and diversity [20]. The most prevalent class consists of only a few mRNA species that comprise the abundant transcripts present in a cell. Often these transcripts are dedicated to cellular functions common to all tissues, but they usually represent genes that specify an organ's unique function. The high complexity class of mRNAs includes thousands (perhaps millions) of different mRNA species, each represented by fewer than 15 copies per cell [20].
However, it should be noted that some subsets of genes that have been thought to be unique to one organ have been found to be expressed in others. This has been demonstrated for transcripts in the brain-gonad axis, and is probably not exclusive to these organs. For example, mammalian pheromone/odorant receptors and specific piscine hormones and receptors of the brain are also expressed in the gonad [12,21,22]. To date, the biological functions of these transcripts in the gonad have not been determined, raising intriguing questions regarding multiplicity of functions for complex transcripts, even in diploid vertebrates such as mammals.
To determine the differences in the transcriptome complexity of seven different AS tissues and organs, the 13,421 AS and 2,576 RT cDNA features were hybridized with labeled targets from midgut, brain, spleen, muscle, ovary, kidney and testis (Table 2). The average percentage binding of midgut, brain, spleen, muscle, ovary, kidney and testis targets to the 16K chip was 64.4%, 54.7%, 54.6%, 52.8%, 51.0%, 49.7% and 30.2%, respectively. In general, about 45% of the salmonid microarray features were not bound by targets from the various AS tissues and organs.

Application of a 16K cDNA microarray to the same tissue from cohorts
To determine the amount of gene expression variability that exists between individuals of a single species, we compared the transcriptomes of livers from three fish with identical histories. We compared the average percent of variation (or scatter) in expression of liver transcripts between cohorts 1 and 2 (liverpairs 1/2), cohorts 1 and 3 (liverpairs 1/3) and cohorts 2 and 3 (liverpairs 2/3). Two separate experiments of six hybridizations each were conducted with each liverpairing having one dye-flip.
Examining each individual array in the intraspecies study showed that the overall mean scatter was 12.6% (Table 3). When the liverpair arrays and their respective dye-flips were combined and averaged, the overall mean scatter was reduced to 9.7%. This indicates that systematic unequal dye incorporation inflates the scatter values of individual arrays. This dye bias has been well documented by other researchers [23][24][25] and illustrates the importance of incorporating dye-swap pairs whenever possible when performing microarray hybridizations. The overall mean scatter was further reduced to 5.2% when the analysis included technical dye-swap replicates between respective liverpairs (Table 3). This demonstrates that increasing the number of technical replicates in a microarray experiment is an important factor to consider for reducing random scatter. It is encouraging that the overall scatter between individuals from the same broodstock was quite low. Thus, technical and biological variability across arrays and individuals can be significantly reduced by the investigator if an appropriate experimental design is employed.
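The reasoning behind combining an array with its dye-flip can be illustrated with a minimal sketch. It assumes a simple additive model in which each feature's log2(Cy5/Cy3) ratio is the biological signal plus a gene-specific dye bias; the function name and all values are hypothetical illustrations, not data from this study, and this is not the GeneSpring procedure used for the analysis.

```python
def dye_swap_average(forward_log2, flipped_log2):
    """Combine log2(Cy5/Cy3) ratios from an array and its dye-flip.

    On the flipped array the two samples trade dyes, so the biological
    signal changes sign while a gene-specific dye bias keeps its sign.
    Half the difference of the two ratios therefore cancels the bias
    (assumed additive model, see lead-in).
    """
    if len(forward_log2) != len(flipped_log2):
        raise ValueError("both arrays must cover the same features")
    return [(f - r) / 2 for f, r in zip(forward_log2, flipped_log2)]


# Invented illustrative values for four features.
true_signal = [0.0, 0.0, 1.0, -1.0]   # real log2 difference between livers
dye_bias = [0.3, -0.2, 0.25, 0.1]     # systematic dye effect per feature

forward = [t + b for t, b in zip(true_signal, dye_bias)]
flipped = [-t + b for t, b in zip(true_signal, dye_bias)]

combined = dye_swap_average(forward, flipped)
```

Under this model the combined values recover the biological signal exactly; with real data, random noise remains and is reduced further only by additional replicates.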

Application of a 16K cDNA microarray to analyze BAC contents
To assess the use of the 16K array as a screening tool to identify the genes present in a salmonid BAC, the 13,421 AS and 2,576 RT features were interrogated with nebulized and labeled fragments from a single BAC whose sequence has been determined (Table 4). Analysis of our initial BAC hybridizations revealed that a high proportion of transposon-like sequences and long and short interspersed nuclear elements were binding to the array. It is known that many different repeat elements derived from once-mobile transposable segments comprise large portions of the Atlantic salmon genome [26][27][28][29]. In an effort to improve the specificity of target binding to the microarray for BAC hybridization, we employed a Cot-1 DNA protocol to reduce the binding of these repetitive elements (Table 4). The addition of Cot-1 DNA increased the number of expected genes identified and the number of hits for the expected genes by displacing many of the repeat family and transposon associated elements.
Although Cot-1 DNA did improve the ability to identify genes for the BAC we examined, Cot-1 DNA alone is not enough to block the complications that arise from repetitive elements in whole genome hybridizations. In preliminary comparative genomic hybridization studies we have found that even with Cot-1 DNA included in the hybridizations, the repetitive DNA segments found in salmonid genomes interfere with the interpretation of the data. Most investigators are not interested in these repetitive segments, but rather in the genes that are interspersed between them. Moreover, we have found that often these repetitive elements lead to false positives. Using other methods, such as including repeat-element amplified products with Cot DNA, as well as higher stringency washes, might improve binding specificities. We are currently working on various strategies to maximize blocking of this repeat element 'noise'.

Conclusion
We validate and demonstrate the usefulness of the 16K microarray over a wide range of teleosts, even for transcriptome targets distantly removed from salmonids phylogenetically. We show the potential of the use of the microarray in a variety of experimental settings through hybridization studies that examine the binding of targets derived from different organs and tissues. Intraspecific variation in transcriptome expression is evaluated and discussed. Finally, BAC hybridizations are demonstrated as a rapid and accurate means to identify gene content. We expect that this array will serve as an important resource for genetic, physiological, ecological and many other fields of salmonid study.

Methods
Gene selection
cDNA library construction, recombinant plasmid preparation and extraction, sequencing, sequence analysis and contig assembly for the GRASP have been described previously in detail [11][12][13]. Selection criteria for unique Atlantic salmon (AS) and rainbow trout (RT) cDNAs for inclusion on the 16K microarray were as follows: ESTs (cDNA fragments) were assembled into contiguous sequences (contigs) by PHRAP [30] under stringent assembly parameters (minimum overlap score: 100; repeat stringency: 0.99). Contig consensus sequences and singleton sequences were aligned with non-redundant GenBank nucleotide and amino acid sequence databases using BLASTN and BLASTX, respectively [31,32]. Threshold for a significant BLAST hit was set at E = 1e-15.
A contig was required to contain at least one "usable" sequence, where a "usable" sequence (a) is a 3' sequence with high probability (containing a polyA signal, having been sequenced with an oligo-dT primer, or lying at the 3'-end of a contig whose orientation was determined by a strong hit against a protein in GenBank's non-redundant protein database), (b) is a sequence stretch of more than 400 bp, and (c) is at least 95% similar to the consensus of the contig.
If a contig was a singleton or singleton-equivalent (where all sequences were from the same plate or library and thus did not provide sufficient evidence for non-chimera status), its selection had to be reinforced either by (a) a significant BLAST hit, E<1e-15 (BLASTN or BLASTX), or (b) 94% (or more) identity with a homolog (either paralog or ortholog) covering at least 400 nucleotides. If the contig was a non-singleton, it had to (a) be one "block" (having no regions in the interior of the contig covered by only one sequence, to decrease the probability of chimeras), (b) be of high enough overall quality (with an overall score > 95% positions without conflicts, weighted by the number of sequences that support the consensus), and (c) have few leading and trailing singleton positions (no more than 25%), since such positions make it a de facto singleton.
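The selection rules described so far can be summarized as a small decision function. The following is a minimal sketch only: the `Read` and `Contig` records, their field names and the function names are hypothetical, fractions stand in for the percentages given in the text, and the sketch is not the pipeline actually used to select array features.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    is_three_prime: bool          # 3' with high probability
    length_bp: int
    identity_to_consensus: float  # fraction, e.g. 0.97


@dataclass
class Contig:
    reads: List[Read]
    singleton_equivalent: bool            # singleton, or one plate/library
    best_blast_evalue: Optional[float] = None
    homolog_identity: float = 0.0         # fraction identity to a homolog
    homolog_overlap_nt: int = 0
    single_block: bool = False            # no interior single-read region
    consensus_quality: float = 0.0        # fraction of conflict-free positions
    end_singleton_fraction: float = 1.0   # leading + trailing singleton part


def is_usable(read: Read) -> bool:
    """A 'usable' read: 3', more than 400 bp, >= 95% similar to consensus."""
    return (read.is_three_prime
            and read.length_bp > 400
            and read.identity_to_consensus >= 0.95)


def passes_selection(contig: Contig) -> bool:
    """Apply the chimera-avoidance criteria described in the text."""
    if not any(is_usable(r) for r in contig.reads):
        return False
    if contig.singleton_equivalent:
        blast_ok = (contig.best_blast_evalue is not None
                    and contig.best_blast_evalue < 1e-15)
        homolog_ok = (contig.homolog_identity >= 0.94
                      and contig.homolog_overlap_nt >= 400)
        return blast_ok or homolog_ok
    return (contig.single_block
            and contig.consensus_quality > 0.95
            and contig.end_singleton_fraction <= 0.25)


# Hypothetical examples.
good_read = Read(is_three_prime=True, length_bp=620, identity_to_consensus=0.98)
singleton_with_hit = Contig([good_read], singleton_equivalent=True,
                            best_blast_evalue=1e-40)
singleton_without_support = Contig([good_read], singleton_equivalent=True)
multi_read_contig = Contig([good_read, good_read], singleton_equivalent=False,
                           single_block=True, consensus_quality=0.99,
                           end_singleton_fraction=0.10)
```

Keeping the singleton and non-singleton branches separate mirrors the text: singletons need external support (a BLAST hit or a homolog), whereas multi-clone contigs are judged on their internal coverage and consistency.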
Approximately 3,500 additional sequences were selected with the following criteria: (a) no chosen contig could have 94% or more identity with another chosen contig, and (b) tentative consensus sequences (TC) identified by TIGR [33] could be included. By these criteria, approximately 1,000 clones were picked indiscriminately from both normalized AS and RT cDNA libraries, 800 clones were selected from suppression subtractive hybridization libraries and 700 sequences were added from requests of potential array users. Additionally, 949 non-overlapping sequences (856 AS, 93 RT) from clones included in the preliminary 3,557-gene chip (plus one T cell receptor beta) were selected. Finally, approximately 500 immune-specific genes were also chosen to bring the total number of genes represented on the chip to 16,006. In the 16,006 cDNA features there are 13,421 AS, 2,576 RT, 4 chinook salmon, 3 rainbow smelt and 2 LW representatives.
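As a quick consistency check, the per-species feature counts given above can be tallied against the stated array size. The dictionary and variable names are illustrative only.

```python
# Feature counts per source species, as stated in the text.
feature_counts = {
    "Atlantic salmon": 13421,
    "rainbow trout": 2576,
    "chinook salmon": 4,
    "rainbow smelt": 3,
    "lake whitefish": 2,
}

total_features = sum(feature_counts.values())

# Share of the array contributed by each species, in percent.
species_share = {
    species: 100.0 * count / total_features
    for species, count in feature_counts.items()
}
```

The counts sum to the 16,006 features that give the array its name.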

Gene identification
EST contigs were built using cDNAs on the array as reference and all ESTs currently in the GRASP database. Subsequent to microarray fabrication, the consensus sequences were screened for repeats using a custom salmonid repeat database with RepeatMasker. Masked consensus sequences were compared to GenBank databases. Using the stringent selection threshold above, the current percentage of the 16K features that are known and unknown genes is 55.8% and 44.2%, respectively. Analysis at less stringent thresholds is ongoing to identify all genes on the microarray.
[...] followed by 72°C for 7 min. Five µl of each PCR product were run on a 1% agarose gel to assess yield and quality. PCR products were robotically cleaned (Qiagen) and consolidated into 384-well plates, lyophilized by speed-Vac, and resuspended in 20 µl 3X SSC. Each purified PCR product concentration was determined and diluted to give a final concentration of 400 ng/µl.

Microarray fabrication
All cDNAs were printed as single spots on EZ Rays aminosilane slides (Matrix/Apogent Discoveries) with the Biorobotics Microgrid II microarray printer (Genomic Solutions). Microspot™ 10K quill pins (Biorobotics) in a 48-pin tool were used to deposit approximately 0.5 nl (0.2 ng cDNA) per spot onto the slide. The slides were crosslinked in a UV Stratalinker 2400 (Stratagene) at 300 mJ. The resulting microarrays have a 4-by-12 metagrid layout with 19 × 19 spot subgrids, each spot having an approximate diameter and pitch of 100 µm and 0.20 mm, respectively. A 280 bp GFP (green fluorescent protein) cDNA was amplified from a GFP clone (Clontech) using the primers (5'-GAAACATTCTTGGACACAAATTGG-3') and (5'-GCAGCTGTTACAAACTCAAGAAGG-3') and printed in each subgrid corner to assist in gridding.
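The printing figures quoted above are mutually consistent, as the following short calculation shows. It is a sketch for orientation only: the variable names are illustrative, the 400 ng/µl stock concentration is the one given for the purified PCR products, and the 24 Arabidopsis control spots are the number reported in the microarray analyses.

```python
# Array geometry from the text: 4-by-12 metagrid of 19 x 19 spot subgrids.
METAGRID_ROWS, METAGRID_COLS = 4, 12
SUBGRID_SIDE = 19

subgrids = METAGRID_ROWS * METAGRID_COLS            # one subgrid per pin of the 48-pin tool
spot_positions = subgrids * SUBGRID_SIDE ** 2       # total printable positions

# Mass deposited per spot: volume x concentration.
SPOT_VOLUME_NL = 0.5            # approximately 0.5 nl per spot
STOCK_CONC_NG_PER_UL = 400.0    # purified PCR product concentration
NL_PER_UL = 1000.0

mass_per_spot_ng = SPOT_VOLUME_NL * STOCK_CONC_NG_PER_UL / NL_PER_UL

# Positions not taken by salmonid features or Arabidopsis controls
# remain for GFP gridding spots, empty spots and any other controls.
SALMONID_FEATURES = 16006
ARABIDOPSIS_SPOTS = 24
remaining_positions = spot_positions - SALMONID_FEATURES - ARABIDOPSIS_SPOTS
```

The layout provides 17,328 positions in 48 subgrids, and 0.5 nl of a 400 ng/µl stock corresponds to the stated 0.2 ng of cDNA per spot.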
Six exogenous genome (Arabidopsis) cDNAs were amplified from the following clones kindly provided by The Arabidopsis [...]
Each institution that provided tissue raised and treated the fish in compliance with ethics committee or government body guidelines.

Tissue and RNA extraction
Fish were exsanguinated for several minutes. The tissues were removed and flash frozen in liquid nitrogen and stored at -80°C until RNA extraction. Flash frozen tissues were ground using baked (220°C, 5 h) mortars and pestles under liquid N2, then total RNA was extracted in TRIzol reagent (Invitrogen). RNAs obtained from these preparations were used for generating labeled targets for microarray hybridizations.

Microarray hybridizations
The microarray experiments were designed to comply with MIAME guidelines [34]. To minimize technical variability, all targets were synthesized in one round and each hybridization experiment was conducted simultaneously on slides from a single batch where possible. Each hybridization experiment included dye-flips to compensate for cyanine fluor effects. Total RNA samples were quantified and quality-checked by spectrophotometer and agarose gel, respectively.
All hybridization experiments were performed using the SuperScript Indirect cDNA Labeling System kit and instructions (Invitrogen). Briefly, 5.0 µg total RNA was reverse transcribed using an anchored oligo d(T)20 primer in cDNA synthesis reactions that incorporated aminoallyl- and aminohexyl-modified nucleotides. The modified cDNAs were then labeled with fluorescent Cy5 or Cy3 dye in reactions with the amino-functional groups in coupling buffer.
[...], and then 2 × 5 min in (2X SSC, 0.1% SDS), 2 × 5 min in 1X SSC and 2 × 5 min in 0.1X SSC at room temperature, then dried by centrifugation.

Microarray analyses
Fluorescent images of hybridized arrays were acquired immediately at 10 µm resolution using ScanArray Express (PerkinElmer). The Cy3 and Cy5 cyanine fluors were excited at 543 nm and 633 nm, respectively, at the same laser power (90%), with photomultiplier tube settings adjusted between slides to balance the Cy5 and Cy3 channels. Fluorescence intensity data were extracted from TIFF images using Imagene 5.5 software (Biodiscovery). Quality statistics were compiled in Excel from raw Imagene fluorescence intensity report files. Features were sorted (16,006 salmonid spots each representing different cDNAs; 24 Arabidopsis spots representing 6 different cDNAs) and median signal values and mean numbers of salmonid features passing threshold were determined for Cy3 and Cy5 data separately.
For cross-species and tissue-on-tissue experiments, the hybridization performance of labeled targets to salmonid features was assessed as a percentage of features bound from the numbers of AS and RT features passing a hybridization signal threshold, defined as two standard deviations above Arabidopsis signal mean. No transformations or normalizations were performed on these data. Only features deemed present by Imagene 5.6.1 (excluding marginal and absent values) were used for analyses. We also analyzed some of these data at two standard deviations above empty spot mean signal intensity and found that this was a less stringent method of thresholding (data not shown).
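The thresholding rule described above (a feature counts as bound when its signal exceeds the Arabidopsis signal mean plus two standard deviations) can be sketched as follows. The function names and example intensities are hypothetical, the use of the sample standard deviation is an assumption because the text does not specify which estimator was used, and the input is assumed to contain only features already flagged as present.

```python
from statistics import mean, stdev


def hybridization_threshold(control_signals, n_sd=2.0):
    """Signal threshold: control mean plus n_sd standard deviations.

    Uses the sample standard deviation (an assumption; see lead-in).
    """
    return mean(control_signals) + n_sd * stdev(control_signals)


def percent_features_bound(feature_signals, control_signals, n_sd=2.0):
    """Percentage of features whose raw signal exceeds the threshold.

    No transformation or normalization is applied, matching the text.
    """
    if not feature_signals:
        raise ValueError("no feature signals supplied")
    threshold = hybridization_threshold(control_signals, n_sd)
    bound = sum(1 for signal in feature_signals if signal > threshold)
    return 100.0 * bound / len(feature_signals)


# Invented intensities for illustration (not data from this study).
arabidopsis_controls = [100.0, 120.0, 80.0, 100.0]
salmonid_features = [50.0, 130.0, 140.0, 500.0]

threshold = hybridization_threshold(arabidopsis_controls)
percent_bound = percent_features_bound(salmonid_features, arabidopsis_controls)
```

In this toy example the threshold is about 132.7, so two of the four features pass and the percentage bound is 50%. Substituting empty-spot intensities for the Arabidopsis controls gives the alternative thresholding mentioned in the text.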
Intraspecific liver and BAC hybridization data analyses (background correction, Lowess normalization, and fold change gene list formation) were performed in GeneSpring 6.1 (Silicon Genetics). All scanned microarray TIFF images, extracted ImaGene grid files, the gene identification file and ImaGene quantified data files are available on-line as supplemental data [35]. The data are deposited in NCBI's GEO repository under platform GPL2716 [36].