Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library
© Sánchez et al. 2009
Received: 28 July 2009
Accepted: 25 November 2009
Published: 25 November 2009
Skip to main content
© Sánchez et al. 2009
Received: 28 July 2009
Accepted: 25 November 2009
Published: 25 November 2009
To enhance capabilities for genomic analyses in rainbow trout, such as genomic selection, a large suite of polymorphic markers that are amenable to high-throughput genotyping protocols must be identified. Expressed Sequence Tags (ESTs) have been used for single nucleotide polymorphism (SNP) discovery in salmonids. In those strategies, the salmonid semi-tetraploid genomes often led to assemblies of paralogous sequences and therefore resulted in a high rate of false positive SNP identification. Sequencing genomic DNA using primers identified from ESTs proved to be an effective but time consuming methodology of SNP identification in rainbow trout, therefore not suitable for high throughput SNP discovery. In this study, we employed a high-throughput strategy that used pyrosequencing technology to generate data from a reduced representation library constructed with genomic DNA pooled from 96 unrelated rainbow trout that represent the National Center for Cool and Cold Water Aquaculture (NCCCWA) broodstock population.
The reduced representation library consisted of 440 bp fragments resulting from complete digestion with the restriction enzyme HaeIII; sequencing produced 2,000,000 reads providing an average 6 fold coverage of the estimated 150,000 unique genomic restriction fragments (300,000 fragment ends). Three independent data analyses identified 22,022 to 47,128 putative SNPs on 13,140 to 24,627 independent contigs. A set of 384 putative SNPs, randomly selected from the sets produced by the three analyses were genotyped on individual fish to determine the validation rate of putative SNPs among analyses, distinguish apparent SNPs that actually represent paralogous loci in the tetraploid genome, examine Mendelian segregation, and place the validated SNPs on the rainbow trout linkage map. Approximately 48% (183) of the putative SNPs were validated; 167 markers were successfully incorporated into the rainbow trout linkage map. In addition, 2% of the sequences from the validated markers were associated with rainbow trout transcripts.
The use of reduced representation libraries and pyrosequencing technology proved to be an effective strategy for the discovery of a high number of putative SNPs in rainbow trout; however, modifications to the technique to decrease the false discovery rate resulting from the evolutionary recent genome duplication would be desirable.
Single Nucleotide Polymorphisms (SNPs) are highly abundant markers which are evenly distributed throughout the genome and can be functionally relevant. They are suitable markers for fine mapping of genes and candidate gene association studies aimed at identifying alleles potentially affecting important traits. Technologies that enable simultaneous analysis of thousands of SNPs have permitted genome-wide association studies for complex traits in humans , chicken , cattle [4–6] and sheep . Additionally, reduced representation libraries and pyrosequencing technologies have facilitated the high throughput discovery of SNPs [8–12]. Developing a large set of SNP markers for genome analyses in rainbow trout will facilitate fine mapping of QTL and will improve the identification and exploitation of genes affecting important traits and enable selective breeding through genomic selection.
Since 2002, the National Center for Cool and Cold Water Aquaculture (NCCCWA) has selectively bred rainbow trout broodstock for improvement of aquaculture production traits. The primary objectives have been to reduce the negative impacts of specifc bacterial pathogens on rainbow trout culture and to improve growth performance [13–15]. Currently, additional economically important traits such as stress tolerance [16, 17] and meat/fillet quality are being evaluated. Molecular genetic technologies have the potential of increasing the rate of genetic gain of traditional selective breeding schemes . Most production traits are controlled by multiple genes and inherited as quantitative traits. Analysis of quantitative trait loci (QTL) to identify loci for use in marker assisted selection (MAS) strategies can be used to optimize genetic improvement of desired traits . To this end the NCCCWA is developing molecular resources for rainbow trout, including cDNA and BAC (Bacteria Artificial Chromosome) libraries, microsatellite markers and a linkage map [19–21]. The map was based on a panel of five families that represent the starting genetic material of the selective breeding program. At present, the linkage map consists of 1,124 microsatellite loci falling into 29 linkage groups, with an average sex distance of 2,927 cM and an average resolution of 2.6 cM. A recent linkage disequilibrium (LD) study by Rexroad et al.  found significant LD blocks at distances of 2 cM or less, implying that, to be used in QTL mapping, there is a need of a larger number of markers to be included in the map.
Expressed Sequence Tags (ESTs) have been used for SNP discovery in several non-model organisms, including salmonids [23–27]. However, when using EST data for SNP identification, the salmonid pseudo-tetraploid genome  can lead to assemblies of paralogous sequences and false positive SNP identification [24, 26, 27]. Initially, we attempted an SNP discovery approach based on alignments of ESTs. The Genbank UniGene Build #22 http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene presents over 200,000 Rainbow trout EST sequences. We first used custom bioinformatic pipelines to assemble the available ESTs and discover SNPs in rainbow trout. The occurrence of the genome duplication led to the assembly of paralogous sequences and resulted in a high number of false positive putative SNPs. Next, we sequenced genomic DNA based on EST sequences which proved to be a more efficient method for SNP discovery in rainbow trout, genomic DNA sequencing would allow the representation of the entire genome, and not just the transcriptomes, as that is the case with ESTs. However, this procedure was also extremely time consuming and not suitable for high throughput SNP discovery and it became clear that an alternative approach would be necessary to achieve the goal of a dense set of markers spanning the entire rainbow trout genome. In this study, we present the use of a high-throughput methodology to discover SNPs in rainbow trout based on pyrosequencing technology to generate data from a reduced representation library (RRL) constructed with genomic DNA pooled from 96 unrelated rainbow trout individuals, and evaluate the efficiency of this approach.
After digesting the pooled DNA from 96 unrelated individuals with HaeIII, the 440 bp fragments were selected for library construction (3.2 μg of fragment DNA was recovered from the preparative gel). An aliquot of the recovered fragments was cloned and Sanger sequenced before proceeding with the library construction protocols: 89% of 1,269 sequenced fragments were unique and 33% of the analyzed bases masked to the cGRASP salmon repeat masker database http://lucy.ceh.uvic.ca/repeatmasker/cbr_repeatmasker.py.
Sequence data analysis results
No. of contigs
No. of loci
No. of contigs (SNPs)
In order to discover SNPs present only in non-repetitive regions, the set of masked reads were mapped onto the two assemblies, with 75% mapping to the Newbler assembly and 63% mapping to the Velvet assembly. Both assemblies had similar distributions of coverage depth: ~40% less than 4×, ~10% greater than 10× and ~50% 4-10×. The average read depth at SNPs was skewed to the high end, ~9×. Despite the similarities, the two methods discussed thus far produced widely different predictions for the number of SNPs: 187,916 for Newbler and 52,942 for Velvet. The distribution of the SNPs across the contigs reveals that the vast majority of SNPs in the Newbler set came from contigs with 5 or more predicted SNPs. Filtering out such SNPs from both sets leaves 47,128 for Newbler and 33.624 for Velvet. Previous experience with SNP discovery in swine using a reduced representation library showed a much more reasonable distribution of putative SNPs across the assembled contigs . One of the two attempts to lower the high rate of false positives was to use Velvet to create a different assembly. The other was to add an additional layering of filtering at the mapping stage (method N2 as described in the Methods section). N2 predicted 23,923 SNPs before filtering out the contigs with 5 or more SNPs and 22,022 after applying the filter. Results of the three analyses are summarized in Table 1.
A comparison of the SNPs from the three methods was done on a contig by contig basis, allowing for imperfect matches between velvet contigs and newbler contigs, as well as the occasionally varying numbers of predicted SNPs for the contigs. While compiling the lists of testable SNPs, some moderate filtering was performed to ensure sufficient quality and quantity of flanking sequence. As stated above, the two assemblies were more alike than different, yet the largest subgroup of SNPs is V1, indicating that the contigs unique to the Velvet assembly contain either a rich source of false positives or the best source of true SNPs. The next largest subgroup is N1, despite the fact that most of its potential SNPs were filtered out due to their presence in contigs with 5+ SNPs. The false positives seem to be infiltrating the unfiltered contigs as well. The third largest subgroup contains the SNPs predicted by all three methods, giving some assurance that the three approaches are similar enough to warrant comparison of validation rates as a means to incrementally improve the computational methods for SNP discovery. Even though the reads mapped in N2 were a subset (48%) of those mapped in N1, N2 contained SNPs not predicted in N1. These were from contigs that had 5 or more SNPs predicted by N1 but fewer than 5 predicted in N2. SNPs from each of the subgroups were tested and the validation results are discussed in the next section.
Number of genotyped and validated markers from the three data analysis approaches.
Call frequencies; minor allele frequencies and expected/observed heterozygosis of the validated SNPs are reported in Additional file 1. Call frequencies were higher than 90% for all markers and minor allele frequencies ranged from 1.3% to 49.7%. The double haploid samples were of great assistance for the validation process. Both Whale Rock samples were monomorphic for all validated markers; Skookumchuck, Hot Creek and Swanson were polymorphic for only one marker (OMS00033, OMS00163 and OMS00108, respectively) and Klamath for three markers (OMS00018, OMS00015 and OMS00108). The WGA DNA sample was successfully genotyped, 100% genotypes being equal to the original sample. Using SNaPshot, four SNPs were cross-validated. Genotyping results of the double-haploid fish were equal in both methods and 96% of the genotypes from the discovery samples were also equivalent.
Validated SNPs annotation.
AAI16452-B-cell CLL/lymphoma 9 [Homo sapiens]
CAG02508-unnamed protein product [Tetraodon nigroviridis]
ACI66580-Lipocalin precursor [Salmo salar]
Validated markers were also genotyped using the NCCCWA mapping families and 167 markers were polymorphic and mapped to the rainbow trout second generation map. In Additional file 2, the position of the SNP markers in the linkage map is indicated. The positions are based on the two-point distance of the SNP markers to the closest linked marker in the map with the highest confidence in map location. The newly developed markers were distributed along all linkage groups of the rainbow trout map; there are at least two markers in each linkage group.
In our first attempts to find SNPs in rainbow trout we evaluated multiple bioinformatic pipelines for their ability to detect SNPs from EST data (unpublished). The occurrence of the genome duplication resulted in many assemblies of paralogous sequences, which led to the identification of a large proportion of false positives (data not shown). In addition, a high number of the putative markers were not validated due to PCR amplification problems caused by incorrect assemblies of sequences; presence of introns in the amplicons and unspecific binding of primers (multiple loci). Similar results were observed in other salmonid species [23, 26, 27] and were attributed to the amplification of both paralogs of the same loci. A customized SNP discovery process using only 3' ESTs was then developed, which used stringent assembly parameters (98% identity; at least 50 bp overlap) were set to overcome issues associated with the genome duplication. 3' EST sequences were selected to avoid intron/exon boundaries; increase primer binding specificity and to raise the probability of finding SNPs in UTRs. However, the results were not improved and no markers were validated. The next approach was to sequence genomic DNA based on ESTs sequences, which proved to be an effective method of SNP identification in rainbow trout. Twenty putative SNPs were tested and 12 SNPs representing 10 different genes were validated on the double haploid samples; 5 of the validated SNPs were successfully genotyped on the 96 NCCCWA broodstock unrelated individuals (data not shown). However, this methodology is extremely time consuming and not suitable for high throughput SNP discovery.
Using a high throughput technology, over 20,000 putative SNPs were identified, 384 were tested and 183 validated; 90% of the validated markers were placed and widely distributed in the rainbow trout linkage map. A BLAST analysis was performed to identify coding regions in validated SNP sequences and 2% of the validated markers were associated with relevant genes. SNPs found within or near a coding sequence, are of particular interest because they are more likely to alter the biological function of a protein. This class of SNPs is especially important for species without a genome sequence such as aquaculture species. Gene-associated SNPs serve as suitable markers for mapping in comparative genome studies and in MAS of economically important traits [29, 30].
Recent reports of using highly parallel sequencing methodologies in combination with reduced representation libraries (RRL) for SNP discovery in cattle and swine reported validation rates above 92% [9, 12]. These species do not have the complication of recent whole genome duplication that is a prominent feature of the salmonid genome, which resulted in the dramatically reduced validation rate of approximately 48% reported here. Nevertheless, the RRL approach proved much more efficient than previous approaches used for trout SNP detection, providing tens of thousands of putative SNPs at a fraction of the cost of generating them by PCR and Sanger sequencing and in a much shorter time. Future work will attempt to revise the method to more efficiently avoid the class of putative SNPs resulting from comparisons between paralogous loci, by sequencing RRL produced from a double haploid strain in parallel to the pool of discovery, diploid animals. Additional gains may be made by refinement of filtering strategies by incorporating flanking sequence data more effectively, especially as developments in the pyrosequencing platform increase read lengths later this year permitting creation of RRL with longer fragments and increasing the ability to discriminate paralogs. We anticipate that future studies using RRL and pyrosequencing will approach the 90% success of other organisms, although we recognize that the tetraploid genome of rainbow trout will necessitate a much larger ratio of DNA sequenced to valid SNPs discovered than in species with less complex genomes.
The use of reduced representation libraries and pyrosequencing technology proved to be an effective strategy for the discovery of a high number of SNPs in rainbow trout. Over 20,000 putative SNPs were discovered and 384 were tested, resulting in a 48% validation rate. A hundred and eighty three markers were validated and 167 (43.5%) of those markers were polymorphic and placed in the rainbow trout linkage map. According to the validation results, it could be implied that at least 10,000 putative SNPs from this data set would be validated and useful in future genomic studies. However, it is still necessary to dramatically decrease the false discovery rate.
Five different restriction enzymes (AluI; BstUI; HaeIII; HpyCH4V and RsaI) having four-base recognition target sequences and producing fragments with blunt ends were tested for the library construction. Digestions of 1 mg of DNA from a single fish were performed for each enzyme as suggested by the manufacturer (New England Bio Labs). The digested DNA was separated on 3 mm thick 5% nondenaturing polyacrylamide gels. DNA fragments between 400 and 500 bp were excised from the gels (slices were cut every 2 mm or approximately 20 bp), fragments were eluted using a modified  "crush and soak" procedure and DNA was purified by alcohol precipitation. For trial Sanger sequencing to identify the best enzyme for reduced representation library construction, the fragments were ligated into the pSTBlue-1 vector of the Perfectly Blunt Giga Cloning kit (Novagen). DNA sequencing was carried out with an ABI Prism 3100 Genetic Analyzer. The ability to completely digest the trout genomic DNA and the number of DNA repetitive elements present in the selected fragment size range were evaluated for each enzyme.
Approximately equal amounts of DNA from 96 unrelated individuals from the NCCCWA 2005 and 2006 brood years  was pooled prior digestion with HaeIII. Complete digestions of 1 mg of DNA from the pool (3 U enzyme/μg) were performed overnight at 37°C. DNA fragments were separated and eluted from the gel following the procedures previously described. Excised fragments approximately 440 bp in length were selected to construct the library; 600 ng of DNA obtained from the preparative gel was used for construction of the library and ligated to the sequencing adapters provided in the GS FLX library construction kit (Roche Applied Science). From this point, the library was prepared and sequenced on the GS FLX as per the manufacturer's protocols.
Three independent SNP discovery approaches were used. The first two, which were called N1 and N2, were initiated by creating a reference sequence set by using the Newbler algorithm (version 1.1.03, provided with the GS FLX sequencer) to assemble all the unmasked sequence data into contigs. The individual reads were then screened for known repetitive elements and repeat masked using the cGRASP RepeatMasker database http://lucy.ceh.uvic.ca/repeatmasker/cbr_repeatmasker.py and sequences with at least 50 consecutive non-repetitive bases were mapped onto the assembly using ssaha2. N2 differs from N1 in that the mappings for N2 were filtered so that only those with at least 90% sequence identity over at least 80% of the full length of the read were mapped onto the reference assembly. This step was done to decrease the likelihood of reads being mapped to duplicated regions of the genome. The third approach (V1) used the Velvet algorithm  for the assembly; in this case, all the sequence data was repeat-masked first, and the reads that contained at least 50 consecutive non-repetitive bases were used as input data. This same set of masked and filtered reads were then mapped back onto the Velvet assembly using ssaha2. The next step in all three approaches was to use ssaha_pileup  to detect sequence variation among the mapped reads. To be included as putative SNPs, both the major and the minor allele had to be detected at least twice each, without a third allele being observed. All SNPs from contigs that presented more than 5 putative SNPs (1 putative SNP/100 bp) were discarded, as a compromise balancing number of SNPs discovered and false discovery rate, since contigs with multiple apparent SNPs are more likely to represent paralogous loci (discarding loci with only two putative SNPs greatly decreases number of apparent SNPs but likely increases validation rate; allowing higher numbers adds additional SNPs but also increases false discover rate). Additionally, flanking sequence of at least 50 bp in one side (with a limit of 10% unknown bases) and 10 bp on the other side was required, in order to provide sufficient flanking sequence data to support assay design in the Illumina system to be used for genotyping.
Illumina GoldenGate assays  were used to evaluate a fraction of the discovered SNPs. A set of 384 putative SNPs was randomly chosen (from among those with the highest scoring flanking sequences) to represent each one of the three sequence analysis methods (N1: 47; N2: 46 and V1: 100) and putative SNPs found in more than one analysis were also tested (N1 and N2: 47; V1 and N1: 47; V1 and N2: 47; V1, N1 and N2: 50).
Genomic DNA from 85 of the discovery panel fish and 98 samples from five NCCCWA mapping families  were used for the validation assays. In addition, seven double haploid samples from the Arlee, Hot creek, Klamath, Whale rock female, Whale rock male, Skookumchuck and Swanson clonal lines [35, 36] were genotyped and used as a control to eliminate the paralogous sequences that can erroneously be identified as SNPs. In some genotyping projects, the amount of DNA available for the assays is limited, and it is necessary to use Whole-Genome Amplified (WGA) DNA, as has already been used in GoldenGate assays . A protocol based on the method described by  was used to prepare one WGA sample, which was added to the SNP validation panel, its respective template was also genotyped for comparison.
The BeadStudio software version 3.2 (Illumina) was used to analyze the Goldengate assays SNP data. Genotyping data quality was assessed following the guidelines suggested by Illumina . The no-call threshold used was 0.25 and all the samples used for validation had at least a 0.85 Call rate score. A cross-validation assay was done using the ABI SNapshot method; four SNPs were randomly selected and the 96 discovery samples and the seven double-haploid individuals were genotyped.
A local BLAST search using the sequences of the validated SNPs against the rainbow trout transcriptome data from TIGR (BioEdit software at default settings http://www.mbio.ncsu.edu/BioEdit/BioEdit.html) was performed to identify coding regions.
Segregating data from the NCCCWA mapping families was used to anchor the polymorphic validated markers into the second generation rainbow trout linkage map . Genotype data combined for both sexes were formatted using MAKEPED of the LINKAGE  program and checked for inconsistencies with Mendelian inheritance using PEDCHECK . RECODE  and LNKTOCRI  were used to assemble the data into CRIMAP  format. SNP genotype data were added to that of Rexroad et al  and MULTIMAP  was used to conduct two-point linkage analyses to identify the closest markers from the published map which were ordered at the highest level of significance.
The authors would like to acknowledge Roseanna Long and Kristy Shewbridge for DNA preparation; Dr. Gary Thorgaard for providing the double haploid samples; Renee Godtel and Kevin Tenill for assistance in library construction; Bob Lee for 454 sequencing; Dr. Tad Sonstegard and Alicia Bertles for assistance with GoldenGate genotyping and Dr. Guangtu Gao for helping with the linkage analysis.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.