Automated SNP detection from a large collection of white spruce expressed sequences: contributing factors and approaches for the categorization of SNPs
© Pavy et al; licensee BioMed Central Ltd. 2006
Received: 01 April 2006
Accepted: 06 July 2006
Published: 06 July 2006
High-throughput genotyping technologies represent a highly efficient way to accelerate genetic mapping and enable association studies. As a first step toward this goal, we aimed to develop a resource of candidate Single Nucleotide Polymorphisms (SNP) in white spruce (Picea glauca [Moench] Voss), a softwood tree of major economic importance.
A white spruce SNP resource encompassing 12,264 SNPs was constructed from a set of 6,459 contigs derived from Expressed Sequence Tags (EST) and by using the bayesian-based statistical software PolyBayes. Several parameters influencing the SNP prediction were analysed including the a priori expected polymorphism, the probability score (PSNP), and the contig depth and length. SNP detection in 3' and 5' reads from the same clones revealed a level of inconsistency between overlapping sequences as low as 1%. A subset of 245 predicted SNPs were verified through the independent resequencing of genomic DNA of a genotype also used to prepare cDNA libraries. The validation rate reached a maximum of 85% for SNPs predicted with either PSNP ≥ 0.95 or ≥ 0.99. A total of 9,310 SNPs were detected by using PSNP ≥ 0.95 as a criterion. The SNPs were distributed among 3,590 contigs encompassing an array of broad functional categories, with an overall frequency of 1 SNP per 700 nucleotide sites. Experimental and statistical approaches were used to evaluate the proportion of paralogous SNPs, with estimates in the range of 8 to 12%. The 3,789 coding SNPs identified through coding region annotation and ORF prediction, were distributed into 39% nonsynonymous and 61% synonymous substitutions. Overall, there were 0.9 SNP per 1,000 nonsynonymous sites and 5.2 SNPs per 1,000 synonymous sites, for a genome-wide nonsynonymous to synonymous substitution rate ratio (Ka/Ks) of 0.17.
We integrated the SNP data in the ForestTreeDB database along with functional annotations to provide a tool facilitating the choice of candidate genes for mapping purposes or association studies.
Large-scale sequencing projects offer a possibility for low cost SNP discovery since sequence variants can be found computationally by analysing the redundancy in sequence databases. For example, this approach has facilitated SNP-based genetic mapping in human [1, 2], Arabidopsis [3–5], and rice [6, 7]. Following the development of pipelines to handle data derived from sequencing platforms, computational tools have been developed to predict SNPs in overlapping fragments of genomic sequences or in expressed sequences such as ESTs [8, 9].
The first-generation SNP detection software was derived from tools used to help processing sequence data. After the clustering and alignment of ESTs with Phrap, false positives were avoided by applying several filters to remove low quality sequences . Such approaches were applied to public EST databases to identify candidate SNPs, which were verified experimentally in human populations . A major improvement of this approach was the detection and filtering of putative paralogous sequences misassembled into the same cluster. Such filtering was conducted using phylogenetic analysis  or sequence similarity searches . Indeed, one difficulty in searching for intraspecific orthologous SNPs in sequence databases is to distinguish true polymorphisms from artifacts due to transcription/sequencing errors or misassembly of paralogous sequences. Without sequence trace data, a confidence score can be deduced from the redundancy at the SNP locus and from its co-segregation with other surrounding SNPs in the alignment [9, 13]. Other methods use sequence quality information derived from the raw trace data. Bayesian statistics were applied to incorporate background information into the specification of a tested model for data analysis . They were implemented in the software PolyBayes to determine a confidence score for each SNP detected in a cluster of ESTs . PolyBayes uses a priori information about the average pairwise difference between paralogous sequences to calculate a posteriori, the probability that a sequence is native by comparison to a reference sequence from the EST cluster. Thus, sequences classified as paralogous in one EST cluster are excluded from the SNP detection procedure. Then, based on the alignment of the ESTs within the cluster, another Bayesian calculation generates the probability that a variant at a given location of a multiple alignment represents a true polymorphism as opposed to a sequencing error. This calculation takes into account the alignment depth, the base calls in each of the sequences, the associated base quality values, the base composition in the region, and the expected a priori rate of polymorphism of the species under investigation. This approach was shown to be adequate for SNP prediction in human , sugarcane , soybean , and pine .
The development of SNP markers in plant species will contribute to the understanding of crop evolutionary history, to the positional cloning of QTLs, and to marker-assisted breeding (reviewed in ). Especially in conifers, candidate gene approaches are favored to dissect complex traits since linkage desequilibrium is low or declines rapidly within the length of an average-sized gene [20, 21]. In these species, the development of high density gene maps based on intraspecific SNPs should result in better marker-aided selection systems by enabling the efficient co-localization of genes and quantitative/qualitative trait loci to further facilitate association studies. These high-density gene maps will also be helpful to delineate homologous chromosomal regions among taxa and to study how gene regions are structured and have evolved (e.g. ). To accelerate mapping projects in conifers, we plan to use high-throughput genotyping technologies. A first step toward this goal is the development of a resource of candidate SNPs. Such resources can be obtained computationally since several EST collections are being developed in pines [23–26] and in spruces [27, 28].
As part of our research integrating functional and structural genomics analyses in white spruce (Picea glauca) , we have generated diverse cDNA libraries from 12 genotypes and carried large-scale EST sequencing . Here, we present the in silico detection of SNPs in the EST alignments using the PolyBayes software and describe the factors that affect the automated SNP detection. The quality of the in silico approach was further assessed by the independent sequencing of a subset of genomic DNA fragments. We demonstrate how this analysis was used to define a set of highly reliable predicted SNPs according to their probability scores. Contigs may also contain paralogous sequences which may cause the detection of non allelic SNPs. Parameters for limiting this bias can be set in the prediction software PolyBayes, but methods for a posteriori assessment of this source of variation remain to be defined. We took advantage of our SNP dataset to evaluate the proportion of paralogous SNPs in our white spruce gene index using statistical and experimental approaches. Two general approaches were also applied to delineate translated regions and distinguish synonymous from nonsynonymous SNPs. To facilitate the choice of candidate genes to be mapped, this white spruce SNP resource was integrated into a publicly available database ForestTreeDB, which also includes functional annotations of the gene index.
Results and discussion
We estimated SNP diversity and distribution parameters in 6,459 contigs each derived from sequences of at least two cDNA clones with the PolyBayes software. Predicted SNP analyses have frequently been conducted based on a number of aligned sequence reads. We have conducted large-scale EST sequencing with both 3' and 5' reads from the same cDNA clones and obtained a large number of cDNAs with of overlapping reads in the same contig . We thus considered the contigs for inclusion in our analysis based upon the number of clones represented rather than the number of reads.
Experimental error rate within EST assemblies
We exploited the fact that sequence reads in opposite directions from the same clone overlapped in 4,395 contigs, in order to verify whether SNPs were detected in these redundant sequences. Indeed, differences detected within same-clone overlaps would not represent real SNPs but experimental errors from reverse transcription or sequencing. The total available alignment length of these overlapping sequences was 4,755,154 nucleotides. In total, there were only 288 positions (1 per 16,511 nucleotide sites) where reads from the same clone conflicted, and 74% of these positions were uncalled or ambiguous bases. Among the remaining 75 putative false substitutions, we observed a few cases where the discrepancy in one clone was repeated in another clone at the same site in the contig. After removing these repeated discrepancies, we detected 69 sites that conflicted between overlapping reads from the same clone, out of the overall 7,555 SNPs detected with probability score PSNP ≥ 0.10. The overlapping sequences represented an error rate or false prediction rate of less than 1%. The low level of false prediction indicates that the sequences were generated with a high accuracy and that the initial filtering using a Phred score of 20 appears to be sufficiently stringent to eliminate the majority of experimental errors.
In silico detected SNPs according to the expected SNP frequency and PolyBayes probability scores
In this report, we used a p-prior of 0.01 corresponding to a mean frequency of one SNP per 100 nucleotides, as in a previous study on Pinus pinaster . A very similar SNP frequency was experimentally determined at 0.012 in Picea glauca, based upon a total of 9,253 nucleotide sites sampled for eight nuclear genes (KN1, KN2, KN3, KN4, HB-3, NAM, rpl13A, rpl15) [21, 31]. It is also in the range of that observed for Pinus taeda (0.016)  and for Pinus pinaster (0.0097) . With p-prior = 0.01, we detected, 9,310 SNPs among 3,590 contigs (PSNP ≥ 0.95; Figure 1), resulting in a polymorphism rate of 1 SNP per 700 nucleotide sites. In Pinus pinaster, LeDantec et al. (2004) reported a SNP rate of 1 per 660 nucleotide sites by using p-prior = 0.01 and PSNP ≥ 0.99, and analysing contigs containing 4 reads or more . The rate that we estimated in P. glauca became 1 SNP per 689 nucleotide sites, somewhat closer to that reported for P. pinaster, when using the same probability levels and considering only contigs of 4 clones or more. Our rate rate further increased to 1 SNP per 492 nucleotide sites when considering SNPs with PSNP ≥ 0.95 (p-prior of 0.01, contigs with 4 clones). Our analyses also showed that 55.6% to 61.5% the contigs included at least one SNP predicted when varying PSNP from ≥ 0.95 to ≥ 0.10 (with a constant p-prior = 0.01: Figure 1).
Experimental estimation of the rate of false positives according to PSNP
The collection of in silico detected SNPs was built for genotyping purposes. To extract the most reliable SNPs, we could have chosen the ones detected with the highest scores (PSNP ≥ 0.99). Indeed, several SNP resources were developed by using PolyBayes to detect SNPs and only SNP detected with score PSNP ≥ 0.99 were considered [17, 18]. However, the stringent cutoff of PSNP ≥ 0.99 would eliminate 44.6% of the in silico detected SNPs from our data set.(Figure 2). Useche et al. (2001)  used PolyBayes to detect SNPs in maize ESTs. They determined that the PSNP score as an important parameter and chose to display all the SNPs in their database, irrespective of their PSNP scores, to avoid discarding too many SNPs. However, the validity of the various polymorphisms was not assessed experimentally. We felt it was important to strike a balance between the cost of genotyping and the desire to achieve a high SNP discovery rated, thus, we experimentally tested the in silico detected SNPs for various levels of PSNP scores.
For the purpose of validation, fragments of spruce genomic DNA were amplified and sequenced from the same source material as the previously conducted EST sequencing . We carried out the analyses using a subset of the dataset, corresponding to the sequences obtained from the single genotype PG653. Working with a single genotype reduced the number of sequencing reactions required for this validation, since detected SNPs represented biallelic positions. Hence, putative SNP sites would correspond to double peaks in the sequence chromatograms derived from PG653 (Figure 2). A set of 156 contigs encompassing 325 SNPs detected in silico with a range of PSNP values were selected to amplify the genomic sequences. We were able to specifically amplify the single expected sequence for 112 contigs, encompassing 245 in silico detected SNPs. A total of 43,286 nucleotides corresponding to these 112 genes were sequenced and were visually inspected by aligning the trace files [see Additional file1] for details about the targeted contigs and SNPs). We obtained a true positive rate of 74%, represented by the in silico detected SNPs that were also detected in the genomic DNA sequence analysis (Figure 2). Nearly identical validation rates of 85.1% and 84.9% were observed when considering only the SNPs detected with higher stringency criteria of PSNP ≥ 0.95 or ≥ 0.99. Based on this observation, we considered SNPs detected with values of PSNP ≥ 0.95 for the remained of our study. However, the subset of SNPs detected with a PSNP ≥ 0.99 were used in specific analyses to enable comparisons with published reports [17, 18]. Similarly to our results, Le Dantec et al. (2004) reported a validation rate of 83% for SNPs detected with a PSNP ≥ 0.99 in pine . By using a cut-off of PSNP ≥ 0.95 rather than of PSNP ≥ 0.99, we increased the number of SNPs by 2,514 out of a total of 12,264 SNPs detected in silico using PolyBayes.
Such experimental validations of SNPs detected by PolyBayes or other computational prediction methods have been conducted only sparingly, and for a few other genomes. PolyBayes was used to predict SNPs in human ESTs; a subset of 36 SNPs detected with a PSNP score ≥ 0.40 was experimentally verified by screening four populations, leading to a confirmation rate of 56% . On a larger scale, 1,200 SNPs from dbSNP which were detected in silico in the human genome by data-mining procedures were experimentally analysed by a pooled DNA sequencing approach, showing that 80% of the detected SNPs were found polymorphic in three ethnic groups . In another validation study, the confirmation rate reached 88%, but the SNPs were detected in silico using several approaches and it was not possible to attribute a performance level to PolyBayes specifically .
We obtained an experimental validation rate 85% (137 out of 161) for SNPs detected in our white spruce contigs at PSNP ≥ 0.95. The validation rate increased to 92% when only considering contigs derived from 2 or 3 clones. The SNPs falsely predicted by PolyBayes represent either experimental errors in the ESTs or differences between paralogous sequences erroneously assembled into a single contig. The experimental error rate was previously shown to be very low, from the analysis of overlapping sequence reads in opposite directions from the same clone. On the other hand, the higher confirmation rate observed with contigs containing fewer clones would indicate that putative paralogs are indeed present in the contigs, especially in those comprised of several cDNA clones, in spite of the stringency of the parameters used in the contig assembly. When the complete genome is available to help delineate orthologs and paralogs, the confirmation rate of in silico detected SNPs tends to be higher than that observed in our study. For example, in a SNP detection project conducted in Arabidopsis, the confirmation rate was 96% for in silico detected SNPs . Analysing SNPs in a completely sequenced genome can be much more powerful because most of the paralogs may be excluded from the contigs used to detect SNPs. Our results confirm the need for experimental validation of in silico SNP detection in genomes that are incompletely sequenced, before the information is used for more advanced genetic analyses. In the following section, we examine two methods to estimate the proportion of paralogous SNPs among in silico detected SNPs and show that most of the SNPs falsely predicted by PolyBayes represent differences between paralogous sequences.
Estimating the proportion of paralogous SNPs
A number of contigs were likely to contain paralogous sequences, so we were interested in evaluating the proportion of paralogous versus orthologous SNPs in the dataset. We use the term 'paralogous SNPs' to refer to non allelic SNPs which likely result from the misassembly of paralogous, and 'orthologous SNPs' for allelic differences occurring between reads truly derived from the same gene. The paralogous SNPs are to a great extent undesirable for population or association genetics studies and in genetic mapping. To estimate the proportion of paralogous SNPs, we followed two approaches.
We first used the previous validation experiment of in silico detected SNPs where a set of 112 contigs was re-sequenced from genomic DNA of the single genotype PG653. The same set of contigs was re-sequenced from genomic DNA of a haploid white spruce megagametophyte, with 43,286 nucleotides determined and visually inspected by aligning the trace files. The presence of superimposed peaks in sequences from haploid DNA indicates the presence of 'paralogous SNPs' We found 19 SNPs (out 161) that corresponded to multiple peaks in the sequences obtained from haploid megagametophyte DNA, among the in silico detected SNPs used for validation, with genomic DNA of PG653 (see above), which represents a proportion of paralogous SNPs of 12%. If we only consider the SNPs detected in silico with PSNP ≥ 0.95, the proportion of paralogous SNPs decreases slightly to 10.5%.
The second method used is based on population genetics theory and relies entirely on estimates of in silico SNP frequencies, as described in Methods. It relies on a comparison of the frequency of SNPs in contig sequences derived only from the single genotype PG653, and the frequency of SNPs in the total population sampled. When paralogous sequences are intermixed with orthologous sequences, the proportion of segregating sites k is biased upward by a quantity corresponding to ε. ε is a constant independent of the number of genotypes analysed, while k will increase with the number of haploid genomes analysed, thus resulting in a reduction of ε relative to k. This principle can be used to estimate ε. For this purpose, only SNPs with PSNP ≥ 0.95 and contigs containing 10 clones or more were considered, in order to sample as many distinct haploid genomes as possible while sampling a reasonable number of contigs and SNPs. When sequences corresponding to contigs obtained from PG653 only were considered, 43 SNPs were detected from a cumulative length of 36,191 nucleotide sites, for a k value of 1/842 or 0.00119. When sequences from all genotypes were considered for the same contigs, 1,532 SNPs were detected from a cumulative length of 507,528 nucleotide sites, and k was higher with a value of 1/331 or 0.00302. Then, the estimated proportion of «paralogous SNPs» in the dataset, ε, was obtained by solving the set of equations (4) and (5) in Methods and for two sampling scenarios delineated therein. The ε values obtained were 0.000238 and 0.000517, respectively, corresponding to proportions of paralogous SNPs of 8% and 17% in the entire dataset. As explained in Methods, the second scenario, which assumes a sampling size of 24 haploid genomes, is less likely than the first scenario of sampling 10 distinct haploid genomes per contig, given that very few contigs would be characterized by a number of distinct clones approaching 24, the maximum number of haploid genomes corresponding to the 12 distinct genotypes used to obtain EST sequences. Thus, the value of 8% should be more realistic than the value of 17%.
These ε estimates agree well with the proportion of paralogous SNPs obtained from experimental validation using haploid megagametophyte DNA (rate of 10 to 12%). These estimates indicate that most of the false positives predicted by PolyBayes were not experimental errors, but that the criteria used to assemble the ESTs and run PolyBayes were stringent enough to avoid a wide occurrence of paralogous SNPs. To our knowledge, these are the first experimental and statistical estimates of proportions of paralogous SNPs in SNP datasets derived from EST contigs. The statistical procedure outlined above offers a simple way to estimate the proportion of paralogous SNPs for a given set of contig assembly parameters, as long as EST sequences from a single genotype can be traced back. However, this procedure may not be suitable when the reference genotype has been genetically manipulated.
SNP frequency according to contig depth and length
Frequency of detected SNPs with PSNPscore ≥ 0.95 in the contigs derived from all 12 genotypes used to obtain ESTs
Number of clones in the contig
Number of contigs
Cumulated length of the contigs
Cumulated length of the contigs excluding non redundant sites1
SNPs/redundant nucleotide site2
To evaluate SNP frequencies, two methods were used (Table 1). First, the number of in silico detected SNPs was divided by the cumulative length of the contigs, resulting in a number of SNPs per nucleotide site (PSNP ≥ 0.95). The SNP frequency ranged from one per 1,184 nucleotide to one per 361 nucleotide sites in contigs from two to 10 or more clones, respectively (Table 1). On average, one SNP was detected per 700 nucleotide sites. Second, we generated a less biased measure of the SNP frequency by computing the number of SNPs (PSNP ≥ 0.95) per redundant nucleotide site. A redundant site is one where information for more than one sequence is available in the contig. We thus developed a Perl script to only count the number of redundant nucleotide sites involved in the SNP detection within each contig, thus excluding non redundant nucleotide sites, which were only represented by one sequence in the contig alignment. With this approach, the SNP frequency range from one per 712 redundant nucleotide to one per 331 redundant nucleotide sites for contigs of two or 10 and more clones, respectively. We thus detected one SNP per 511 redundant nucleotide sites, across all the contigs with 2 or more clones (Table 1). These estimates of SNP frequency have less of a downward bias than estimates based on the total alignment length because non redundant nucleotide sites are excluded. The difference in mean alignment length per contig can be seen on Figure 3 and Table 1, when excluding or including non redundant nucleotide sites. As contigs contain more clones, the relative impact of this factor diminishes (Figure 3). The proportion of paralogous SNPs estimated in the previous section was also based on SNP rates per redundant nucleotide site.
Detection of putative coding regions by sequence similarity searches and ORF prediction
To determine which SNPs lay outside or inside putative coding regions, two methods were used to predict open reading frames (ORFs). The first one was based on similarities with protein sequences in public databases and the second one was based on ab initio predictions of open reading frames.
The number of contigs with identifiable coding regions. ORFs were delineated based on one method (on the diagonal), based on a combination of two methods (in bold), and based on data found by both methods (in italics). The total number of snp'ed contigs was 3,590 (PSNP≥ 0.95).
e-value < e-10
e-value < e-10
p-value < e-08
p-value < e-08
e-value < e-10
e-value < e-10
p-value < e-08
p-value < e-08
A second method based on ORF prediction was used to pursue two objectives. It enabled us to identify putative ORFs among the 430 orphan contigs and to classify the SNPs as coding or not coding without any assumptions regarding sequence similarity. ORFs were predicted by using the Diogenes software which was trained with ORFs from several species, thus generating multiple matrices to detect ORFs in plant sequence datasets . We analysed the results generated by two matrices, one derived from the "Brassicaceae" and one from the "Pinaceae" training sets, using a p-value < 1e-08 in each case (Table 2). Diogenes detected ORFs with a in 214 out of the 430 orphan contigs (p-value < 1e-08). with only 9 detected by Diogenes trained with Pinaceae sequences only. We combined the ORFs detected by blastx and Diogenes and we were able to localize an ORF for a total of 3,374 out of the 3,590 snp'ed contigs. An ORF was assigned to 88% of the contigs based on blastx matches alone, whereas the combination of both blast and Diogenes increased the assignments to 94%. The assignments were distributed as follows: i) 3,160 ORFs with a blastx hit in Uniref100 (including 20 sequences with a match in the Arabidopsis proteome but not in UniRef100); ii) 225 ORFs predicted by Diogenes-Brassiceae; and iii) 9 ORFs specifically predicted by Diogenes-Pinaceae.
Estimating the rates of nonsynonymous and synonymous SNPs
In order to classify synonymous and nonsynonymous SNPs, we used the boundaries of the ORFs predicted by Diogenes wherever possible using, in preference over the boundaries of the conserved regions detected by blastx. Boundaries of coding regions delineated by similarity searches (blastx) were avoided because they could have introduced an estimation bias related to conserved regions. We compared the blastx and the Diogenes results to annotate the coding regions based on ORF predictions and developed and improved dataset to classify nonsynonymous and synonymous SNPs and estimate their frequencies. The improved dataset was still comprised of 3,374 ORFs but distributed as follows: i) 2,823 ORFs predicted by Diogenes-Brassiceae ii) 9 ORFs specifically predicted by Diogenes-Pinaceae, iii) 533 regions similar to known proteins identified by blastx searches against Uniref100 but without any ORF predicted by Diogenes, and iv) 9 regions similar to Arabidopsis proteins identified with blastx but without any ORF predicted with Diogenes.
Descriptive parameters of coding SNPs
e-value < 1e-10
e-value < 1e-10
Diogenes ORF Brassicaceae trained
p-value < 1e-08
Diogenes ORF Pinaceae trained
p-value < 1e-08
Dataset of 205 ORF predicted by Diogenes but with no match in Uniprot e-value < 1e-10
Combination of all methods
Contigs with a putative coding sequence assigned
Contigs with no coding region assigned
Synonymous SNPs (1)
Nonsynonymous SNPs (2)
Total coding SNPs
Number of nonsynonymous sites (La)
Number of synonymous sites (Ls)
Total number of coding sites (La+Ls)
Rate of nonsynonymous SNP per site (2)/La (Ka)
Rate of synonymous SNP per site (1)/Ls (Ks)
By combining several approaches to delineate ORFs we were able to increase the number of coding SNPs by 13%. With Diogenes, we classify 221 SNPs (among the 430) contigs with no match in Uniref100. In this subset of 221 SNPs, the rate of synonymous substitutions was 6.18 SNPs per 1,000 sites and that of nonsynonymous substitutions was 1.10 SNP per 1,000 sites, which are slightly higher rates than the overall dataset., but the overall Ka/Ks ratio remained about the same at 0.18. This result was expected, as the discovery of additional SNPs by in silico methods should not affect the overall balance between nonsynonymous and synonymous SNPs.
Integration of the SNP resource in ForestTreeDB
To facilitate the retrieval of snp'ed contigs according to their functional annotations, the white spruce SNP data was uploaded into ForestTreeDB, a database unifying EST data and sequence annotations. ForestTreeDB includes tables containing data related to 12,264 white spruce SNPs detected in silico. These public query pages are available to mine the data at http://ccgb.umn.edu:8088/ForestTreeDB/nimbus/project.do?project=ForestTreeDB. Each SNP is described by its location on the contig sequence, including the appropriate strand. The possible bases of the SNP are represented by the IUPAC nucleotide ambiguity code. An "evidence code" indicates the SNP status as "predicted", "validated," or "not validated". The number of distinct clones that support each SNP location is retained, as well as the analytical parameters used for the prediction, including the p_prior value and the confidence level of the in silico detection (PSNP). Several computational approaches were used to explore the putative functions of the sequences recovered in the snp'ed contig dataset the results were incorporated in the ForestTreeDB database. The analyses included sequence similarity searches against several databases and Hidden Markov Models searches with the models available in the PFAM protein families database, as described previously . Approximately 87% of the contigs containing at least one SNP had a blastx hit in UniRef100 (e-value < 1e-10; Table 2). We assigned putative GO terms to the subset of snp'ed contigs through the sequence similarity searches against UniRef100 and the Arabidopsis TAIR database providing a link to molecular function terms in the Gene Ontology (GO). Among the diverse molecular function categories, 1,943 annotations (belonging to 94 functional classes) were associated to 1,388 snp'ed contigs (one contig may be associated to more than one GO term). The wide diversity of predicted functions among the snp'ed contigs. indicates that dataset could be useful in diverse association studies drawing upon genes from diverse protein functions.
Plant material and EST collection
A total of 17 Picea glauca cDNA libraries were prepared from a diverse tissues sampled to maximize the diversity of the isolated genes. Details about the sampled tissues are available on the Arborea web site . Nine libraries were prepared from the genotype PG653, and the remaining eight libraries were prepared from 11 different accessions of Picea glauca to incorporate more genotype diversity. Details of the preparation of cDNA libraries and EST sequencing methods previously presented elsewhere .
All of the clones were randomly chosen and sequenced from the 3' end . A subset of clones selected among 10 libraries were also sequenced from the 5' end. As the 3' and 5' reads from the same clone may overlap in the same contig, all SNP diversity and distribution parameters were estimated by considering the number of clones analysed instead of the number of reads. Thus, we directed our search for SNPs to the 6,459 contigs derived from at least 2 clones and which had mean length 1,009 nucleotides of Phred score above 20. The majority of contigs, namely 73.4%, were derived from both 3' and 5' reads; they had an average length of 1,078 nucleotide and represented completely sequenced inserts. A total 20.4% of the contigs were derived from the 3' reads and only 6.2% were from the 5' reads alone. The 5' and 3'contigs were slightly shorter, with mean lengths 803 and 823 nucleotides, respectively. The EST sequences were deposited in the dbEST section of the GenBank database. The contig sequences, the contig composition, blast reports, and some statistics about the sequence dataset are available at our web site .
Experimental validation of the in silico detected SNPs by resequencing of genomic DNA
Experimental validation of the in silico detected SNPs involved PCR amplification and sequencing of genomic DNA corresponding to EST contigs. To minimize the risk of amplifying multiple members of the same multigenic family when amplifying genomic DNA corresponding to EST contigs, we chose contigs with a small number of hits against the more extensively sequenced pine transcriptome (Pinus Gene Index Release 4). We assumed that these contigs would be part of small gene families and would also be less represented in the spruce genome. We also extrapolated the ATG and/or the STOP positions on the spruce contigs based on the assumption that spruce proteins would be of similar size than the complete Arabidopsis proteins against which they had a match. We thus anchored one of the two primers (or both) upstream of the ATG or downstream of the STOP codons to target the less conserved untranslated regions. For the contigs in which ATG or STOP codons could not be predicted, we selected sequence regions that were not conserved with known sequences to anchor at least one of the two PCR primers. Primers for PCR amplification and sequencing were designed by using the Primer3 software  installed locally. The primer sequences were compared with blast  against the spruce and the pine transcript datasets to exclude primers with possible hits in multiple transcripts. PCR reactions were performed in 30 μl containing 20 mM Tris-HCl (pH 8.4), 50 mM KCl, 1.5–2.0 mM MgCl2, 200 μM of each dNTP, 200 μM of both 5' and 3' primers and 1.0 unit of Platinum Taq DNA Polymerase (Invitrogen, Carlsbad, California). About 5–20 ng of genomic DNA was used as template. A peltier Thermal Cycler (DNA Engine, DYAD™, MJ Research, Walthman, Massachusetts) was used, with the following thermal cycling program : 4 min at 94°C, followed by 35 cycles of 30 s at 94°C, 30 s at annealing temperature optimized between 54 and 60°C for each pair of primers, and 1 min at 72°C, followed by 10 min at 72°C. Each PCR product was directly sequenced in both directions with a Perkin-Elmer ABI 3700 XL DNA sequencer (Applied Biosystems, Foster City, California), using BigDye Terminator cycle sequencing kit (version 3.1). Contigs were constructed with the seqmerge program in the GCG package (Wisconsin Package Version 10.3, Accelrys Inc., San Diego, California).
EST sequence processing and SNP automated detection
EST sequences were processed with locally-developed system at the Center for Computational Genomics and Bioinformatics, University of Minnesota, for contig quality control, as was previously described in detail . Sequence trace files from the spruce ESTs were processed to yield raw sequences with the Phred base calling software version 0.020425.c . Phred quality values less than 20 were considered to be ambiguous in this experiment and were assigned base N. Quality trimming and vector filtering (with polyA/polyT removal, as appropriate) were done using the software gstVF4 . Processed EST sequences were then assembled using Phrap (version 0.990319) . Phrap contigs were evaluated for chimeric sequences, and reassembled so as to minimize chimeric tendencies. Final assembly threshold for phrap were -minmatch 50 -minscore 100. This procedure produced better quality contigs prior to running PolyBayes. For the SNP prediction, the PolyBayes software version 3.0  was run with the following parameters: -inputFormat ace -aceIn spruce.fasta.screen.ace.1 -phdFilePathIn../phd_dir -reportOut../polybayes.out -p_prior (0.001 0.00125 0.0025 0.005 0.01 0.02) -thresholdSnp 0. The candidate SNPs were filtered by using the probability PSNP provided by PolyBayes with several cutoffs for PSNP. PolyBayes outputs were analysed with in-house scripts written in Perl.
Statistical estimation of the proportion of paralogous SNPs
The following procedure was used to estimate the proportion of paralogous versus orthologous SNPs in the SNP dataset. In the absence of paralogous SNPs, it can be shown that
k = a θ (1)
where k is the proportion of segregating sites, those harboring a SNP, and θ = 4N e μ, N e being the effective population size and μ, the mutation rate per site per generation, and
a = 1 + 1/2 + ... + 1/(n-1) (2)
where n is the number of sequences sampled per gene (eq. 9.8, ). When paralogous sequences are inter-mixed with orthologous sequences, k is biased upward by a quantity corresponding to ε, such that
k = a θ + ε (3)
Thus, ε could be estimated by solving the following set of equations:
kPG653 = θ + ε (4)
kpop = a θ + ε (5)
where kPG653 and kpop correspond to the number of segregating sites in sequences from the single genotype PG653 and from the population sample of 12 genotypes, respectively. In both cases, the number of sites sampled corresponded to the cumulative length of sequences of redundant nucleotide sites, those represented in at least two clones, and we used only SNPs with PSNP ≥ 0.95. Because contigs with a small number of clones are biased downward in terms of sampling SNP diversity (Figure 3), we limited our analysis to contigs containing 10 clones or more when considering sequences from PG653 only as well as when considering sequences from all 12 genotypes. In the case of PG653 and equation (4), we assumed that all contigs with 10 clones and more contained sequences from both haploid complements, thus the sampling size n corresponded to 2 and thus, a = 1 was used when transforming equation (1) to (4). When considering all 12 genotypes and equation (5), lower and higher bound scenarios were considered. A first scenario assumed a sampling size of 10, given that contigs with 10 or more sequences were analysed, thus corresponding to an a value of 3.733 in equation (5). We also tested a second scenario, by assuming a sampling size of 24 corresponding to the number of haploid complements used to obtain EST sequences, with corresponding a value of 2.929 in equation (5). This scenario is less likely, given that very few contigs had a number of sequences in excess of 30. Then, the proportion of paralogous SNPs in the entire dataset could simply be obtained by the ratio of ε over kpop.
Phrap contigs were analysed by blastx comparisons  against several databases, including a non-redundant peptide set (UniRef100) provided by the UniProt consortium , Arabidopsis protein dataset retrieved from the TAIR web site , and the Pine Gene Index (PGI4) retrieved from TIGR . We extrapolated the Gene Ontology (GO) terms [47, 48] associated with sequences in UniRef100 and the TAIR databases to the contigs using sequence similarity criteria. The Diogenes software , which was trained with sequences from multiple species to predict open reading frames, was used with two parameter sets derived respectively from Brassicaceae and Pinaceae species. Only those predicted coding regions with p < 1e-08 were retained. A Perl script was written to standardize the blastx output and the Diogenes output (frames were converted into phases, and the start/end of the ORFs were recalculated to comply with the blast HSP start/end). To sort the SNPs into coding and non coding classes, a Perl program was developed to extract from the contig sequence the coding regions, to split them into triplets based on the frame information and then to deduce the codons based on the strand information. To determine whether the SNP induced an amino acid change, the CodonTable from Bioperl was incorporated into the main script.
Funding for this work was provided by Genome Canada and Genome Québec to J.M. and J.B. for the project Arborea. The authors are grateful to John Crow (CGGB, Univ. Minnesota) for helping with the prediction of the ORF with Diogenes, France Gagnon (Arborea, Univ. Laval) for her skillful help with the experimental validation, Etienne Noumen (Arborea, Univ. Laval) for the development of scripts, Jérôme Laroche (Centre for Bioinformatics, Univ. Laval) for estimating the numbers of synonymous and nonsynonymous positions, Nathalie Isabel for providing the PG653 material, Jim Johnson and the CCGB team for the work on the database.
- Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL, Hunt SE, Cole CG, Coggill PC, Rice CM, Ning Z, Rogers J, Bentley DR, Kwok PY, Mardis ER, Yeh RT, Schultz B, Cook L, Davenport R, Dante M, Fulton L, Hillier L, Waterston RH, McPherson JD, Gilman B, Schaffner S, Van Etten WJ, Reich D, Higgins J, Daly MJ, Blumenstiel B, Baldwin J, Stange-Thomann N, Zody MC, Linton L, Lander ES, Altshuler D: International SNP Map Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001, 409: 928-933.PubMedView ArticleGoogle Scholar
- Matise TC, Sachidanandam R, Clark AG, Kruglyak L, Wijsman E, Kakol J, Buyske S, Chui B, Cohen P, de Toma C, Ehm M, Glanowski S, He C, Heil J, Markianos K, McMullen I, Pericak-Vance MA, Silbergleit A, Stein L, Wagner M, Wilson AF, Winick JD, Winn-Deen ES, Yamashiro CT, Cann HM, Lai E, Holden AL: A 3.9-centimorgan-resolution human single-nucleotide polymorphism linkage map and screening set. Am J Hum Genet. 2003, 73: 271-284.PubMedPubMed CentralView ArticleGoogle Scholar
- The Arabidopsis Information Resource. [http://www.arabidopis.org/]
- Jander G, Norris SR, Rounsley SD, Bush DF, Levi IM, Last RL: Arabidopsis map-based cloning in the post genome area. Plant Physiol. 2002, 129: 440-450.PubMedPubMed CentralView ArticleGoogle Scholar
- Schmid KJ, Sorensen TR, Stracke R, Torjek O, Altmann T, Mitchell-Olds T, Weisshaar B: Large-scale identification and analysis of genome-wide single-nucleotide polymorphisms for mapping in Arabidopsis thaliana. Genome Res. 2003, 13: 1250-1257.PubMedPubMed CentralView ArticleGoogle Scholar
- Feltus FA, Wan J, Schulze SR, Estill JC, Jiang N, Paterson AH: An SNP resource for rice genetics and breeding based on subspecies indica and japonica genome alignments. Genome Res. 2004, 14: 1812-1819.PubMedPubMed CentralView ArticleGoogle Scholar
- Shen YJ, Jiang H, Jin JP, Zhang ZB, Xi B, He YY, Wang G, Wang C, Qian L, Li X, Yu QB, Liu HJ, Chen DH, Gao JH, Huang H, Shi TL, Yang ZN: Development of genome-wide DNA polymorphism database for map-based cloning of rice genes. Plant Physiol. 2004, 135: 1198-1205.PubMedPubMed CentralView ArticleGoogle Scholar
- Marth GT: Computational SNP discovery in DNA sequence data. Single Nucleotide Polymorphisms: Methods and Protocols. Edited by: Kwok PY. 2002, Humana Press, Totowa, NJ, 212: 85-110.View ArticleGoogle Scholar
- Clifford RJ, Edmonson MN, Nguyen C, Scherpbier T, Hu Y, Buetow KH: Bioinformatics tools for single nucleotide polymorphism discovery and analysis. Ann N Y Acad Sci. 2004, 1020: 101-109.PubMedView ArticleGoogle Scholar
- Picoult-Newberg L, Ideker TE, Pohl MG, Taylor SL, Donaldson MA, Nickerson DA, Boyce-Jacino M: Mining SNPs from EST databases. Genome Res. 1999, 9: 167-174.PubMedPubMed CentralGoogle Scholar
- Buetow KH, Edmonson MN, Cassidy AB: Reliable identification of large numbers of candidate SNPs from public EST data. Nat Genet. 1999, 21: 323-325.PubMedView ArticleGoogle Scholar
- Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res. 2001, 11: 1725-1729.PubMedPubMed CentralView ArticleGoogle Scholar
- Batley J, Barker G, O'Sullivan H, Edwards KJ, Edwards D: Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol. 2003, 132: 84-91.PubMedPubMed CentralView ArticleGoogle Scholar
- Beaumont MA, Rannala B: The Bayesian revolution in genetics. Nat Rev Genet. 2004, 5: 251-261.PubMedView ArticleGoogle Scholar
- Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier L, Kwok P, Gish WR: A general approach to single-nucleotide polymorphism discovery. Nat Genet. 1999, 23: 452-456.PubMedView ArticleGoogle Scholar
- Grivet L, Glaszmann JC, Vincentz M, da silva F, Arruda P: ESTs as a source for sequence polymorphism discovery in sugarcane: example of the Adh genes. Theor Appl Genet. 2003, 106: 190-197.PubMedGoogle Scholar
- Zhu YL, Song QJ, Hyten DL, Van Tassell CP, Matukumalli LK, Grimm DR, Hyatt SM, Fickus EW, Young ND, Cregan PB: Single-nucleotide polymorphisms in soybean. Genetics. 2003, 163: 1123-1134.PubMedPubMed CentralGoogle Scholar
- LeDantec LL, Chagné D, Pot D, Cantin O, Garnier-Géré P, Bedon F, Frigerio JM, Chaumeil P, Léger P, Garcia V, Laigret F, De Daruvar A, Plomion C: Automated SNP detection in expressed sequence tags: statistical considerations and application to Maritime pine sequences. Plant Mol Biol. 2004, 54: 461-470.View ArticleGoogle Scholar
- Rafalski A: Applications of single nucleotide polymorphisms in crop genetics. Curr Opin Plant Biol. 2002, 5: 94-100.PubMedView ArticleGoogle Scholar
- Neale DB, Savolainen O: Association genetics of complex traits in conifers. Trends Plant Sci. 2004, 9: 325-330.PubMedView ArticleGoogle Scholar
- Guillet-Claude C, Isabel N, MacKay J, Bousquet J: Molecular evolution of regulatory genes in the conifer Picea : evidence for frequent intragenic recombination, selection, and recent population expansion. 2006,Google Scholar
- Pelgas B, Bousquet J, Beauseigle S, Isabel N: A composite linkage map from two crosses for the species complex Picea mariana × Picea rubens and analysis of synteny with other Pinaceae. Theor Appl Genet. 2005, 111: 1466-1488.PubMedView ArticleGoogle Scholar
- Kirst M, Johnson AF, Baucom C, Ulrich E, Hubbard K, Staggs R, Paule C, Retzel E, Whetten R, Sederoff R: Apparent homology of expressed genes from wood-forming tissues of loblolly pine (Pinus taeda L.) with Arabidopsis thaliana. Proc Natl Acad Sci USA. 2003, 100: 7383-7388.PubMedPubMed CentralView ArticleGoogle Scholar
- Pine EST database. [http://pine.ccgb.umn.edu/]
- Pine EST database at the University of Georgia. [http://fungen.org/Projects/Pine/Pine.htm]
- Pinus Gene Index. [http://www.tigr.org/tigr-scripts/tgi/T_index.cgi?species=pine]
- Spruce Gene Index. [http://www.tigr.org/tigr-scripts/tgi/T_index.cgi?species=spruce]
- SpruceDB. [http://ccgb.umn.edu/Pub_SpruceDB]
- Arborea project. [http://www.arborea.ulaval.ca]
- Pavy N, Paule C, Parsons L, Crow J, Morency MJ, Cooke J, Johnson JR, Noumen E, Guillet-Claude C, Butterfield Y, Barber S, Yang G, Liu J, Stott J, Kirkpatrick R, Siddiqui A, Holt R, Marra M, Séguin A, Retzel E, Bousquet J, MacKay J: Generation, annotation, analysis and database integration of 16,500 white spruce EST clusters. BMC Genomics. 2005, 6: 144-PubMedPubMed CentralView ArticleGoogle Scholar
- Bouillé M, Bousquet J: Trans-species shared polymorphisms at orthologous nuclear gene loci among distant species in the conifer Picea (Pinaceae): Implications for the long-term maintenance of genetic diversity in trees. Am J Bot. 2005, 92: 63-73.PubMedView ArticleGoogle Scholar
- Brown GR, Gill GP, Kuntz RJ, Langley CH, Neale DB: Nucleotide diversity and linkage disequilibrium in loblolly pine. Proc Natl Acad Sci U S A. 2004, 101: 15255-15260.PubMedPubMed CentralView ArticleGoogle Scholar
- Useche FJ, Gao G, Harafey M, Rafalski A: High-throughput identification, database storage and analysis of SNPs in EST sequences. Genome Informatics Series. 2001, 12: 194-203.PubMedGoogle Scholar
- Marth G, Yeh R, Minton M, Donaldson R, Li Q, Duan S, Davenport R, Miller RD, Kwok PY: Single-nucleotide polymorphisms in the public domain: how useful are they?. Nat Genet. 2001, 27: 371-372.PubMedView ArticleGoogle Scholar
- Reich DE, Gabriel SB, Altshuler D: Quality and completness of SNP databases. Nature Genetics. 2003, 33: 457-458.PubMedView ArticleGoogle Scholar
- Crow JA: Diogenes – Reliable prediction of protein-encoding regions in short genomic sequences. 2005, [http://analysis.ccgb.umn.edu/diogenes]Google Scholar
- Li WH, Wu CH, Luo CC: A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol. 1985, 2: 150-174.PubMedGoogle Scholar
- Wright S, Lauga B, Charlesworth D: Rates and patters of molecular evolution in inbred and outbred Arabidopsis. Mol Biol Evol. 2002, 19: 1407-1420.PubMedView ArticleGoogle Scholar
- Zhang L, Vision T, Gaut B: Patterns of nucleotide substitution among simultaneously duplicated gene pairs in Arabidopsis thaliana. Mol Biol Evol. 2002, 19: 1464-1473.PubMedView ArticleGoogle Scholar
- Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and for biologist programmers. Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by: Krawetz S and Misener S. 2000, Humana Press, Totowa, NJ, 365-386.Google Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.PubMedView ArticleGoogle Scholar
- Ewing B, Hiller LD, Wendl MC, Green P: Base calling of automated sequencer traces using Phred. II. Accuracy assessment. Genome Res. 1998, 8: 175-185.PubMedView ArticleGoogle Scholar
- gstVF4 software. [http://biodata.ccgb.umn.edu/]
- Phrap software. [http://www.phrap.org]
- Li WH: Molecular Evolution. 1997, Sinauer Associates, Sunderland, MassachusettsGoogle Scholar
- UniProt consortium. [http://www.uniprot.org]
- Gene Ontology. [http://www.geneontology.org]
- Consortium GO: Creating the gene ontology resource: design and implementation. Genome Res. 2001, 11: 1425-1433.PubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.