High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome
© Novaes et al. 2008
Received: 29 February 2008
Accepted: 30 June 2008
Published: 30 June 2008
Skip to main content
© Novaes et al. 2008
Received: 29 February 2008
Accepted: 30 June 2008
Published: 30 June 2008
Benefits from high-throughput sequencing using 454 pyrosequencing technology may be most apparent for species with high societal or economic value but few genomic resources. Rapid means of gene sequence and SNP discovery using this novel sequencing technology provide a set of baseline tools for genome-level research. However, it is questionable how effective the sequencing of large numbers of short reads for species with essentially no prior gene sequence information will support contig assemblies and sequence annotation.
With the purpose of generating the first broad survey of gene sequences in Eucalyptus grandis, the most widely planted hardwood tree species, we used 454 technology to sequence and assemble 148 Mbp of expressed sequences (EST). EST sequences were generated from a normalized cDNA pool comprised of multiple tissues and genotypes, promoting discovery of homologues to almost half of Arabidopsis genes, and a comprehensive survey of allelic variation in the transcriptome. By aligning the sequencing reads from multiple genotypes we detected 23,742 SNPs, 83% of which were validated in a sample. Genome-wide nucleotide diversity was estimated for 2,392 contigs using a modified theta (θ) parameter, adapted for measuring genetic diversity from polymorphisms detected by randomly sequencing a multi-genotype cDNA pool. Diversity estimates in non-synonymous nucleotides were on average 4x smaller than in synonymous, suggesting purifying selection. Non-synonymous to synonymous substitutions (Ka/Ks) among 2,001 contigs averaged 0.30 and was skewed to the right, further supporting that most genes are under purifying selection. Comparison of these estimates among contigs identified major functional classes of genes under purifying and diversifying selection in agreement with previous researches.
In providing an abundance of foundational transcript sequences where limited prior genomic information existed, this work created part of the foundation for the annotation of the E. grandis genome that is being sequenced by the US Department of Energy. In addition we demonstrated that SNPs sampled in large-scale with 454 pyrosequencing can be used to detect evolutionary signatures among genes, providing one of the first genome-wide assessments of nucleotide diversity and Ka/Ks for a non-model plant species.
The high-throughput and cost-effectiveness of DNA pyrosequencing using 454 Life Sciences technology  has been successfully applied to large-scale EST sequencing in maize [2, 3], Medicago  and Arabidopsis [5, 6], resulting in a significant contribution of additional ESTs for these species. However, these studies were carried out on organisms with extensive transcriptome sequences already available. Pre-existing sequences support the assembly of 454 reads into contigs, thereby minimizing the drawback of short average read length (100–200 bp) produced by the pyrosequencing technology. However, the benefits from recent improvements in sequencing technologies may be even more valuable for plant species with high economic value but limited genomic resources. Yet, it has not been shown whether the primary limitations of the pyrosequencing method - short read lengths and ambiguous homopolymer reads - can be overcome to produce useful information for species with essentially no prior gene sequence information.
The 454 sequencing also provides an opportunity to identify allelic variants by sequencing and aligning ESTs from several haplotypes. Recently, two 454 cDNA sequencing runs, each interrogating a single maize inbred line, were used to identify over 36,000 putative single nucleotide polymorphisms (SNPs) . SNP discovery with 454 technology could also be accomplished by simultaneously sequencing multiple genotypes to sample the nucleotide diversity of an organism [8, 9]. However, the assembly of individual haplotypes is not feasible when sequencing a cDNA pool from highly heterozygous individuals. The unavailability of haplotype sequences hinders the use of most statistics to calculate genetic diversity. Therefore, new models are needed to estimate population genetic parameters.
Here we describe sequencing, assembly, and SNP discovery from 454-derived EST sequences of Eucalyptus grandis, a species with virtually no public genome sequence information. Eucalyptus is the most widely planted woody angiosperm in the world . Because of its value as a renewable source of raw material for wood, paper products, and biofuels, E. grandis has been selected for sequencing by the U.S. Department of Energy/Joint Genome Institute for completion in 2010 . Accurate annotation of the E. grandis genome sequence will rely heavily upon a repository of expressed sequences, as was demonstrated during the annotation of the Arabidopsis  and Populus  genomes. However, only 1,939 E. grandis expressed sequence tags (ESTs) are currently deposited in the National Center for Biotechnology Information (NCBI) database. Thus, rapid development of a large EST sequence collection will be crucial to support the E. grandis genome sequence annotation and for continued advancement of Eucalyptus genomics research.
With the purpose of generating the first broad survey of genes in a Eucalyptus species, we sequenced and assembled 148 Mbp of E. grandis ESTs from two GS-20 and one GS-FLX 454 pyrosequencing runs. Assembled sequences (25.4 Mbp) were deposited in the GenBank representing a 37× enrichment in publicly available expressed sequences for the species. Sequences were generated from a normalized cDNA pool comprised of multiple tissues and genotypes, promoting extensive gene discovery and a comprehensive survey of allelic variation in the transcriptome. We demonstrate that 454 reads can be reassembled into contigs without utilizing traditional cDNA sequences as scaffolds. In addition, we show that SNPs detected from pyrosequencing a pool of genotypes are useful to reveal selection signatures among genes. Pyrosequencing technology can rapidly provide a foundational public genomic resource for an economically important but previously uncharacterized species.
Summary of E. grandis cDNA sequences.
Number of reads
454 all (3 runs)
GS-20 + FLX
Summary and distribution of assembled sequences. Length distribution and characteristics of contigs assembled from the two GS-20 runs, one GS-FLX run, the three 454 runs combined and from the control Sanger sequenced ESTs
Run(s) in assembly
1 + 2
1 + 2 + 3
≤ 100 bp
101 - 250 bp
251 - 500 bp
501 - 750 bp
751 - 1000 bp
> 1000 bp
Average contig length (bp)
Reads in contigs
Next we compared the sequences from 454 with 86,328 ESTs obtained from a proprietary database derived from sequencing a broad set of Eucalyptus cDNA libraries using conventional (Sanger) dideoxy-based sequencing (Table 1). The comparison shows that although Sanger contigs were built from a dataset with only one third the total basepairs of that derived from 454, they are on average more than twice as long (Table 2). However, the larger number of reads generated by the 454 technology substantially improves the efficiency of gene discovery, as demonstrated below.
Proportion of gene models with homologies to E. grandis cDNAs.
Matches against 454 unigenes
Matches against Sanger unigenes
Arabidopsis with transcript evidence (22,032)
To compare the efficiency of gene discovery in the two sequencing platforms (454 and Sanger) we established a unigene set for the Sanger sequenced ESTs by combining the 21,432 contigs with 17,203 singlets. The Sanger unigene set has a total of 22.05 Mbp, and is therefore comparable to the 25.42 Mbp in the E. grandis 454 unigenes measuring over 100 bp. We first aligned the two unigene sets to each other using BlastN (E value 10-5), and detected 84% of the Sanger unigenes having a match to the 454 dataset but only 41% of the 454 unigenes having homology to the Sanger sequences (data not shown). This is an indication of a greater coverage of distinct cDNAs in the 454 derived sequences. However, it is possible that the higher frequency of shorter sequences found in the 454 dataset contributed to an increased number of no-hits, as shorter sequences are less likely to align with a significant E-value. To rule out this possibility, Sanger unigenes were also used in an analogous BlastX against the Arabidopsis, Populus and Oryza gene models. For all organisms and all Blast thresholds applied, a larger number of gene models had similarity to the 454 unigenes than those generated by Sanger sequencing (Table 3). Therefore, the large number of reads generated by 454 pyrosequencing appears to maximize gene discovery in EST sequencing projects. On the other hand, mean Blast alignment lengths using the 454 unigenes are approximately 2× shorter than for the Sanger unigene set (data not shown). Thus, although the 454 unigene set samples a broader diversity of transcriptional units, this occurs at the cost of decreased length of sequence of the individual genes.
Polymorphisms detected among E. grandis cDNA sequences.
Number of variants
Number of contigs containing SNPs
Involving two or more nucleotides
Lower confidence SNPs (freq rare allele <10%)
Higher confidence SNPs (freq rare allele ≥ 10%)
Validation of SNPs with conventional sequencing.
Number of predicted SNPs
The proportion of mutations that change amino acid sequence (i.e. non-synonymous) relative to those that do not (i.e. synonymous) can indicate whether a gene is under purifying, neutral or diversifying selection . The large number of ESTs sequenced from a mixture of genotypes could provide information for a genome-wide analysis of gene evolution based on the ratio of non-synonymous (Ka) to synonymous substitutions (Ks). To carry out this analysis, we initially determined whether SNPs introduce synonymous or non-synonymous mutations by (1) defining the sequence reading frame with a BlastX alignment against Arabidopsis peptides, (2) isolating codons containing SNPs, and (3) comparing the translated amino acids for each allele. Next, the degeneracy of nucleotides in the contigs' coding sequence was evaluated to estimate non-synonymous to synonymous substitution rate (Ka/Ks). We estimated Ka/Ks for 2,001 E. grandis contigs that have at least three high confidence SNPs and one positive BlastX hit against Arabidopsis gene models (E value 10-5) [see Additional file 1]. Distribution of Ka/Ks among these contigs was right-skewed, with estimates ranging from 0.008 to 2.101 and averaging 0.30, suggesting that the majority of the genes are under purifying selection.
GO categories enriched for E. grandis genes under purifying and diversifying selection.
Proportion out extreme
Proportion in extreme
ubiquitin-dependent protein catabolic process
chromosome organization and biogenesis
ribosome biogenesis and assembly
response to hydrogen peroxide
response to high light intensity
multicellular organismal development
where S is the number of SNPs detected in the contig, L is the contig sequence length and D is the sequencing depth estimated by the average number of reads aligned to each nucleotide position during assembly of the contig. The constant 1 was added to the number of SNPs in the numerator to enable comparisons with contigs containing no detected SNP. Contigs with extensive length and depth but for which no SNP is detected represent highly conserved genes.
There is one important distinction between β and the traditional nucleotide polymorphism estimate θ. When calculating θ, the average sequencing depth (D) variable in β is, instead, the number of different sampled haplotypes. These two variables are different because sequence reads from 454 represent a random sample from the pool of haplotypes in the cDNA library. Therefore some haplotypes may be sequenced multiple times, while others may not be sampled at all. Although this complicates comparisons with θ estimates from other studies, β is useful as a relative measurement to compare the nucleotide diversity between contigs generated within this project.
Summary of diversity in expressed sequences of E. grandis.
1.86 × 10-3
1.65 × 10-3
1.22 × 10-4 - 9.11 × 10-3
1.81 × 10-3
1.49 × 10-3
1.35 × 10-4 - 15.14 × 10-3
7.88 × 10-3
6.19 × 10-3
6.67 × 10-4 - 48.15 × 10-3
GO categories enriched among conserved and diverse genes of E. grandis.
Proportion out extreme
Proportion in extreme
malate metabolic process
ubiquitin-dependent protein catabolic process
response to biotic stimulus
Our analysis of 148.4 Mbp of E. grandis expressed sequences generated with three 454 sequencing runs demonstrates that short reads produced with pyrosequencing technology can be assembled de novo into reasonably long contigs; an advantage for species with limited public genomic resources. The 25.4 Mbp comprised in our unigene set represents an enrichment of 37× in the amount of publicly available E. grandis expressed sequences, and will provide substantial support for the genome sequencing and annotation that are currently under way . Longer reads from the GS-FLX were essential to assemble a reasonable proportion of biologically relevant contig sequences. Although our sequencing effort produced three times more sequence (≈ 148 Mbp) than the control dideoxy-based EST library (Sanger, ≈ 45 Mbp), the 454 short read-based contigs average only one third of the size. Despite this limitation, a substantial gain occurs at the level of gene discovery - the large number of reads in our 454 dataset leads to sampling of sequences from a much broader variety of genes. Therefore, 454-based gene discovery projects represent a viable and, perhaps favorable alternative to Sanger-based sequencing of EST libraries when a diverse sampling of genes is more important than obtaining transcript-length contigs. As GS-FLX becomes the standard 454 pyrosequencing platform, large-scale EST sequencing and gene discovery projects will be more successful in assembly of transcript-length contigs.
We also demonstrated that the detection of valid SNPs is possible by sequencing a pooled sample of highly heterozygous genotypes. By aligning the reads derived from cDNA of 21 E. grandis genotypes, we were able to detect 23,742 SNPs and validate 83% of a sample of 337. Therefore, approximately 4,000 of the detected SNPs may be false positives, possibly arising from sequencing errors or alignment of paralogs. Paralogs that share high levels of sequence similarity may have been assembled in the same contig because they cannot be distinguished due to the short read length of 454 pyrosequencing. This would lead to the detection of false SNPs. On the other hand, a higher stringency of assembly raises the possibility of the opposite problem: the separate assembly of haplotypes from highly polymorphic genes. However, the assessment of the EST assembly quality in this study is difficult due to the lack of a reference genome sequence. Nonetheless, the validation rate observed here was similar to the 85% reported for maize, where SNPs were detected by comparing sequences from two separate 454 runs, each interrogating a different inbred homozygous line .
A significant number of polymorphisms in our sample may have been overlooked because of the stringent methodology used to declare SNPs. For contigs with reasonable sequence coverage (length > 200 bp) and depth (> 10 reads on average per nucleotide) we detected one SNP for every 192 bp on average. This rate of SNP discovery appears low compared to previous reports of one SNP every ≈ 100 bp in coding sequences of Eucalyptus [18, 19] and other forest species [20–25]. There are at least three foreseeable reasons for missing true SNPs that are intrinsic to our experimental methodology and SNP detection approach. First, because 454 was used to randomly sequence cDNAs, not all genotypes were sequenced for every SNP locus - in fact, the average sequencing depth (6.7 reads/bp) in our experiment is far lower than the number of possible haplotypes in our sample. Secondly, the requirement that a SNP be observed in at least 10% of the reads aligned to a polymorphic site compromised the sensitivity required to detect rare alleles. Studies have demonstrated an excess of rare alleles in natural populations of forest tree species [22–24] which are probably largely discarded by our approach. Additionally, there is a relatively high level of relatedness among the 21 individuals utilized in our study. The sampled genotypes come from seven open-pollinated families - i.e. seven groups of three individuals share one common maternal parent, limiting the genetic diversity sampled relative to what might be present in a similarly sized sample of unrelated trees. Finally, the detection of polymorphisms was likely hindered by the requirement that at least two reads containing the variant alleles have at least 20 nucleotides of conserved sequence upstream and downstream of the locus. This requirement is intended to minimize the discovery of false polymorphisms due to the alignment of paralogs - a potentially significant problem when aligning short sequence reads. Therefore, only nucleotide variants in relatively conserved or recently derived paralogs may have been incorrectly identified as SNPs. The drawback is that true SNPs in hotspots of genetic diversity or genes under high diversifying selection (e.g. disease-resistance genes, ) may be discarded. Considering the high diversity found in forest trees [20–25], the requirement for sequence conservation in regions surrounding a SNP may be too conservative.
High throughput DNA sequencing and SNP discovery from a pool of multiple genotypes may be a powerful approach for rapid assessment of genetic diversity and selection in a genomic scale. Genome-wide surveys of genetic diversity have only recently been reported in the model plant Arabidopsis . Here we attempted to generate an approximate estimate of genetic diversity for a broad sample of genes by adapting existing parameters of genetic diversity to our experimental methodology. The most commonly used measure of genetic diversity, the nucleotide diversity parameter theta (θ), is an estimate of the number of polymorphic sites in a random haplotype sample from a population and is independent of the allele frequencies [17, 28]. In our study, the number of independent haplotypes sampled is unknown, as the same haplotype may be sampled repeatedly. Therefore we developed a modified nucleotide diversity estimator beta (β) based on SNPs detected by randomly sequencing a multi-genotype cDNA pool. β is always underestimated relative to θ, because the number of reads at a given SNP position is always equal or smaller than the effective (true) number of genotypes. The lower sensitivity to detect SNPs discussed previously also contributes to the underestimation of β. Therefore, the average β reported here (0.00186) is, as expected, much lower than the average θ estimated for genes of Populus by 9× , Douglas fir by 3.8× , loblolly pine by 2.2–2.7× [20, 21, 29] and Norway Spruce by 1.7× . Although it is not possible to compare estimates of β to the commonly used θ, they are useful for a comparison of nucleotide diversity among genes in this project.
In addition to the nucleotide diversity (β) we also estimated the proportion of non-synonymous to synonymous substitutions (Ka/Ks). As observed in other plant species [13, 16, 30] most genes of E. grandis appear to be under purifying selection and, accordingly, Ka/Ks distribution averages 0.30 and is heavily right-skewed. Among genes predicted to be under strong purifying selection based on the Ka/Ks ratio, there is enrichment for GO categories involving essential biological processes conserved across kingdoms, such as translation, ubiquitin-dependent protein degradation and nucleosome assembly by histones. rRNA and ribosomal proteins have been shown to be highly conserved between species and to evolve under strong purifying selection [31–33]. Several studies also confirm that histone genes evolve constrained by negative selection [34–36].
Among genes classified as diverse and/or under positive selection based on βN and Ka/Ks distributions, there is enrichment for genes classified as unknown biological process, defense response and response to biotic stress, and multicellular development. Other researchers have already reported an excess of unknown function among genes under positive selection [16, 30, 37]. A possible explanation for this overrepresentation is that the category of uncharacterized genes may be enriched for duplicated genes where relaxed selection constraints are leading their diversification and/or eventual silencing . In fact, the category may include pseudogenes and transcribed but untranslated loci. It is also possible that duplicated genes might have higher nucleotide diversity due to the assembly of paralogs. However we do not anticipate false SNPs to bias Ka/Ks estimates because they are not expected to occur more or less frequently in synonymous versus non-synonymous sites by chance.
Genes acting in defense response/response to biotic stimulus are frequently positively selected for diversification to compete with rapidly evolving avirulence genes of pathogens [26, 39]. We did find extensive diversity in non-synonymous sites (βN) of most defense response genes, but the diversity was also high among synonymous sites and, as a result, these genes were not enriched among those under diversifying selection (measured by the Ka/Ks). Similar results were reported in another study . One possible explanation is that positive selection generally only operates in certain domains (i.e., leucine-rich repeat (LRR)) of resistance genes [26, 39] and we estimated Ka/Ks over the entire coding sequence. The multicellular development GO class which is enriched among genes under diversifying selection is mainly due to the presence of NAC transcription factor genes. Transcription factors have been demonstrated to have an excess of positively selected genes in humans , and specifically one member of the NAC family was shown to be evolving rapidly in Arabidopsis . Finally, the literature survey supports our results and demonstrates that the proposed nucleotide diversity estimator (β) accurately depicts relative differences of variability within gene sequences.
The unigene sequences being released from this study will provide a much needed public resource for Eucalyptus research and will be important for the annotation of the forthcoming E. grandis genome sequence. 454-based EST sequencing and de novo assembly can provide a foundational transcriptome resource when limited prior sequence information is available. Lastly, we showed that nucleotide diversity of an organism can be sampled by high-throughput sequencing a pool of genotypes, and that this strategy is useful for detecting evolutionary signatures in a large number of genes that are in agreement with smaller scale traditional sequencing projects.
Three greenhouse-grown Eucalyptus grandis seedlings from each of seven open-pollinated families (21 genotypes in total) were dissected into xylem (Xy), phloem (Ph), roots (R), young leaves(YL), mature leaves (ML) and apical/lateral meristems (M). Tissues were immediately frozen in liquid nitrogen and stored. For each tissue, RNA was extracted by a standard protocol  from a pool of equal proportion of plant material from all 21 genotypes[41, 41]. RNA concentration was estimated using an ND-1000 Spectrophotometer (NanoDrop USA, Wilmington, DE) and integrity was evaluated on an agarose gel stained with ethidium bromide.
RNA isolated from each tissue pool was combined in varying proportions (10% YL, 10% ML, 15% R, 15% Ph, 20% M and 30% Xy) to a single pool in an attempt to maximize the diversity of transcriptional units sampled. Pooled RNA was DNase treated and purified using the RNAeasy Plant Mini Kit (Qiagen USA, Valencia, CA). Full-length cDNA was synthesized from 2 μg of RNA using the Clontech SMART cDNA Library Construction Kit (Clontech USA, Mountain View, CA) according to manufacturer's protocol, except that the Clontech CDSIII/3' PCR primer was replaced with the Evrogen CDS-3M adaptor (Evrogen, Moscow). cDNA was amplified using PCR Advantage II Polymerase (Clontech USA, Mountain View, CA) in 16 thermo cycles (7s at 95°C, 20s at 66°C, and 4 mins at 72°C) and was subsequently purified using the QIAquick PCR Purification Kit (Qiagen USA, Valencia, CA). The cDNA was normalized using the Evrogen Trimmer-Direct Kit (Evrogen, Moscow) to minimize differences in representation of transcripts. This normalization protocol is based on denaturing-reassociation of cDNAs, followed by digestion with a duplex-specific nuclease (DSN). The enzymatic degradation occurs primarily on the highly-abundant cDNA fraction. The single-stranded cDNA fraction was then amplified twice by sequential PCR reactions according to the manufacturer's protocol. Adaptors incorporated during the first strand synthesis were partially removed by SfiI digestion (8 U/μg of cDNA). Normalized cDNA was purified using the QIAquick PCR Purification Kit (Qiagen USA, Valencia, CA).
Approximately 15 μg of normalized cDNA were used for library construction and sequencing at the Interdisciplinary Center for Biotechnology Research (ICBR) at the University of Florida, following the procedures described by Margulies et al. . Two sequencing runs were produced on the GS-20 platform, and one on the GS-FLX. Bases were called with 454 software by processing the pyroluminescence intensity for each bead-containing well in each nucleotide incorporation. An initial assembly of the sequences was performed with Newbler version 1.1.02.15 (454 Life Science, Branford, CT). Newbler considers the normalized intensity of each nucleotide flow, instead of individual base calls, which is more suited to assembly with sequencing by synthesis like 454. However, Newbler v.1.1.02.15 does not mask sequence repeats like the adaptors used for the cDNA normalization. Thus, after initial assembly of all reads, except those containing adaptor sequences, in Newbler further assembly was performed with Paracel Transcript Assembler (PTA) version 3.0.0 (Paracel Inc., Pasadena, CA). All contigs and singletons resulting from the Newbler assembly were combined with adaptor-containing reads for input into PTA. Using a threshold of 15, all sequences were masked for the presence of oligonucleotide adaptors used during cDNA library preparation and normalization. Low base-call quality (score ≤ 10) data was trimmed from the ends of individual sequences. Low complexity sequence regions (simple sequence repeats) are identified and excluded from consideration during initial pair-wise comparison, but are included during final alignment and consensus building. Assembly is performed in two stages. The first stage uses "Haste" algorithm to build groups (or clusters) of sequences sharing a minimal amount of identifiable sequence similarity (threshold = 50). The second stage carefully assembles sequences within individual clusters into consensus transcripts using the software defaults, except parameters MinCovRep (500), InOverhang (30), EndOverhang (30), RemOverhang (30), QualSumLim (300), MaxInternalGaps (15) and PenalizeN (0). Files containing reads' sequence and quality scores were deposited in the Short Read Archive of the National Center for Biotechnology Information (NCBI) [accession number SRA001122]. Newbler and PTA assembly files were also submitted to NCBI.
To detect SNPs in the cDNA pool, we used the consensus assembly generated from all sequencing runs as a reference sequence to which individual reads were aligned using GS Reference Mapper (454 Life Science, Branford, CT). Each read was aligned to only a single best homologous site in the reference sequence. Reads aligning equally well in more than one location in the reference were discarded. GS Reference Mapper only scores polymorphisms where two or more reads contain the variant allele. Additionally, at least two reads with the alternative allele must include ≥ 20 bases upstream and downstream of the variable nucleotide and no more than one additional sequence polymorphism in this window. For the analysis reported here, we considered only single nucleotide polymorphisms (SNPs), excluding all indels and variants involving more than one nucleotide. We also imposed the constraint that the variant allele appears in at least 10% of the total number of reads covering the polymorphic site.
To validate a sample of detected SNPs, we designed primers to amplify 500–700 bp of transcripts containing a large number of putative SNPs. Using the normalized cDNA as template, fragments were amplified using PCR with 30 thermocycles of 94°C for 30s, 55°C for 30s and 72°C for 35s. The amplified fragments (amplicons) were purified using the QIAquick PCR Purification Kit (Qiagen USA, Valencia, CA). Each amplicon was sequenced with both forward and reverse primers using standard dideoxy-based technology analyzed on the ABI3730 platform (Applied Biosystems). Sequencing chromatograms were visually analyzed with Chromas 2.32 (Technelysium Pty. Ltd.), and SNPs were identified as overlapping nucleotide peaks. For the SNP to be verified, the variant allele must have a chromatogram peak at least 50% higher than background peaks.
Protein coding sequence for each consensus was delimited using the best BlastX hit (E value 10-5) against the Arabidopsis thaliana peptides translated from TAIR7 gene models. Codons for each consensus were identified and nucleotide degeneracy determined. The number of synonymous and non-synonymous sites was calculated for each contig. Next, we determined whether SNPs positioned in consensus coding sequences introduce synonymous or nonsynonymous mutations by comparing the translated amino acids from the reference and variant sequences. We calculated the proportion of nonsynonymous to synonymous mutation (Ka/Ks) for each consensus following Hartl and Clark , with the exception that one unit was added to both number of synonymous and non-synonymous substitutions. This was important to allow Ka/Ks estimation in cases where either type of substitution was not found. Had we not included this modification, genes with an excess of synonymous but no observed non-synonymous substitutions would all have Ka/Ks equal zero, regardless of their Ks value. On the other hand, genes without any observed synonymous substitution would have undefined Ka/Ks because of division by zero.
We thank the Brazilian Network of Eucalypt Genome (Genolyptus Project) for kindly offering their dideoxy sequenced ESTs for comparison. We thank Carolina Novaes for helping with the dideoxy-based SNP validation. We also thank Dr. Charles Baer for providing useful comments on the manuscript. This work was supported by The Consortium for Plant Biotechnology Research, Inc. (CPBR).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.