Plant material for Sanger sequencing
Plant material used to generate the cDNA and SSH libraries were collected from different tissues, developmental stages or after different treatments. Below we summarize (into six categories) the 20 libraries that were produced for Q. robur and Q. petraea (see Table 1):
i/libraries H and I: Quercus petraea seedlings were grown at INRA Nancy (North East of France). Acorns were harvested in a local forest and sown in 10 L containers with a peat and sand mixture (1/1: v/v). A complete fertilisation (4.5 g L-1 of slow-release fertiliser Nutricote T100; N/P/K/Mg: 13/13/13/2 + trace elements and 0.2 g L-1 of lime) was provided at the beginning. All individuals were watered daily to field capacity with deionized water. The seedlings were 22-24 weeks old when submitted to five different treatments. Three different seedlings were submitted to each stress. For the cold (10°C for 3 days) and heat (35°C for 4 days) stress treatments, growth cabinets were used. The elevated CO2 treatment (700 ppm for 18 days) was done using a climate controlled greenhouse. For the drought stress treatment, irrigation was stopped, the soil humidity was measured daily using a TDR (Trase 6050X1, Soilmoisture Equipment Corp., Santa Barbara CA, USA) and seedlings were harvested at about 17% remaining soil humidity. For root hypoxia, the seedlings were placed into water tight containers and the water level was maintained 1 cm above soil level. All stresses were pooled and three biological replicates were extracted and mixed in equal amount for cDNA synthesis.
ii/libraries B, J, K, L, P, S: adult trees from South West of France (INRA Pierroton forestry station) were sampled in their natural area. Leaves were taken in summer, differentiating xylem in spring and vegetative buds both in winter and spring to provide different bud development stages from the same genotypes.
iii/libraries A, C, D, E: vegetative buds from one- or two-year old seedlings from North West of France were acclimated in the nursery of INRA Pierroton (South West of France).
iv/library M: roots from one-year old trees were grown under optimal conditions in a greenhouse at INRA Nancy.
v/libraries Q, R: second flush sun-leaves were harvested in greenhouse grown plants in summer 2004 on one-year old cuttings from high and low water-use efficient genotypes identified in Brendel et al. .
vi/libraries F, G, N, O, T: cuttings were obtained from 10 sessile and 10 pedunculate oak trees in August 2006 at INRA Pierroton. After three months of growth, cuttings were placed into oxygenated water tight containers for 2 weeks in a growth chamber providing a 16-h photoperiod, a day/night temperature of 25°C/20°C, a day/night relative humidity of 85%/70% and a quantum flux of 260 μmol m-2 s-1. Root hypoxia was then imposed by using deoxygenated water obtained by bulling with N2, in order to maintain the O2 concentration below 5 mg L-1. White roots from each genotype and species were sampled after 6, 24 and 48 hours of hypoxia. In all cases, white roots were immediately dipped in liquid nitrogen to prevent degradation and stored at -80°C until RNA extraction.
RNA extraction for Sanger sequencing
Oak material was collected from either field grown trees or seedlings raised in greenhouse or phytotrons. Plant material used as RNA source for cDNA and subtractive library construction was detailed above. For each library devoted to Sanger sequencing, total RNA was extracted following the procedure described by Le Provost et al.  with a final purification step using the RNAeasy kit (QIAGEN, Courtaboeuf, France).
cDNA library construction for Sanger sequencing
Depending on the quantity of each tissue, cDNA libraries were constructed using either CloneMiner cDNA Library Construction Kit (Invitrogen Corporation, Carlsbad, CA, USA) and Stratagene cDNA synthesis kit (Stratagene, La Jolla, CA, USA) when plant material was abundant, or Creator SMART cDNA Library Construction Kit (Clontech Laboratories Inc., Mountain View, CA, USA) when plant material was limiting.
For libraries constructed with CloneMiner kit, mRNA was isolated from 200 μg of total RNA of each genotype using Dynabeads mRNA Purification Kit (Invitrogen Corporation) according to the manufacturer's protocol. cDNA synthesis was performed using 2 μg of mRNA. Ligation to attB1 adapter, cDNA size-fractionation and recombination reaction between pDONR222 vector and cDNAs were performed as described in the User Manual. Finally, ElectroMax DH10B T1 Phage Resistant Cells (Invitrogen Corporation) were transformed with recombinant plasmids by electroporation. Each library titer was estimated on Kanamycin LB plates (50 μg mL-1). Libraries were stored at -80°C in glycerol Super Optimal broth with Catabolite repression (SOC) medium until sequencing.
Libraries constructed using the Stratagene kit were done according to the manufacturer's recommendations. The resulting cDNAs were packaged into λ ZAP II phages using the Gigapack III Gold packaging kit (Stratagene).
For libraries constructed with Creator kit, equal quantities of total RNA from each genotype were mixed and cDNA synthesis was then performed from 1 μg of total RNA by long distance PCR according to the manufacturer's instructions. Sfi1 digestion, cDNA size fractionation, plasmid ligation and transformation into ElectroMax DH10B T1 Phage resistant cells were done according to the manufacturer's protocol. An aliquot of each transformation was spread on Chloramphenicol LB plates (30 μg mL-1) to determine the percentage of recombinant clones. When more than 75% of recombinant clones were obtained, transformation mixtures were pooled together to constitute the library. If necessary, new ligations were performed to give a final library of approximately 106 clones: the titer was estimated on Chloramphenicol LB plates. Glycerol SOC medium was added and libraries were stored at -80°C.
For suppression subtractive hybridization libraries, we used the method, originally described by Diatchenko et al. [46
], that is based on selective amplification of differentially expressed sequences. All libraries were comprised of a unique tissue collected on several genotypes of Quercus
species. Total RNA was extracted using the method described above. Double-stranded tester cDNA and driver cDNA were prepared from 1 μg total RNA of each sample using the SMART™PCR cDNA Synthesis Kit (Clontech) according to the manufacturer's protocol. The forward subtracted libraries were constructed using the PCR-select cDNA Subtraction Kit (Clontech). Amplified, differentially expressed cDNA fragments were cloned into either the pGEM T easy vector (Promega, Madison, WI, USA) or the PCR4 TOPO kit from Invitrogen. ESTs were obtained from the following tissues:
Bud: three SSH libraries (C, D, E), obtained as detailed in Derory et al. 
Leaf: two SSH libraries (Q, R) obtained using total RNA extracted from leaves sampled on 5 genotypes displaying extreme phenotypes for water-use-efficiency (WUE). The libraries were obtained by subtracting RNA form High WUE phenotypes vs. Low WUE phenotypes, and vice-versa .
Root: four SSH libraries (N and O for Q. robur, F and G for Q. petraea) constructed by subtracting sessile against pedunculate mRNA and vice-versa for early (6 hours) and late (24 and 48 hours pooled) flooding stress.
Sanger DNA sequencing
Sequencing was completed using the standard Sanger method as described by Sanger et al. . Briefly, clones were randomly isolated and arranged individually in 384-well microtitre plates for storage and processing and subjected to high-throughput single-path sequencing from either their 5'- and/or 3'-ends. cDNA libraries were sequenced at "Centre National de Séquençage" (Genoscope, Evry, France). SSH libraries were sequenced at the "Genome & Transcriptome" facility of Bordeaux. Briefly recombinant clones were re-amplified using M13 universal primer. PCR products were then purified using Multiscreen PCR micro 96 kit (Millipore, San Francisco, CA, USA) and subjected to single pass sequencing from their 5'-and/or 3'-end using BigDye version 3.1 kit (Applied Biosystems, Foster city, CA, USA) according to the manufacturer's instructions. All the sequences were run on an ABI 3730 (Applied Biosystems, Foster city, CA, USA) sequencing machine.
Plant material for pyrosequencing
Plant material was collected on different genotypes (Table 4):
i/libraries I to IV: buds were taken from two populations (30 Q. petraea genotypes each) at endo- and eco-dormancy to provide clues about genes differentially expressed between these two developmental stages.
ii/libraries V and VI: leaves and buds were sampled from two species (Q. robur and Q. petraea, 10 genotypes each) for the discovery of genes involved in species divergence.
iii/libraries VII and VIII: pollen and flowers were collected on 2 Q. petraea and 2 Q. robur genotypes, to enrich the tissue panel with reproductive organs.
iv/libraries IX to XIV: buds and leaves were collected on 6 parental trees of three full-sub pedigrees to detect polymorphic markers for genetic linkage mapping.
RNA extraction for pyrosequencing
Total RNA was extracted as described by Le Provost et al. 
cDNA library construction for pyrosequencing
We used the SMART PCR cDNA synthesis kit (Clontech) and MINT cDNA synthesis Kit (Evrogen) according to manufacturer's instructions. Library normalization was done for libraries IX to XIV by Beckman Coulter Genomics (Grenoble, France) using Duplex-Specific-Nuclease (Evrogen) according to manufacturer's instructions.
cDNA nebulisation, adaptor ligation, emulsion PCR and sequencing were done at Beckman Coulter Genomics (Danvers, MA, USA). Sequencing was performed using a Roche-454 Genome Sequencer platform (FLX or Titanium technology).
We have summarized the approach that was followed in Figure 1 (blue boxes).
A total of 145,827 Sanger ESTs were cleaned using the S eq U ence R epository and F eature detection pipeline (SURF, http://surf.toulouse.inra.fr/cgi-bin/surf.cgi, user name: oak, password: oak1). The documentation of SURF can be found at http://genome.jouy.inra.fr/doc/bioinfo/edition/surf-1.0/SURF.pdf. SURF includes: i/base calling by phred [12, 13], ii/masking and clipping of library specific vectors/adaptors using cross_match  with -minmatch 10 -minscore 15, iii/masking low complexity regions (mononucleotide repeats) using RepeatMasker , iv/screening of PCR kit sequences using cross_match against short UniVec ftp://ftp.ncbi.nih.gov/pub/UniVec/ sequences with parameters -minmatch 8 -minscore 10, v/screening and elimination of possible contaminants in putative insert using cross_match on the UniVec with -minmatch 10 -minscore 25 and short UniVec with -minmatch 10 -minscore 15, as well as on E. coli and yeast sequences with -minmatch 100 -minscore 150, and vi/detection of chloroplast sequences using cross_match with -minmatch 100 -minscore 150. The Quercus robur chloroplast genome sequence was kindly provided by F Sebastiani and GG Vendramin from CNR (Florence, Italy). The detailed parameters used in SURF are documented on the SURF web site. Three other features, namely "doubtful", "pcrkitful" and "not valid" were added by SURF. If library specific vectors, adaptors and primers were detected inside an insert, SURF judged the sequence as "doubtful" (possible chimera), if SURF detected short UniVec sequences inside an insert, the sequence was labelled as "pcrkitful" and if SURF detected contaminants inside an insert, "not valid" status was attached to the sequence. Too short sequences (length < 100) with high quality (phred QV > 20) were discarded for further analysis.
Prior to submission of sequences to the EMBL database, reads were further processed by qualityTrimmer in Euler-SR package  with the option -minQual 20. As indicated in Table 2, a total of 125,886 ESTs were finally submitted to EMBL: 57,750 for Q. petraea and 68,136 for Q. robur. These sequences can be accessed by submitting the following query: Quercus [Organism name] and Frigerio [Authors] to EMBL. Quality scores can be downloaded from SURF web site.
Sanger reads were also processed by trace2dbEST (ver. 3.0.1) pipeline  using the default parameters except that the minimum sequence length, required for sequences to pass further analysis, was set at 60 bp. The phred [12, 13] parameter for error probability was set to 0.05 (default value). The cross_match  parameters for vector (including adaptor) and E. coli masking were set at -minmatch 10 -minscore 20 and -minmatch 20 and -minscore 30, respectively. These cross_match parameters corresponded to the default values for trace2dbEST. Screening libraries, which included any sequences of adaptors, adaptor and vector junctions and vectors, were constructed for each vector and adaptor combinations. Poly A/T sequences with repeats ≥12 were also screened and masked. Possible chimera sequences were suggested by trace2dbEST if vector and/or E. coli sequences were detected in the middle of an insert. They were then discarded from further analysis.
The 454-reads (1,948,579 sequences) were screened by cross_match  for primers and adaptors and then masked. For each 454-read, the longest non-masked region was extracted and further cleaned-up by SeqClean . The shorter regions were discarded in order to take care of potential chimeras. This process resulted into 1,578,192 clean reads. The sequencing statistics for 454-reads were calculated by NG6 system (http://vm-bioinfo.toulouse.inra.fr/ng6/, user name: oak, password: quercus33). SRA (Sequence Read Archive) accession number is SRA012448 and can be accessed at http://www.ncbi.nlm.nih.gov/sra. In NG6, four kinds of analysis were performed in parallel for each library: i/in the first analysis, contaminants were searched for E. coli, phage and yeasts with Blast, ii/in the second, the quality of each read was analyzed, iii/in the third, reads were analyzed by pyrocleaner, to remove too short or too long sequences (sequences with more than or less than two standard deviations from the mean), dirty sequences (sequences with more than 4% of N), low complexity sequences and duplicated reads [35, 36], and iv/in the fourth, reads were assembled by Newbler (Roche) within each library. The duplicated reads in step iii were defined as clusters by megablast with minimum hit score of 100, percent identity cut-off of 98, alignment of reads starting exactly at the same position and ending in a 70 bp window of the end of the longest sequence.
Coverage of transcripts within libraries
In order to estimate the coverage of transcripts within libraries, we first randomized links between a EST sequence and the corresponding contig. Secondly, for each library, 100 EST sequences were selected and the number of corresponding contigs was counted. The second step was repeated until no EST sequences were left for the library. The relationship between number of EST sequences (X) and the number of contigs (Y) were modelled by the following equation:
where A and B are coefficients determined by nls function of R language http://www.r-project.org/. The coefficient A indicates the expected maximum number of contigs (transcripts) in a library under this model. The coefficient B as well as A relates to the rate of the increase of the number of contigs. The library coverage was defined as the ratio of the observed number of contigs by the coefficient A. We excluded libraries C, D, E, G, N, Q and R for this analysis, because they did not contain enough ESTs.
Assembly of Sanger and 454-reads was first carried out using the SIGENAE system http://www.sigenae.org/ that is based on the TGICL software http://compbio.dfci.harvard.edu/tgi/software/. This software uses the CAP3 assembler  that takes into account the quality of sequenced nucleotides into the computation of the alignment score. The different steps of the assembly are highlighted in red in Figure 1. First, Sanger reads were assembled into 15,835 tentative consensus sequences (TC) using TGICL (mgblast, a modified version of megablast  and CAP3). The software FrameDP  was then used to predict complete ORF of 2,000 TC (the longest ones) with SWISS-PROT  as a reference, resulting in 224 TCs which potentially contain full-length coding sequences (starting with a methionine residue and ending with a stop codon). The TCs were used to split large clusters built by mgblast using the sclust program within TGICL. Global assembly was performed using TGICL with 125,925 Sanger ESTs, 224 TCs and the 1,578,192 454-ESTs with sequence qualities and options of (-l) minimum overlap length of 100, (-p) minimum percent identity for overlap of 96 and (-s) splitting clusters larger than 100,000. The resulting 222,671 contigs and singletons were called OakContigV1 and are available at http://genotoul-contigbrowser.toulouse.inra.fr:9092/Quercus_robur/index.html (user: oak, password: quercus33). Because all of the sequences resulted from assembly were directly loaded into the OakContigV1 database, low quality regions were not filtered out.
MIRA (V3rc4) software [18, 55] was also used to directly perform hybrid assembly of Sanger and 454-reads (Figure 1, purple boxes), instead of assembling the consensus of 454 data with Sanger reads as in TGICL. MIRA was run with a standard options (-job = denovo, est, normal, sanger, 454) and no XML files. The assembly by MIRA was compared with OakContigV1. Contigs showing Reciprocal best Blast Hit (RBH) was searched between sequences in OakContigV1 and MIRA assembly by using BlastN with "soft" filtering option. Soft filtering makes it efficient to detect orthologous sequences .
We also used the PartiGene (ver. 3.0.5) pipeline  to compare OakContigV1 assembly to the assembly constructed using Sanger reads only (Figure 1, green boxes). The sequences cleaned by trace2dbEST were grouped, using CLOBB , into clusters based on Blast similarity. Sequences in the same cluster were then assembled using Phrap  (using the default parameters of PartiGene). The resulting 40,944 unigene elements (23,730 singletons and 17,214 contigs) were used for peptide prediction using FrameDP . The resulting 36,883 peptide sequences were finally used to detect comparative orthologous markers (see section below). Assemblies within each Sanger library were also conducted by PartiGene to estimate redundancy rate of libraries. The assembly by PartiGene was compared with OakContigV1. Contigs showing RBH was searched between sequences in OakContigV1 and PartiGene assembly by using BlastN with "soft" filtering option .
Finally, MIRA (V3.0.0) software was used to map 454-reads to the unigene elements constructed by PartiGene with the following options (job = mapping, normal, 454) and no XML files for 454-reads.
Annotation and similarity searches
A functional annotation was assigned for each contig and singleton of Oakcontigv1 (Figure 1, gray boxes). The strategy is based on homology search with public protein and nucleic acid sequence databases. BlastX  was carried out against SWISS-PROT (Release 57.1 of 14-Apr-2009)  and RefSeq_protein (Release 34 of 6-March-2009)  with e-value cut-off at 1e-5. Conserved protein domains were searched against Pfam (Release 23.0 of 20-Jul-2008)  with e-value cut-off at 1e-5. Blastn was carried out against OakContigV1, Refseq_RNA (Release 34 of 6-March-2009) and TIGR gene indices with e-value cut-off of 1e-30, 1e-5 and 1e-2, respectively. We used the following TIGR gene indices : AGI (Arabidopsis_thaliana release_14), HAGI (Helianthus_annuus release_6), NTGI (Nicotiana_tabacum release_5), MTGI (Medicago_truncatula release_9), OGI (Oryza_sativa release_17), PPLGI (Populus release_4), SGI (Picea release_3) and VVGI (Vitis_vinifera release_6). These gene indices were selected so as to represent phylogenetic relationships of land plants as follows: gymnosperm (SGI), monocots (OGI), rosids (VVGI), rosid I (MTGI and PPLGI), rosid II (AGI), asterid I (NTGI) and asterid II (HAGI). The Gene Ontology (GO)  annotation was based on the best hit in SWISS-PROT. The GO terms were mapped upon plant GOslim terms using Blast2GO software .
Detection of unique peptides based on FrameDP peptide prediction
Because oak ESTs contain sequences from about 200 individuals of Q. robur and Q. petraea, we expected to detect not only polymorphisms within species but also substitutions between species in the combined assembly. In addition, ESTs were collected from multi-stressed libraries (Table 1), which are likely to increase the number of splice variants. These factors may split EST clusters into multiple contigs. In order to estimate the minimum unigene sets, FrameDP  was used, translating assembled sequences into peptide sequences (Figure 1, orange boxes). The TAIR9_pep sequences (available at ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/) were used as reference sequences in FrameDP. The resulting peptide sequences from OakContigV1 were further clustered by BLASTClust, a part of the BLAST package  to provide a set of unique peptide that was further used to retain in silico SNPs within coding sequences.
Identification of candidate genes
Candidate genes for bud phenology related genes (list kindly provided by M. Lascoux & G. Zaina) were searched against peptide sequences estimated by FrameDP. The predicted peptide sequences were used in BlastP with e-value cut-off set at 1e-5. Drought stress resistance candidate genes with emphasis on cuticle formation in Arabidopsis thaliana were searched within OakContigV1 by BlastP or BlastX with e-value cut-off of 1e-10. Genes of the phenylpropanoid pathway within OakContigv1 were also targeted. We used GO annotation (see previous section) to convert GO terms into enzyme code (EC) using Blast2GO software . Those EC corresponding to the phenylpropanoid pathway were mapped on the corresponding KEGG map . In order to detect genes related to cell wall formation in OakContigV1, tBlastX was performed against the MAIZEWALL database  with e-value cut-off of 1e-10.
In silico mining of SSRs and SNPs
To detect simple sequence repeats (SSRs), we used mreps  and listed microsatellites with minimum repeat number of five, four, three, three and three for di-, tri-, tetra-, penta- and hexa-SSRs, respectively. SSRs were also screened within eight gene indices used for the annotation (see the section "Annotation and similarity searches"). To visualize phylogenetic similarity of the SSR motif distribution, data were analyzed by self organizing map (SOM) using som_pack (ver. 3.1) , which utilizes unsupervised pattern recognition algorithms.
For SNP detection, we first took into account the presence of duplicated reads in order to avoid false SNP detection [35, 36], i.e. a single representative was kept for the analysis. Then, putative SNPs were screened for contigs with a coverage depth of more than six sequences. If the less frequent allele count was more than two and 100% identical for four bases before and after the polymorphic site, we considered this site as a putative SNP. SNPs were summarized according to the following categories: transition/transversion, synonymous/non-synonymous and coding/non-coding. The frame-corrected nucleotide sequences inferred by FrameDP  as coding regions were used for the identification of "coding SNPs". By fasty35 program in FASTA package , we re-mapped the frame-corrected nucleotide sequence onto corresponding OakContigV1 original sequence. The peptide sequences inferred by FrameDP were used as references to identify non-synonymous mutations at the SNP site. Using tfasty35, non-synonymous mutations were identified by alignment of the allelic peptide sequence with the reference peptide sequence. Finally, a more stringent set of SNPs was retained, considering contigs for which a single protein was predicted by FrameDP. In addition, SNPs whose unigene elements represented significant blast hits with organelle sequences (Q. robur chloroplast kindly provided by F Sebastiani and GG Vendramin from CNR, Florence, Italy and Vitis vinifera mitochondria ) were detected. The percent identity and coverage threshold was set at 90% and 80%, respectively, for chloroplast and 60% and 70%, respectively, for mitochondria. These SNPs useful for population genetic studies were discarded from the nuclear SNP data set. Both nuclear and organelle SNPs data sets have been made available at Quercus portal https://w3.pierroton.inra.fr:8443/QuercusPortal/Home.jsf. SNPs whose unigene elements presented Sanger reads with significant hit with Q. robur chloroplast sequence in the SURF process were also eliminated.
Gene diversity (
), where n is the number of reads included for the calculation and P
is allele frequency, was estimated for Quercus. At SNP sites in each assembly, we randomly selected one read from each library to avoid multiple sampling of the same allele in the same individual. We only targeted SNP sites with number of reads ≥8 for each species.
Synteny and duplication analysis
We used the two parameters recently defined by Salse et al. [41–43] to increase the stringency and significance of Blast sequence alignment by parsing Blast results and rebuilding HSPs (High Scoring Pairs) or pairwise sequence alignments to identify accurate paralogous and orthologous relationships.
Distribution of KS distances (MYA scale) for paralogous and ortholougous gene pairs
We performed the sequence divergence as well as speciation event datation analysis based on the rate of nonsynonymous (Ka) vs. synonymous (Ks) substitutions calculated with PAML (Phylogenetic Analysis by Maximum Likelihood) . The average substitution rate (r) of 6.5 × 10-9 substitutions per synonymous site per year for grasses is classically used to calibrate the ages of the considered gene [64, 65]. The time (T) since gene insertion is then classically estimated using the formula T = Ks/r.