Genomic DNA and library preparation
We extracted DNA from the type strain of S. arboricolus, H-6T (CBS 10644T), isolated in China from the bark of Quercus fabri. For the Roche 454 library construction and sequencing, 5 μg of high molecular weight genomic DNA was used to make a standard shotgun DNA library as described in the Roche GS FLX Titanium General Library Preparation Method Manual, with the exception of DNA fragmentation, which was done with a Covaris S2 sonicator (fragmentation parameters: Duty cycle - 5%, Intensity - 1, cycles/burst - 200, time - 85 s, bath temperature - 5°C). 15 μg of high molecular weight genomic DNA was used to make the 8 Kb paired end library as stated in the Roche GS FLX Titanium 8 Kb Span Paired end library preparation method manual, with two exceptions: DNA extraction was done using the QIAquick Gel extraction kit (Qiagen, Cat no. 28760) instead of electroelution as stated in the manual, and fragmentation of circularised DNA was done using a Covaris S2 sonicator (Duty cycle - 5%, Intensity - 3, cycles/burst - 200, time - 120 s, bath temperature - 5°C). Sequencing of the standard shotgun fragment library was carried out on ¾ of a PTP and the 8 Kb paired end library was sequenced on a full PTP using Roche 454 Titanium sequencing chemistry.
For SOLiD library construction and sequencing, 500 ng of high molecular weight genomic DNA was used to make a barcoded DNA fragment library as stated in the SOLiD 4 library preparation guide. Enzymes and reagents were taken from the NEBNext DNA sample prep Master mix set 3 (NEB, Cat no. E6060L). The barcoded DNA fragment library was quantified using the Kapa Library Quantification kit (Kapa Biosystems, Cat no. KK4823). Library size selection (200-300 bp) was carried out using a 2% SizeSelect E-Gel (Life Technologies, Cat no. G6610-02). The SOLiD EZ Bead System was used according to the manufacturer’s guide for ePCR and templated bead enrichment. Sequencing was performed on a SOLiD 4 analyser according to the manufacturer’s instructions to generate 50 bp reads in colour space.
We assembled the genome of S. arboricolus using the Newbler algorithm (v2.3, Roche) for de novo assembly of reads generated by the 454 pyrosequencing platform. Combinations of read datasets, reads added in assembly iterations, and assembler parameters were explored before selecting the optimal combination according to assembly metrics (number of scaffold sequences and contigs, the average and longest contig length, and the N50 value). All reads were trimmed against a dataset of adapter and vector sequences in the initial step of the assembly process.
The selected assembly parameters used an expected coverage value of 40X with all other settings remaining at default values. Two assembler iterations were employed; the first iteration included all 734,353 single fragment reads and one set of 583,674 paired-end reads. The second iteration incorporated an additional set of 518,434 paired-end reads. A third set of paired-end reads was excluded from the assembly due to decreased performance with their inclusion.
The resulting genome assembly comprised 32 scaffold sequences with a total length of 11,465,281 bp. The scaffolds were composed of 266 contigs (≥500 bp) with an average length of 43,102 bp (538,482 bp max.) and an N50 value of 136,945 bp. The mapped read coverage of the assembly was 49X.
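The assembly metrics used to compare candidate assemblies can be computed from a list of contig lengths. The following is a minimal sketch, not the procedure used in this study; the function names and the toy contig lengths are hypothetical, and the N50 definition assumed is the common one (the length L such that contigs of length at least L contain half or more of the assembled bases).

```python
def n50(lengths):
    """Return the N50 of a non-empty list of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths supplied")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def assembly_summary(lengths):
    """Collect the metrics named in the text for one candidate assembly."""
    return {
        "count": len(lengths),
        "total": sum(lengths),
        "mean": sum(lengths) / len(lengths),
        "longest": max(lengths),
        "n50": n50(lengths),
    }


# Hypothetical toy contig lengths, for illustration only.
toy_contigs = [100, 200, 300, 400, 500]
summary = assembly_summary(toy_contigs)
```

Candidate assemblies could then be ranked by comparing these summaries side by side.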
Pyrosequencing error correction
In order to resolve small errors in the assembly arising from pyrosequencing artifacts, such as homopolymer sequence regions [43, 44], we acquired deep sequence coverage (~100X) from short reads. We generated a total of 31,316,59 short (50 bp) reads from a SOLiD 4 single fragment library. Subsequent gapped read alignment and variant calling were performed using Bioscope 1.3.1 (Life Technologies).
An iterative correction process was devised in which errors in the assembled sequence were identified from the SOLiD read alignment data as either single nucleotide polymorphisms (SNPs) for single base errors, or as small InDels (insertions/deletions) for homopolymer pyrosequencing errors. Each iteration of the assembly correction process involved the initial mapping of SOLiD reads against the 454 assembly, followed by SNP calling. Selected putative SNPs were then integrated into the assembly sequence and SOLiD reads were remapped to allow InDels to be called and integrated. This process was repeated until no additional variants were detected. In subsequent iterations additional reads were mapped, allowing the identification and correction of a small number of further errors. Both SNPs and InDels were called from alignment data using Bioscope 'high stringency' variant parameter settings. Additionally, integrated variants were required to represent a minimum of 60% of the alignment data.
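The integration step of one correction round can be illustrated as follows. This is a sketch under stated assumptions: the `Variant` record, its field names and the example sequence are hypothetical, variant calls are taken as given (Bioscope itself is not invoked), and the 60% threshold is interpreted as supporting reads divided by reads covering the site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    pos: int       # 0-based position in the assembly
    ref: str       # bases currently in the assembly ("" for an insertion)
    alt: str       # replacement bases ("" for a deletion)
    support: int   # reads supporting the alternative allele
    depth: int     # reads covering the site


MIN_PERCENT = 60  # minimum share of the alignment data, from the text


def passes(variant):
    """True if the variant is backed by at least 60% of covering reads."""
    return variant.depth > 0 and variant.support * 100 >= MIN_PERCENT * variant.depth


def apply_variants(sequence, variants):
    """Integrate accepted variants, right to left so coordinates stay valid."""
    accepted = sorted((v for v in variants if passes(v)),
                      key=lambda v: v.pos, reverse=True)
    for v in accepted:
        if sequence[v.pos:v.pos + len(v.ref)] != v.ref:
            raise ValueError(f"reference mismatch at position {v.pos}")
        sequence = sequence[:v.pos] + v.alt + sequence[v.pos + len(v.ref):]
    return sequence


# Hypothetical example: one SNP, one homopolymer deletion, one weak call.
calls = [
    Variant(pos=1, ref="C", alt="G", support=9, depth=10),
    Variant(pos=3, ref="T", alt="", support=7, depth=10),
    Variant(pos=8, ref="A", alt="C", support=5, depth=10),
]
corrected = apply_variants("ACGTTTTGA", calls)
```

In an iterative scheme, reads would be remapped to `corrected` and the step repeated until no call passes the filter.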
Gene annotation and orthology assignments
S. cerevisiae was used as the reference proteome for the program exonerate, which uses comparative approaches for gene finding based on protein sequence similarity. An initial pass with the protein2dna model and a refine boundary of 2000 was used to find the best orthologous candidate of each S. cerevisiae gene. For intron-containing genes, the maximum intron size was limited to 1500 bp and the model used was protein2genome.
To annotate genes within the S. arboricolus genome, we first identified gene orthologs with conserved synteny. To do so, we analysed the top hit by exonerate for each S. cerevisiae gene. When three neighbouring genes within S. cerevisiae all identified three neighbouring genes within S. arboricolus, we assigned the S. arboricolus gene in the middle (flanked by its two neighbours) as a syntenic ortholog. This initial step discovered most of the syntenic genes within S. arboricolus. Other genes within S. cerevisiae that had not been assigned an ortholog were further analysed under the hypothesis that these may have exonerate hits within the expected positions that were not the most similar sequence within S. arboricolus. We looked at the top 10 exonerate hits of the remaining S. cerevisiae genes for matches in S. arboricolus between the initially assigned syntenic orthologs. When only one hit was found between these syntenic orthologs, we used this hit as a newly discovered syntenic ortholog. This process was repeated until no more syntenic orthologs could be found. Finally, for the few remaining S. cerevisiae genes that were still not assigned a syntenic ortholog, we assigned the top exonerate hit as the non-syntenic ortholog, provided that it did not overlap with another gene prediction.
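The first pass of the synteny rule (three neighbouring genes identifying three neighbouring hits) can be sketched as below. The data layout is an assumption made for illustration: each top exonerate hit is reduced to a hypothetical `(scaffold, rank)` pair, where `rank` is the order of the hit along its scaffold, and neighbouring hits are taken to be consecutive ranks in either orientation.

```python
def syntenic_middle_genes(scer_order, top_hit):
    """Assign the middle gene of each collinear triplet as a syntenic ortholog.

    scer_order: S. cerevisiae gene names in chromosomal order (one chromosome).
    top_hit:    gene name -> (scaffold, rank) of its best hit; genes without
                a hit are simply absent from the mapping.
    """
    orthologs = {}
    triplets = zip(scer_order, scer_order[1:], scer_order[2:])
    for left, middle, right in triplets:
        hits = [top_hit.get(gene) for gene in (left, middle, right)]
        if any(hit is None for hit in hits):
            continue
        (s1, r1), (s2, r2), (s3, r3) = hits
        step = r2 - r1
        same_scaffold = s1 == s2 == s3
        # Consecutive ranks in forward (+1) or inverted (-1) orientation.
        collinear = step in (1, -1) and r3 - r2 == step
        if same_scaffold and collinear:
            orthologs[middle] = (s2, r2)
    return orthologs


# Hypothetical gene names and hits, for illustration only.
order = ["A", "B", "C", "D", "E"]
hits = {"A": ("s1", 0), "B": ("s1", 1), "C": ("s1", 2),
        "D": ("s2", 5), "E": ("s1", 4)}
assigned = syntenic_middle_genes(order, hits)
```

The later passes described in the text (searching the top 10 hits between already assigned orthologs) would build on the `assigned` mapping and are not shown.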
De novo gene prediction on the S. arboricolus genome was performed using GeneMark-ES, version 2. The total number of predicted genes was 5005 within the 16 assembled chromosomes (5038 in total). Of these, 95 genes had non-overlapping coordinates with the genes predicted by Exonerate within the 16 assembled chromosomes (106 in total when including the 19 scaffolds that did not assemble into the chromosomes).
A significant issue when using a comparative-based method, such as exonerate, for gene prediction is that gene boundaries are often incorrectly predicted if there is a lack of homology at these ends. Initially, a large number of predicted genes did not contain a start or stop codon (637 genes and 1121 genes respectively). We have attempted to rectify these starts and ends by extending or truncating the predicted CDS. First, CDSs were extended if a stop codon could be found within 9 codons from the end of our gene prediction. This corrected 857 cases of missing stop codons and further extension only slightly improved the annotation. Second, for start codons, the methionine can be on either side of the predicted gene start. We therefore extended the predicted gene until a methionine was found, but only when a methionine could be found within 9 codons and without any intervening stop codons. In the cases where a stop codon occurred before a suitable methionine was identified, we truncated the CDS to a downstream methionine if it occurred within 9 codons. This corrected 348 cases of missing start codons. Finally, intron-containing genes were left untouched as missing starts and ends for these genes could be due to a missing exon. We note that for intron-containing genes, we specifically use the protein2genome model in exonerate that explicitly attempts to predict all exons found in S. cerevisiae genes. This assumes that the presence of introns is conserved between S. cerevisiae and S. arboricolus.
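The boundary repair can be sketched for the simplest case: a single-exon gene on the forward strand. This is an illustrative sketch rather than the code used in the study; the function names and the toy sequence are hypothetical, coordinates are 0-based with an exclusive end, and the downstream truncation to a methionine is omitted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_CODONS = 9  # search window from the text


def extend_to_stop(genome, cds_end):
    """Return a new exclusive CDS end that includes a stop codon, or None."""
    if genome[cds_end - 3:cds_end] in STOP_CODONS:
        return cds_end  # prediction already ends with a stop codon
    for i in range(MAX_CODONS):
        start = cds_end + 3 * i
        codon = genome[start:start + 3]
        if len(codon) < 3:
            break  # ran off the end of the sequence
        if codon in STOP_CODONS:
            return start + 3
    return None


def extend_to_start(genome, cds_start):
    """Return a new CDS start at an upstream ATG, or None.

    The search stops if an in-frame stop codon is met before an ATG.
    """
    if genome[cds_start:cds_start + 3] == "ATG":
        return cds_start
    for i in range(1, MAX_CODONS + 1):
        start = cds_start - 3 * i
        if start < 0:
            break
        codon = genome[start:start + 3]
        if codon in STOP_CODONS:
            return None
        if codon == "ATG":
            return start
    return None


# Hypothetical toy sequence: CCC ATG AAA GGG CCC TTT TAA GG
toy = "CCCATGAAAGGGCCCTTTTAAGG"
new_end = extend_to_stop(toy, 15)    # prediction covered GGG CCC only
new_start = extend_to_start(toy, 9)
```

Here the partial prediction spanning positions 9 to 15 would be extended to the full open reading frame.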
We aligned the protein sequence orthologs for the sensu stricto species using MAFFT with default settings, either with or without S. castellii orthologs as an outgroup. For the coding sequence analysis we inserted the gaps back into the DNA sequences. Phylogenetic analysis was performed using PAML, either with the codon model for the DNA sequence analysis or with an empirical model for the amino acid analysis. Because we are only concerned with the placement of S. arboricolus within the established sensu stricto yeast phylogeny, we compared the likelihood of several putative tree topologies that differ only in the position of S. arboricolus (Figure S2).
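Inserting the protein alignment gaps back into the DNA sequences amounts to threading each coding sequence through its aligned protein, one codon per residue. A minimal sketch follows; the function name and example sequences are hypothetical, and it assumes the coding sequence has no stop codon beyond what the aligned protein represents.

```python
def thread_gaps(aligned_protein, cds):
    """Build a codon alignment row from an aligned protein and its CDS.

    Each residue consumes one codon; each gap character '-' becomes '---'.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons):
        raise ValueError("protein and CDS lengths disagree")
    codon_iter = iter(codons)
    return "".join("---" if aa == "-" else next(codon_iter)
                   for aa in aligned_protein)


# Hypothetical example: three residues aligned with one internal gap.
codon_row = thread_gaps("M-KT", "ATGAAAACC")
```

Applying this to every sequence in an alignment yields rows of equal length, suitable as input for a codon model.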
To annotate tRNA coding sequences, we predicted tRNAs using tRNAscan-SE with default settings on the 16 assembled chromosomes. To determine whether or not these predicted tRNAs are syntenic with respect to S. cerevisiae, we used a strategy analogous to that described above for gene annotations. tRNA coding sequences were annotated as syntenic orthologs if they were flanked by genes within S. arboricolus that were assigned as syntenic orthologs and if a tRNA was also found in S. cerevisiae between those genes. In all but one case, the syntenic tRNAs code for the same amino acids.
Chromosomal structure plots
Chromosome structure plots for the Saccharomyces sensu stricto species were constructed using Mauve. Assembled chromosomes for S. paradoxus (strain CBS432) were obtained from one previously published source, and those for S. mikatae (IFO 1815 T), S. kudriavzevii (IFO 1802 T) and S. bayanus var. uvarum (strain CBS 7001) from another. As these chromosome assemblies have been constructed partly by using the S. cerevisiae genome to orient and order scaffolds, alignments were also made to the unordered scaffolds using MUMmer to confirm the relative orientation of chromosomal segments inverted between species.
Mapping of the phenotype landscape of S. arboricolus
The bulk of the phenotypic data was taken from our recent publication on sensu stricto phenotypes, where it was included for completeness but where S. arboricolus phenotypes were not specifically analyzed or considered. The data displayed as growth curves in this study correspond to novel confirmatory runs performed to ensure the reliability of specific statements. Three diploid isolates of Saccharomyces arboricolus were collected as described previously and stored long-term in 20% glycerol at -80°C. Isolates were subjected to high throughput phenotyping by micro-cultivation (n=2) in an array of environments (Additional file 5: Table S2) essentially as previously described. For pre-cultivations, strains were inoculated in 350 μL of SD medium (0.14% yeast nitrogen base, 0.5% ammonium sulfate and 1% succinic acid; 2% (w/v) glucose; 0.077% Complete Supplement Mixture (CSM, ForMedium), pH set to 5.8 with NaOH or KOH) and incubated for 48 h at 30°C. For experiments where the removal of a specific media component was studied, the pre-culture was performed in the absence of this component in order to completely deplete the component in question. For experiments where alternative nitrogen sources were used, two consecutive pre-cultures were performed, the first containing low amounts of ammonium sulfate (0.05%), the second replacing ammonium with the indicated nitrogen source in amounts corresponding to equivalent moles of N. For all experimental runs, strains were inoculated to an OD of 0.03 - 0.1 in 350 μL of SD medium and cultivated for 72 h in a Bioscreen analyser C (Growth curves Oy, Finland). Optical density was measured using a wide band (450-580 nm) filter. Incubation was at 30.0°C (±0.1°C) with ten minutes preheating time. Plates were subjected to shaking at the highest shaking intensity, with 60 s of shaking every other minute. OD measurements were taken every 20 min. Strains were run in duplicate on separate plates with ten replicates of the universal S. cerevisiae reference strain BY4741 or its prototrophic mother S288C, in randomised (once) positions on each plate as a reference. The reproductive rate (population doubling time), lag (population adaptation time) and efficiency (population total change in density) were extracted from high density growth curves and expressed relative to the corresponding fitness variables of the reference strain BY4741 or, in conditions directly involving alterations of nitrogen content, its prototrophic mother S288C, as described previously. The derived Log2 ratios (Log2 (BY4741/isolate) or, in the case of efficiency, Log2 (isolate/BY4741)) were used for subsequent analysis.
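The relative fitness calculation can be illustrated with a short sketch. The function name, dictionary keys and example values are hypothetical; the only element taken from the text is the direction of each ratio (reference over isolate for rate and lag, isolate over reference for efficiency), so that a positive value indicates better performance than the reference in every case.

```python
import math


def relative_fitness(isolate, reference):
    """Return Log2 ratios of isolate fitness variables against a reference.

    Both arguments map 'doubling_time', 'lag' and 'efficiency' to
    positive numbers extracted from growth curves.
    """
    return {
        # Shorter doubling time or lag than the reference gives a positive value.
        "rate": math.log2(reference["doubling_time"] / isolate["doubling_time"]),
        "lag": math.log2(reference["lag"] / isolate["lag"]),
        # Larger total change in density than the reference gives a positive value.
        "efficiency": math.log2(isolate["efficiency"] / reference["efficiency"]),
    }


# Hypothetical values, for illustration only.
reference_strain = {"doubling_time": 2.0, "lag": 1.0, "efficiency": 1.0}
isolate_strain = {"doubling_time": 4.0, "lag": 1.0, "efficiency": 2.0}
ratios = relative_fitness(isolate_strain, reference_strain)
```

In practice the reference values would presumably be summarised over the ten reference replicates on each plate before the ratio is taken; that summarisation step is not specified here and is left out of the sketch.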
Raw sequencing reads are available from the European Nucleotide Archive (EBI ENA) for the SOLiD reads [EMBL: ERP001702], Roche 454 single fragment reads [EMBL: ERP001703] and Roche 454 paired-end reads [EMBL: ERP001704]. The assembled genome is available from NCBI as Saccharomyces arboricola [GenBank: ALIE00000000].