Chromosome level assembly and comparative genome analysis confirm lager-brewing yeasts originated from a single hybridization

Background The lager brewing yeast, S. pastorianus, is a hybrid between S. cerevisiae and S. eubayanus with extensive chromosome aneuploidy. S. pastorianus is subdivided into Group 1 and Group 2 strains, where Group 2 strains have higher copy number and a larger degree of heterozygosity for S. cerevisiae chromosomes. As a result, Group 2 strains were hypothesized to have emerged from a hybridization event distinct from Group 1 strains. Current genome assemblies of S. pastorianus strains are incomplete and highly fragmented, limiting our ability to investigate their evolutionary history. Results To fill this gap, we generated a chromosome-level genome assembly of the S. pastorianus strain CBS 1483 from Oxford Nanopore MinION DNA sequencing data and analysed the newly assembled subtelomeric regions and chromosome heterozygosity. To analyse the evolutionary history of S. pastorianus strains, we developed Alpaca: a method to compute sequence similarity between genomes without assuming linear evolution. Alpaca revealed high similarities between the S. cerevisiae subgenomes of Group 1 and 2 strains, and marked differences from sequenced S. cerevisiae strains. Conclusions Our findings suggest that Group 1 and Group 2 strains originated from a single hybridization involving a heterozygous S. cerevisiae strain, followed by different evolutionary trajectories. The clear differences between both groups may originate from a severe population bottleneck caused by the isolation of the first pure cultures. Alpaca provides a computationally inexpensive method to analyse evolutionary relationships while considering non-linear evolution such as horizontal gene transfer and sexual reproduction, providing a complementary viewpoint beyond traditional phylogenetic approaches.


Background
The lager-brewing yeast Saccharomyces pastorianus is an interspecies hybrid between S. cerevisiae and S. eubayanus. Lager brewing emerged in the late middle ages and was carried out during winter months at temperatures between 8 and 15°C, followed by a prolonged maturation period referred to as lagering [1,2]. While S. cerevisiae is a well-studied species frequently used in biotechnological processes [3], S. eubayanus was only discovered in 2011 and has thus far only been isolated from the wild [4]. Therefore, the ancestral S. pastorianus hybrid likely emerged from a spontaneous hybridization between an ale brewing S. cerevisiae yeast and a wild S. eubayanus contaminant, and took over lager brewing due to increased fitness under these conditions [4][5][6]. Indeed, laboratorymade S. cerevisiae x S. eubayanus hybrids demonstrated hybrid vigour by combining the fermentative capacity and sugar utilisation of S. cerevisiae and the ability to grow at lower temperatures of S. eubayanus [7,8].
The genomes of S. pastorianus strains are highly aneuploid, containing 0 to 5 copies of each chromosome [5,[9][10][11][12][13]. Between 45 and 79 individual chromosomes were found in individual S. pastorianus genomes, compared to a normal complement of 32 chromosomes in euploid Saccharomyces hybrids. The degree of aneuploidy of S. pastorianus is exceptional in the Saccharomyces genera, and likely evolved during its domestication in the brewing environment [9]. Nevertheless, two groups can be distinguished based on their genome organisation: Group 1 strains, which have approximately haploid S. cerevisiae and diploid S. eubayanus chromosome complements; and Group 2 strains, which have approximately diploid to tetraploid S. cerevisiae and diploid S. eubayanus chromosome complements [5,10,11,14].
Group 1 and Group 2 strains in S. pastorianus were initially thought to have originated from two different hybridization events. Some lager-specific genes from Group 2 strains are absent in Group 1 strains, and the subtelomeric regions of Group 1 and Group 2 strains differ substantially [15,16]. Based on these differences, Group 1 and Group 2 strains were hypothesized to have emerged from different independent hybridization events, involving a haploid S. cerevisiae for Group 1 strains and a higher ploidy S. cerevisiae strain for Group 2 strains [5,17]. Indeed, crosses between S. cerevisiae and S. eubayanus strains with varying ploidies could be made in the laboratory, all of which performed well in the lager brewing process [18]. Comparative genome analysis between Group 1 and Group 2 strains revealed that there were more synonymous nucleotide differences in the S. cerevisiae subgenome than in the S. eubayanus subgenome [19]. As accumulation of synonymous mutations was presumed to equally affect both genomes, the authors hypothesized that Group 1 and 2 strains originated from two hybridizations, with a similar S. eubayanus parent and different S. cerevisiae parents.
More recent studies now support that Group 1 and Group 2 strains originated from the same hybridization event. Identical recombinations between the S. cerevisiae and S. eubayanus subgenomes were found at the ZUO1, MAT, HSP82 and XRN1/KEM1 loci in all analysed S. pastorianus strains [11,13,14], which did not emerge when such hybrids were evolved under laboratory conditions [20]. These conserved recombinations indicate that all S. pastorianus strains share a common S. cerevisiae x S. eubayanus hybrid ancestor, and that the differences between Group 1 and Group 2 strains emerged subsequently. Sequence analysis of ten S. pastorianus genomes revealed that the S. cerevisiae sub-genome in Group 1 strains is relatively homozygous, while Group 2 strains possess heterozygous sub-regions [11]. Moreover, heterozygous nucleotide stretches in Group 2 strains were composed of sequences highly similar to Group 1 genomes and of sequences from a different S. cerevisiae genome with a 0.5% lower sequence identity. As a result, the authors formulated two hypotheses to explain the emergence of Group 1 and Group 2 strains from a shared ancestral hybrid: (i) the ancestral hybrid had a heterozygous S. cerevisiae subgenome, and Group 1 strains underwent a massive reduction of the S. cerevisiae genome content while Group 2 did not, or (ii) the ancestral hybrid had a homozygous Group 1-like genome and Group 2 strains were formed by a subsequent hybridization event of such a Group 1-like strain with another S. cerevisiae strain, resulting in a mixed S. cerevisiae genome content in Group 2 strains.
Since the exact S. cerevisiae and S. eubayanus ancestors of S. pastorianus are not available, the evolutionary history of S. pastorianus has so far been based on the sequence analysis using available S. cerevisiae and S. eubayanus reference genomes [5,11]. However, these reference genomes are not necessarily representative of the original parental genomes of S. pastorianus. Although S. pastorianus genomes are available, they were sequenced with short-read sequencing technology [10][11][12][13] preventing assembly of large repetitive stretches of several thousand base pairs, such as TY-elements or paralogous genes often found in Saccharomyces genomes [21]. The resulting S. pastorianus genomes assemblies are thus incomplete and fragmented into several hundred or thousand contigs [10][11][12][13].
Single-molecule sequencing technologies can output reads of several thousand base pairs and span entire repetitive regions, enabling near complete chromosomelevel genome assemblies of Saccharomyces yeasts [22][23][24][25][26][27]. In addition to the lesser fragmentation, the assembly of regions containing repetitive sequences reveals large numbers of previously unassembled open reading frames, particularly in the sub-telomeric regions of chromosomes [24,25,27]. Sub-telomeric regions are relatively unstable [28], and therefore contain much of the genetic diversity between different strains [29,30]. In S. pastorianus, notable differences were found between the sub-telomeric regions of Group 1 and Group 2 strains [15,16], which could be used to understand their origin. Moreover, repetitive regions are enriched for genes with functions determining the cell's interaction with its environment, such as nutrient uptake, sugar utilization, inhibitor tolerance and flocculation [31][32][33][34]. As a result, the completeness of sub-telomeric regions is critical for understanding genetic variation and evolutionary relationships between strains, as well as for understanding their performance in industrial applications [24,29,30].
Here, we used Oxford Nanopore MinION sequencing to obtain a chromosome-level assembly of the Group 2 S. pastorianus strain CBS 1483 and analysed the importance of new-found sequences relative to previous genome assemblies, with particular focus on industrially-relevant subtelomeric gene families. As the CBS 1483 genome contains multiple non-identical copies for many chromosomes, we analysed structural and sequence-level heterozygosity using short-and long-read data. Moreover, we developed a method to investigate the evolutionary origin of S. pastorianus by evaluating the genome similarity of several Group 1 and Group 2 S. pastorianus strains relative to a large dataset of S. cerevisiae and S. eubayanus genomes, including an isolate of the Heineken A-yeast® lineage which was isolated by dr. Elion in 1886 and is still used in beer production today.

Results
Near-complete haploid assembly of CBS 1483 We obtained 3.3 Gbp of whole genome sequencing data of the Saccharomyces pastorianus strain CBS 1483 using 4 flow cells on Oxford Nanopore Technology's MinION platform. Based on a genome size of 46 Mbp accounting for all chromosome copy numbers, the combined coverage was 72x with an average read length of 7 Kbp (Additional file 2: Figure S1). We assembled the reads using Canu [35] and performed manual curation involving circularization of the mitochondrial DNA, scaffolding of ScXII (chromosome XII of the S. cerevisiae sub-genome) and resolution of assembly problems due to inter-and intra-chromosomal structural heterozygosity in ScI and ScXIV (Fig. 1). Assembly errors were corrected with Pilon [36] using paired-end Illumina reads with 159x coverage. We obtained a final assembly of 29 chromosome contigs, 2 chromosome scaffolds, and the complete mitochondrial contig leading to a total size of 23.0 Mbp ( Fig. 2 and Table 1). The assembly was remarkably complete: of the 31 chromosomes (in CBS 1483 ScIII and SeIII recombined into a chimeric SeIII-ScIII chromosome [10], 29 were in single contigs; 21 of the chromosomes Fig. 1 Structural heterozygosity within multiple copies of the S. cerevisiae chromosome I of CBS 1483. a Layout of S. cerevisiae chromosome I in the assembly graph. Paths 1 and 2 (blue text and arrows) represent alternative contigs in the right-end of the chromosome-the gene UIP3 is deleted in path 2. b Sequencing coverage of long-read alignments of CBS 1483 in the right-end of chromosome I after joining path 1 and discarding path 2. The location of the UIP3 gene is indicated. c Alignment overview of five raw long-reads supporting the introgression of a~14 Kbp in chromosome I (salmon colour) to a region at the right-end of chromosome XIV (brown colour) in the S. cerevisiae sub-genome. The additional alignments (pink and orange) are alignments to computationally confirmed Ty-2 repetitive elements. d Schematic representation of the two chromosome architectures of S. cerevisiae chromosome XIV (brown colour) due to translocation of an additional copy of the right arm of chromosome I (salmon colour) to the left arm of chromosome XIV Fig. 2 Overview of the long-read-only de novo genome assembly of the S. pastorianus strain, CBS 1483. For each chromosome, all copies are represented as coloured rectangles. Genomic material originating from S. cerevisiae (blue) and from S. eubayanus (red) are shown, and the position of the centromere is indicated by the constricted position within each rectangle. Heterozygous SNP calls are represented as vertical, black lines and are drawn with transparency to depict the density of SNP calls in a given region. Underlying chromosome copy number data and the list of heterozygous SNPs is available in Additional file 3: Figure S2 and Additional file 1 F contained both telomere caps; 8 contained one of the caps; and 2 were missing both caps. Some chromosomes contain sequence from both parental sub-genomes due to recombinations; those chromosomes were named SeIII-ScIII, SeVII-ScVII, ScX-SeX, SeX-ScX and SeXIII-ScXIII, in accordance with previous nomenclature [10]. Annotation of the assembly resulted in the identification of 10,632 genes (Additional file 1a). We determined chromosome copy number based on coverage analysis of short-read alignments to the genome assembly of CBS 1483 ( Fig. 2 and Additional file 3: Figure S2).

Comparison between Oxford nanopore minION and illumina assemblies
In order to compare our novel long-read assembly of CBS 1483 to the previous assembly generated using short-read data, we aligned contigs of CBS 1483 from van den Broek et al. [10] to our current long-read assembly, revealing a total 1.06 Mbp of added sequence. The added sequence overlapped with 323 ORFs (Additional file 1b). Conversely, aligning the long-read assembly to the van den Broek et al. 2017 assembly revealed that only 14.9 Kbp of sequence were lost, affecting 15 ORFs (Additional file 1c). Gene ontology analysis of the added genes showed enrichment of several biological processes, functions, and components such as flocculation (P-value = 7.44 × 10 − 3 ) as well as transporter activity for several sugars including mannose, fructose and glucose (P-value ≤1.5 × 10 − 5 ) (Additional file 1d). Among the added genes were various members of subtelomeric gene families such as the FLO, SUC, MAL, HXT and IMA genes (Additional file 1e). Due to their role in the brewing-relevant traits such as carbohydrate utilization and flocculation, the complete assembly of subtelomeric gene families is crucial to capture different gene versions and copy number effects.
The assembly of CBS 1483 contained 9 MAL transporters, which encode for the ability to import maltose and maltotriose [37][38][39], constituting 85% of fermentable sugar in brewer's wort [40]. The S. cerevisiae subgenome harboured ScMAL31 on ScII, ScMAL11 on ScVII and on SeVII-ScVII, and ScMAL41 on ScXI (Additional file 1b and e). However, the ScMAL11 gene, also referred to as AGT1, was truncated, and there was no ScMAL21 gene due to the complete absence of ScIII, as reported previously [10,12]. In the S. eubayanus subgenome, MAL31-type transporter genes were found in SeII, SeV, and SeXIII-ScXIII, corresponding to the location of the S. eubayanus transporter genes SeMALT1, SeMALT2 and SeMALT3, respectively [25]. In addition, a MAL11-like transporter was found on SeXV. Consistently with previous reports, no MTY1-like maltotriose transporter was found in CBS 1483 [10]. Due to the absence of MTY1 and the truncation of ScMAL11, maltotriose utilisation is likely to rely on the SeMAL11 transporter in CBS 1483. Indeed, a MAL11-like transporter was recently shown to confer maltotriose utilisation in an S. eubayanus isolate from North Carolina [41].
The assembly also contained 14 FLO genes encoding flocculins which cause cell mass sedimentation upon completion of sugar consumption [34,42,43]. The heavy flocculation of S. pastorianus cells simplifies biomass separation at the end of the brewing process, and resulted in their designation as bottom-fermenting yeast [44]. Flocculation is mediated by flocculins: lectin-like cell wall proteins which effect cell-to-cell adhesion. In CBS 1483, we identified 12 flocculin genes, in addition to two FLO8 transcriptional activators of flocculins (Additional file 1e). Flocculation intensity has been correlated to the length of flocculin genes [45][46][47]. Specifically, increased length and number of tandem repeats within the FLO genes caused increased flocculation [47,48]. We therefore analysed tandem repeats in S. cerevisiae, S. eubayanus and S. pastorianus genomes and found that most FLO genes contain a distinct repeat pattern: two distinct, adjacent sequences each with variable copy number ( were unable to find clear repeat patterns for FLO10 and FLO11 in this genome. In S. pastorianus CBS 1483, the repeat lengths of FLO genes corresponded to the subgenome they were localized in (Table 2). Compared to the non-flocculent S288C and CBS 12357 strains, FLO genes were systematically shorter in CBS 1483, contrasting with available theory [42][43][44][45][46][47][48][49][50]. The intense flocculation phenotype of S. pastorianus was previously attributed to Table 2 Tandem repeat analysis in FLO genes. We found seven repeat sequences when analysing flocculation genes FLO1, FLO5, FLO9, FLO10, and FLO11 in S. cerevisiae (S288C) and S. eubayanus (CBS 12357) genomes. These sequences are referred to as sequence A (135 nt), B (15 nt), C (189 nt), D (45 nt), E (132 nt), F (156 nt), and G (30 nt). We used these sequences to analyse the copy numbers of each repeat within all FLO genes in our long-read-only assembly of CBS 1483 using the long-read-only S288C assembly as a control. Their respective copy numbers are shown below. Repeat sequences are indicated in Additional file 1 H a gene referred to as LgFLO1 [49,51,52]. However, alignment of previously published partial and complete LgFLO1 sequences did not confirm the presence of a similar ORF in CBS 1483. Moreover, the annotated FLO genes had higher identity with S. eubayanus and S. cerevisiae FLO genes, than with LgFLO1. Therefore, flocculation is likely to rely on one or several of the identified FLO genes from S. cerevisiae or S. eubayanus subgenomes ( Table 2).

Sequence heterogeneity in CBS 1483
As other Group 2 S. pastorianus strains, CBS 1483 displays heterozygosity between different copies of its S. cerevisiae subgenome [11]. We therefore systematically identified heterozygous nucleotides in its genome and investigated the ORFs with allelic variation. Using 159x coverage of pairedend Illumina library of CBS 1483, we found a total of 6367 heterozygous SNPs across the genome (Additional file 1f).
Although the heterozygous SNPs are present across the whole genome, they affect primarily the S. cerevisiae subgenome, with the majority clustered around centromeres (Fig. 2). Of these positions, 58% were located within ORFs, resulting in 896 ORFs with allelic variation consisting of 1 to 30 heterozygous nucleotides. A total of 685 ORFs showed heterozygosity which would result in amino acid sequence changes, including 16 premature stop codons, 4 lost stop codons and 1566 amino acid substitutions (Additional file 1f). Gene ontology analysis of the ORFs affected by heterozygous calls revealed no significant enrichment in processes, functions of compartments. However, it should be noted that several industrially-relevant genes encoded more than one protein version, such as: the BDH1 and BDH2 genes, encoding butane-diol dehydrogenases involved in reduction of the off flavour compound diacetyl [53], the FLO5 and FLO9 genes encoding flocculins [50], and the OAF1 gene encoding a regulator of ethyl-ester production pathway [54].

Structural heterogeneity in CBS 1483 chromosomes
We investigated whether information about structural heterogeneity between chromosome copies could be recovered despite the fact that current assembly algorithms reduce genome assemblies to consensus sequences. Information about structural and sequence variation between different chromosome haplotypes is not captured by consensus assemblies. However, raw read data contains information for each chromosome copy. To identify structural heterogeneity, we identified ORFs whose predicted copy number deviated from that of the surrounding region in the chromosome based on read coverage analysis (Additional file 4: Figure S3). We found 213 ORFs with deviating copy number (Additional file 1 G). While no enrichment was found by gene ontology analysis, many of these ORFs are located in subtelomeric regions [29].
Nevertheless, a few regions contained adjacent ORFs with deviating copy number, indicating larger structural variation between chromosome copies. For example, 21 consecutive ORFs in the right-end of the ScXV appear to have been deleted in 2 of the 3 chromosome copies (Additional file 4: Figure S3). UIP3, one of the genes with deviating copy number, was located on the right arm of chromosome ScI. This region was previously identified as having an additional copy in CBS 1483, although it could not be localized based on short read data [10]. The assembly graph showed two possible structures for ScI, which were collapsed into a single contig in the final assembly (Fig. 1a). Sequence alignment, gene annotations and sequencing coverage indicated two versions of the ScI contigs: one with and one without the gene UIP3 (Fig. 1b). Sequence alignments of raw-long-reads revealed five reads (from 20.6 to 36.7 Kbp) linking the right arm of ScI to the left arm of ScXIV at position~561 Kbp (Fig. 1c). This location corresponded to a Ty-2 repetitive element; known to mediate recombination within Saccharomyces genomes [21]. In addition to the increased coverage of the right arm of ScI, the left arm of ScXIV showed decreased sequencing coverage up until the~561 Kbp position. Together, these results suggest that the left arm of one copy of ScXIV was replaced with an additional copy of the right arm of ScI (Fig. 1d). As no reads covered both the recombination locus and the UIP3 locus, it remained unclear if UIP3 is present in the ScI copy translocated to chromosome ScXIV. The resolution of two alternative chromosome architectures of ScI and ScXIV illustrates the ability of long-read alignment to resolve structural heterozygosity.
Differences between group 1 and 2 genomes do not result from separate ancestry S. pastorianus strains can be subdivided into two separate groups-termed Group 1 and Group 2-based on both phenotypic [55] and genomic features [5,11]. However, the ancestral origin of each group remains unclear. The two groups may have emerged by independent hybridization events [19]. Alternatively, Group 1 and Group 2 strains may originate from the same hybridization event, but Group 2 strains later hybridized with a different S. cerevisiae strain [11]. In both cases, analysis of the provenance of genomic material from Group 1 and Group 2 genomes could confirm the existence of separate hybridization events if different ancestries are identified. Pan-genomic analysis of S. cerevisiae strains indicated that their evolution was largely non-linear, involving frequent horizontal gene transfer and sexual backcrossing events [56]. Especially if the evolutionary ancestry of S. pastorianus involves admixture of different S. cerevisiae genomes [11], approaches considering only linear evolution such as phylogenetic trees are insufficient [57]. Complex, non-linear evolutionary relationships could be addressed with network approaches [58]. However, such algorithms are not yet fully mature and would involve extreme computational challenges [59,60]. Therefore, we developed Alpaca: a simple and computationally inexpensive method to investigate complex non-linear ancestry via comparison of sequencing datasets [61]. Alpaca is based on short-read alignment of a collection of strains to a partitioned reference genome, in which the similarity of each partition to the collection of strains is independently computed using k-mer sets [61]. Reducing the alignments in each partition to k-mer sets prior to similarity analysis is computationally inexpensive. Phylogenetic relationships are also not recalculated, but simply inferred from previously available information on the population structure of the collection of strains [61]. The partitioning of the reference genome enables the identification of strains with high similarity to different regions of the genome, enabling the identification of ancestry resulting from non-linear evolution. Moreover, since similarity analysis is based on read data, heterozygosity is taken into account.
We used Alpaca to identify the most similar lineages for all non-overlapping 2 Kbp sub-regions in the genome of the Group 2 S. pastorianus strain CBS 1483 using a reference dataset of 157 S. cerevisiae strains [62] and 29 S. eubayanus strains [63]. We inferred population structures for both reference datasets by using previously defined lineages of each strain along with hierarchical clustering based on genome similarity using MASH [64]. For the S. eubayanus subgenome, almost all sub-regions of CBS 1483 were most similar to strains from the Patagonia B -Holartic lineage [63] (Fig. 3). In fact, 68% of all sub-regions were most similar to the Himalayan isolate CDFM21L.1 [65] and 27% to two highly-related North-American isolates (Additional file 5: Figure S4), indicating a monophyletic ancestry of the S. eubayanus genome. Analysis of S. pastorianus strains CBS 2156 and WS 34/70 (Group2), and of CBS 1503, CBS 1513 and CBS 1538 (Group 1), indicated identical ancestry of their S. eubayanus subgenomes (Additional file 5: Figure S4). Overall, we did not discern differences in the S. eubayanus subgenomes of S. pastorianus strains, which seem to descend from a strain of the Patagonia B -Holartic lineage and which is most closely related to the Himalayan isolate CDFM21L.1.
In contrast, for the S. cerevisiae sub-genome of CBS 1483, the most similar S. cerevisiae strains varied across the subregions of every chromosome ( Fig. 4 and Additional file 6: Figure S5). No strain of the reference dataset was most similar for more than 5% of sub-regions, suggesting a high degree of admixture ( Fig. 5 and Additional file 7: Figure S6). However, 60% of sub-regions were most similar to the Beer 1 lineage, 12% were most similar to the Wine lineage and 10% to the Beer 2 lineage [62]. In order to determine Alpaca's ability to differentiate genomes with different admixed ancestries, we analysed the genomes of 8 S. cerevisiae strains: six ale-brewing strains and the laboratory strains CEN.PK113-7D and S288C. The strains CBS 7539, CBS 1463 and A81062 were identified as similar to the Beer 2 lineage, CBS 1171 and CBS 6308 as similar to the Beer 1 lineage, CBS 1487 as similar to the Wine lineage, and CEN.PK113-7D and S288C as similar to the mosaic laboratory strains (Fig. 4 and Additional file 6: Figure S5). In addition, the distribution of similarity over the S. cerevisiae population tree differed per strain (Fig. 5 and Additional file 7: Figure  S6). While no single strain was most similar for more than 8% of the sub-regions for CBS 1487 and CBS 6308, for CBS 7539 67% of sub-regions were most similar to the strain beer002. As both beer002 and CBS 7539 are annotated as Bulgarian beer yeast [56,62], this similarity likely reflects common origin. The different similarity profiles of all S. cerevisiae strains indicate that Alpaca can differentiate different ancestry by placement of genetic material within the S. cerevisiae population tree, whether a genome has a linear monophyletic origin or a non-linear polyphyletic origin.
To identify possible differences in genome compositions within the S. cerevisiae subgenomes of S. pastorianus, we analysed other Group 1 and 2 strains using Alpaca, including an isolate of the Heineken A-yeast® lineage (Hei-A), which was isolated in 1886 and represents one of the earliest pure yeast cultures. Whole genome sequencing, alignment to the CBS 1483 assembly and sequencing coverage analysis revealed that the ploidy of the Hei-A isolate corresponds to that of a Group 2 strain (Additional file 8: Figure  S7). Analysis of Hei-A and the other S. pastorianus Group 2 strains CBS 2156 and WS 34/70 using Alpaca yielded almost identical patterns of similarity at the chromosomelevel as CBS 1483 ( Fig. 4 and Additional file 6: Figure S5). Moreover, similarity was distributed across the S. cerevisiae population tree almost identically as in CBS 1483 ( Fig. 5 and Additional file 7: Figure S6). The Group 1 S. pastorianus strains CBS 1503, CBS 1513 and CBS 1538 displayed different patterns of similarity at the chromosome-level relative to Group 2 strains. While various chromosome regions harboured almost identical similarity patterns, some regions differed significantly, such as: ScI, the middle of ScIV, the left arm of ScV, ScVIII, the right arm of ScIX, ScX-SeX, ScXI and ScXIII (Fig. 4 and Additional file 6: Figure S5). However, at the genome level, similarity was distributed across the S. cerevisiae population tree almost identically as in Group 2 strains, except for a slightly higher contribution of the Beer 2 and Wine lineages, at the expense of a lower contribution of the Beer 1 lineage (Fig. 5 and Additional file 7: Figure S6). The almost identical distribution of all Group 1 and Group 2 strains over the S. cerevisiae population tree indicate that they have the same S. cerevisiae ancestry. The spread of similarity across the S. cerevisiae population tree advocates for an admixed, possibly heterozygous ancestry of the S. cerevisiae subgenome of S. pastorianus. Furthermore, the different patterns of similarity at the chromosome level between both groups are compatible with an initially heterozygous S. cerevisiae subgenome which was subjected to independent loss of heterozygosity events in each group, resulting in differential retention of each haplotype. The lower relative contribution of Beer 1 strains in Group 1 strains may be explained by the complete absence of S. cerevisiae chromosomes with high similarity to Beer1 strains, such as ScV, ScXI and ScXv-ScXI.

Discussion
In this study, we used Oxford Nanopore MinION longread sequencing to study the genome of CBS 1483, an alloaneuploid Group 2 S. pastorianus strain. The presence of extensively aneuploid S. cerevisiae and S. eubayanus subgenomes substantially complicates analysis of S. pastorianus genomes [10]. We therefore explored the ability of longread sequencing to generate a reference genome in the Fig. 3 Tree-tracing of the genome-scale similarity across the S. eubayanus (sub-)genomes of Group 1 and 2 S. pastorianus strains, as determined using Alpaca. The frequency at which a genome from the reference data set of 29 S. eubayanus genomes from Peris et al [63] was identified as most similar for a sub-region of the CBS 1483 genome is depicted. The reference dataset is represented as a population tree, upon which only lineages with similarity are indicated with a thickness proportional to the frequency at which they were found as most similar ('N' being the total sum of the number of times all samples appeared as top-scoring). The complete reference population tree (a), the genomes of Group 1 strains CBS 1503, CBS 1513 and CBS 1538 (b-d) and for the genomes of Group 2 strains CBS 1483, CBS 2156 and WS34/70 (e-g) are shown. In the figure the strain CDFM21L.1 is labeld with its alias AS_2_4948 presence of multiple non-identical chromosome copies, and investigated the extent to which structural and sequence heterogeneity can be reconstructed. Despite its aneuploidy, we obtained a chromosome-level genome haploid assembly of CBS 1483 in which 29 of the 31 chromosomes were assembled in a single contig. Comparably to assemblies of euploid Saccharomyces genomes [22][23][24][25][26][27], long-read sequencing resulted in far lesser fragmentation and in the addition of considerable sequences compared to a shortread based assembly of CBS 1483, notably in the subtelomeric regions [10]. The added sequences enabled more complete identification of industrially-relevant subtelomeric genes such as the MAL genes, responsible for maltose and maltotriose utilisation [37][38][39], and the FLO genes, responsible for flocculation [34,42,43]. Due to the instability of subtelomeric regions [28][29][30], the lack of reference-based biases introduced by scaffolding allows more certainty about chromosome structure [24]. Since subtelomeric genes encode various industrially-relevant traits [31][32][33][34], their mapping enables further progress in strain improvement of lager brewing yeasts. Combined with recently developed Cas9 gene editing tools for S. pastorianus [66], accurate localisation and sequence information about subtelomeric genes is critical to investigate their contribution to brewing phenotypes by enabling functional characterization [67].
Despite the presence of non-identical chromosome copies in CBS 1483, the genome assembly only contained one contig per chromosome. While the assembly did not capture information about heterogeneity, mapping of shortread data enabled identification of sequence heterozygosity across the whole genome. In previous work, two alternative chromosome structures could be resolved within a population of euploid S. cerevisiae strain CEN.PK113-7D by alignment of long-reads [24]. Therefore, we evaluated the ability to identify structural heterogeneity by aligning long-read data to the assembly. Indeed, long-read alignments enabled the identification of two versions of chromosome ScI: with and without an internal deletion of the gene UIP3. Furthermore, the length of long-reads enabled them to span a TYelement, revealing that one of the copies of the right arm of ScI was translocated to the left arm of ScXIV. While the two alternative structures of ScI constitute a first step towards the generation of chromosome copy haplotypes, long-reads only enabled the hypothesis-based resolution of suspected heterogeneity. Assembly algorithms which do  not generate a single consensus sequence per chromosome are emerging [68,69]. However, haplotyping is particularly difficult in aneuploid and polyploid genomes due to copy number differences between chromosomes [68]. A further reduction of the relatively high error rate of long-reads, or the use of more accurate long-read sequencing technologies, could simplify the generation of haplotype-level genome assemblies in the future by reducing noise [70]. We used the chromosome-level assembly of CBS 1483 to study the ancestry of S. pastorianus genomes. Due to the importance of non-linear evolution in the domestication process of Saccharomyces strains [56], and to the admixed hybrid nature of S. pastorianus [11,63], we used the newly-developed method Alpaca to analyse the ancestry of CBS 1483 instead of classical phylogenetic approaches using reference datasets of S. cerevisiae and S. eubayanus strains [62,63]. All S. pastorianus genomes displayed identical distribution of similarity across the reference S. eubayanus population tree, both at the chromosome and whole-genome level. All S. pastorianus genomes also showed identical distribution of similarity across the reference S. cerevisiae population tree at the whole genome level; however, Group 1 and Group 2 strains displayed different similarity patterns at the chromosome level. The absence of differences in the S. cerevisiae genome at the whole genome level and recurrence of identical chromosomal break points between Group 1 and 2 strains discredit previous hypotheses of different independent hybridization events in the evolution of Group 1 and 2 strains [11,19]. Instead, these results are compatible with the emergence of Group 1 and 2 strains from a single shared hybridization event between a homozygous S. eubayanus genome closely related to the Himalayan isolate CDFM21L.1 and an admixed heterozygous S. cerevisiae genome with a complex polyphyletic ancestry. Loss of heterozygosity is frequently observed in Saccharomyces genomes [56,71], and therefore likely to have affected both the genomes of Group 1 and 2 strains [11,72,73]. The different chromosome-level similarity patterns in both groups likely emerged through different loss of heterozygosity events in Group 1 and 2 strains [72,73]. In addition, the lower S. cerevisiae chromosome content of Group 1 is consistent with observed loss of genetic material from the least adapted parent during laboratory evolution of Saccharomyces hybrids [74][75][76][77]. In this context, the lower S. cerevisiae genome content of Group 1 strains may have resulted from a rare and serendipitous event. For example, chromosome loss has been observed due to unequal chromosome distribution from a sporulation event of a allopolyploid Saccharomyces strain [78]. Such mutant may have been successful if loss of S. cerevisiae chromosomes provided a selective advantage in the lowtemperature lager brewing environment [74,75]. The loss of the S. cerevisiae subgenome may have affected only Group 1 strains due to different brewing conditions during their domestication. However, the high conservation of similarity within Group 1 and Group 2 strains indicate that the strains within each Group are closely related, indicating a strong population bottleneck in their evolutionary history.
Such a bottleneck could have been caused by the isolation and propagation of a limited number S. pastorianus strains, which may have eventually resulted in the extinction of other lineages. The first S. pastorianus strains isolated in 1883 by Hansen at the Carlsberg brewery were all Group 1 strains [13,79]. Due to the industry practice of adopting brewing methods and brewing strains from successful breweries, Hansen's Group 1 isolates likely spread to other breweries as these adopted pure culture brewing [1]. Many strains which were identified as Group 2 by whole genome sequencing were isolated in the Netherlands [5,11]: Elion isolated the Heineken A-yeast® in 1886 [80], CBS 1484 was isolated in 1925 from the Oranjeboom brewery [5], CBS 1483 was isolated in 1927 in a Heineken brewery [10], and CBS 1260, CBS 2156 and CBS 5832 were isolated from unknown breweries in the Netherlands in 1937, 1955 and 1968, respectively [5,81]. Analogously to the spread of Group 1 strains from Hansen's isolate, Group 2 strains may have spread from Elion's isolate. Both Heineken and Carlsberg distributed their pure culture yeast biomass to breweries over Europe and might therefore have functioned as an evolutionary bottleneck by supplanting other lineages with their isolates [82,83]. Overall, our results support that the differences between Group 1 and 2 strains emerged by differential evolution after an initial shared hybridization event, and not by a different S. eubayanus and/or S. cerevisiae ancestry.
(See figure on previous page.) Fig. 5 Tree-tracing of the genome-scale similarity across the S. cerevisiae (sub-)genomes of various Saccharomyces strains, as determined using Alpaca. The frequency at which a genome from the reference data set of 157 S. cerevisiae strains from Gallone et al [62] was identified as most similar for a sub-region of the CBS 1483 genome is depicted. The reference dataset is represented as a population tree, upon which only lineages with similarity are indicated with a thickness proportional to the frequency at which they were found as most similar ('n' being the total sum of the numxber of times all samples appeared as top-scoring). The genomes of S. pastorianus Beyond its application in this study, we introduced Alpaca as a method to evaluate non-linear evolutionary ancestry. The use of short-read alignments enables Alpaca to account for sequence heterozygosity when assessing similarity between two genomes and is computationally inexpensive as they are reduced to k-mer sets. Moreover, Alpaca leverages previously determined phylogenetic relationships within the reference dataset of strains to infer the evolutionary relationship of the reference genome to the dataset of strains. Due to the presence of non-linear evolutionary processes in a wide range of organisms [84,85], the applicability of Alpaca extends far beyond the Saccharomyces genera. For example, genetic introgressions from Homo neanderthalensis constitute about 1% of the human genome [86]. Horizontal gene transfer is even relevant across different domains of life: more than 20% of ORFs of the extremely thermophilic bacteria Thermotoga maritima were more closely related to genomes of Archaea than to genomes of other Bacteria [87]. Critically, horizontal gene transfer, backcrossing and hybridization have not only played a prominent role in the domestication of Saccharomyces yeasts [56], but also in other domesticated species such as cows, pigs, wheat and citrus fruits [88][89][90][91]. Overall, Alpaca can significantly simplify the analysis of new genomes in a broad range of contexts when reference phylogenies are already available.

Conclusions
With 29 of the 31 chromosomes assembled in single contigs and 323 previously unassembled genes, the genome assembly of CBS 1483 presents the first chromosome-level assembly of a S. pastorianus strain specifically, and of an alloaneuploid genome in general. While the assembly only consisted of consensus sequences of all copies of each chromosome, sequence and structural heterozygosity could be recovered by alignment of short and long-reads to the assembly, respectively. We developed Alpaca to investigate the ancestry of Group 1 and Group 2 S. pastorianus strains by computing similarity between short-read data from S. pastorianus strains relative to large datasets of S. cerevisiae and S. eubayanus strains. In contrast with the hypothesis of separate hybridization events, Group 1 and 2 strains shared similarity with the same reference S. cerevisiae and S. eubayanus strains, indicating shared ancestry. Instead, differences between Group 1 and Group 2 strains could be attributed to different patterns of loss of heterozygosity subsequent to a shared hybridization event between a homozygous S. eubayanus genome closely related to the Himalayan isolate CDFM21L.1 and an admixed heterozygous S. cerevisiae genome with a complex polyphyletic ancestry. We identified the Heineken A-yeast® isolate as a Group 2 strain. We hypothesize that the large differences between Group 1 and Group 2 strains and the high similarity within Group 1 and 2 strains result from a strong population bottleneck which occurred during the isolation of the first Group 1 and Group 2 strains, from which all currently known S. pastorianus strains descend. Beyond its application in this study, the ability of Alpaca to reveal non-linear ancestry without requiring heavy computations presents a promising alternative to phylogenetic network analysis to investigate horizontal gene transfer, backcrossing and hybridization.

Yeast strains, cultivation techniques and genomic DNA extraction
Saccharomyces strains used in this study are indicated in Table 3. S. pastorianus strain CBS 1483, S. cerevisiae strain S288C and S. eubayanus strain CBS 12357 were obtained from the Westerdijk Fungal Biodiversity Institute (http://www.westerdijkinstitute.nl/). S. eubayanus strain CDFM21L.1 was provided by Prof. Feng-Yan Bai. An isolate from the S. pastorianus Heineken A-yeast® lineage (Hei-A) was obtained from HEINEKEN Supply Chain B.V., Zoeterwoude, the Netherlands. All strains were stored at − 80°C in 30% glycerol (vol/vol). Yeast cultures were inoculated from frozen stocks into 500-mL shake flasks containing 100 mL liquid YPD medium (containing 10 g L − 1 yeast extract, 20 g L − 1 peptone and 20 g L − 1 glucose) and incubated at 12°C on an orbital shaker set at 200 rpm until the strains reached stationary phase with an OD 660 between 12 and 20. Genomic DNA

De novo genome assembly
The genome of CBS 1483 was assembled de novo using only the long-read sequencing data generated in this study. The assembly was generated using Canu [35], polished using Pilon [36] and annotated using MAKER2 [92], as previously described [24] with some modifications: Pilon (version 1.22) was only used to polish sequencing errors in the long-read-only de novo assembly, and Minimap2 [93] (version 2.7) was used as the longread aligner to identify potential misassemblies and heterozygous structural variants, which were visualized using Ribbon [94]. The resulting assembly was manually curated: (i) a contig of 24 Kbp comprised entirely of "TATATA" sequence was discarded; (ii) three contigs of 592, 465, and 95 Kbp (corresponding to the rDNA locus of the S. cerevisiae sub-genome) and complete sequence up and downstream of this locus were joined with a gap; (iii) four contigs corresponding to S. cerevisiae chromosome I (referred to as ScI) were joined without a gap into a complete 208 Kbp chromosome assembly (Fig. 2a); (iv) two contigs corresponding to ScXIV were joined with a gap (Fig. 2d) Comparison between long-read-only and short-read-only genome assemblies Gained and lost sequence information in the long-read assembly of CBS 1483 was determined by comparing it to the previous short-read assembly [10], as previously described [24] with the addition of using minimum added sequence length of 25 nt.

FLO gene analysis
We used Tandem Repeat Finder (version 4.09) [96] with recommended parameters to identify tandem repeat sequences in FLO1 (SGDID:S000000084), FLO5 (SGDID: S000001254), FLO8 (SGDID:S000000911), FLO9 (SGDID: S000000059), FLO10 (SGDID:S000001810), and FLO11 (SGDID:S000001458) of S. cerevisiae strain S288C [97] as well as in FLO1, FLO5, FLO8, FLO9, FLO10 and FLO11 of S. eubayanus strain CBS 12357 [25]. The resulting tandem repeat sequences were then used as proxies to characterize FLO genes in our assembly of CBS 1483, in a previously generated assembly of S. cerevisiae strain CEN.PK113-7D [24] and the Lg-FLO1 genes previously described in S. cerevisiae strain CMBSVM11 (GenBank HM358276) and S. pastorianus strain KBY001 (GenBank D89860.1) [51,52]. BLASTN (version 2.2.31+) [98] was then used to align the tandem sequences to each FLO gene. The alignments were further processed via an in-house script in the Scala programming language to identify repeat clusters by requiring a minimum alignment coverage of 0.5 and a maximum gap between two repeats of 3x times the repeat sequence length. The total number of copies was estimated by dividing the total size of the cluster by the repeat sequence length.

Intra-chromosomal heterozygosity
Sequence variation was identified by aligning the shortread Illumina reads generated in this study to the longread-only assembly with BWA [99] and calling variants with Pilon [36] using the --fix "bases"," local" and --diploid parameters. To restrict false positive calls, SNPs were disregarded within 10 Kbp of the ends of the chromosomes, if minor alleles had a frequency below 15% allele frequency, and if the coverage was below 3 reads. Copy number variation for all chromosomes were estimated by aligning all short-reads to the long-read-only assembly. Reads were trimmed of adapter sequences and low-quality bases with Trimmomatic [100] (version 0.36) and aligned with BWA [99] (version 0.7.12). The median coverage was computed using a non-overlapping window of 100 nt, copy number was determined by comparing the coverage to that of the chromosome with the smallest median coverage. Additionally, copy number variation at the gene-level was also investigated based on whether the coverage of an individual gene significantly deviated from the coverage of the surrounding region. First, we defined contiguous chromosomal sub-regions with fixed copy number (Additional file 10: Table S2). The mean and standard deviation of coverages of these sub-regions were then computed using long-read-only alignments. Mean coverages of every gene was then computed and an uncorrected Z-test [101] was performed by comparing a gene's mean coverage and the corresponding mean and standard deviation of the pre-defined sub-region that the gene overlapped with.
Similarity analysis and lineage tracing of S. pastorianus sub-genomes using alpaca We developed Alpaca [61] to investigate non-linear ancestry of a reference genome based on large sequencing datasets. Briefly, Alpaca partitions a reference genome into multiple sub-regions, each reduced to a k-mer set representation. Sequence similarities of the sub-regions are then independently computed against the corresponding sub-regions in a collection of target genomes. Non-linear ancestry can therefore be inferred by tracing the population origin of the most similar genome(s) in each sub-region. Detailed explanation Alpaca can be found in our method description [61].
Alpaca (version 1.0) was applied to the long-read CBS 1483 genome assembly to investigate the similarity of sub-regions from both sub-genomes to previously defined population lineages. For partitioning the CBS 1483 genome into sub-regions, we used a k-mer size of 21 and a sub-region size of 2 Kbp and used the short-read Illumina data of CBS 1483 produced in this study to assure accurate k-mer set construction. For investigating mosaic structures in the S. cerevisiae subgenome, we used 157 brewing-related S. cerevisiae genomes (project accession number PRJNA323691) which were subdivided in six major lineages: Asia, Beer1, Beer2, Mixed, West-Africa, Wine and Mosaic [62]. For the S. eubayanus subgenome, we used 29 available genomes (project accession number PRJNA290017) which were subdivided in three major lineages: Admixed, Patagonia-A, and Patagonia-B [63]. Raw-reads of all samples were trimmed Trimmomatic and filtered reads were aligned to CBS 1483 genome using BWA [99]. Alpaca was also applied to several Saccharomyces genomes to investigate evolutionary similarities and differences between Group 1 and Group 2 S. pastorianus genomes. We used Group 1 strains CBS 1503, CBS 1513, and CBS 1538, and Group 2 strains CBS 2156 and WS34/70 (project accession number PRJDB4073) [11]. As a control, eight S. cerevisiae genomes were analysed: ale strains CBS 7539, CBS 1463, CBS 1171, CBS 6308, and CBS 1487 (project accession number PRJEB13017) [56] and A81062 (project accession number PRJNA408119) [18], and laboratory strains CEN.PK113-7D (project accession number PRJNA393501) [24] and S288C (project accession number PRJEB14774) [23]. Similarly, raw-reads for all strains were trimmed with Trimmomatic and aligned to the long-read CBS 1483 genome assembly using BWA. Partitioning of the additional S. pastorianus and S. cerevisiae genomes with Alpaca was performed by deriving k-mer sets from read-alignments only, assuring direct one-toone comparison of all sub-regions across all genomes. K-mer size of 21 and sub-region size of 2 Kbp were used. The S. cerevisiae and S. eubayanus sequencing data were used to identify potential mosaic structures in these genomes. Lastly, S. cerevisiae and S. eubayanus strains were subdivided into subpopulations according to previously defined lineages [62,63]. MASH (version 2.1) [64] was then used to hierarchically cluster each genome based on their MASH distance using k-mer size of 21, sketch size of 1,000,000, and minimum k-mer frequency of 2. The resulting trees were used as population reference trees for Alpaca [61].
value represents the location in the corresponding chromosome and the y-value the estimated copy number based on coverage analysis of longread-only alignments. Dots in yellow indicate ORFs whose coverage significantly deviates from the surrounding genomic region, indicating copy number variation of the ORF within different copies of the same chromosome.
Additional file 5: Figure S4. Similarity profiles of the S. eubayanus (sub-)genomes of Group 1 and 2 S. pastorianus strains, as determined using Alpaca. Each S. eubayanus chromosome of the CBS 1483 assembly was partitioned in non-overlapping sub-regions of 2 Kbp. The colors represent the most similar lineages based on k-mer similarity of 29 S. eubayanus strains from Peris et al [63]: admixed (purple), Patagonia-A (red), Patagonia-B (blue). Similarity patterns are shown for the Group 2 strains CBS 1483, CBS 2156 and WS34/70 and the Group 1 strains CBS 1503, CBS 1513 and CBS 1538. Additional file 7: Figure S6. More complete version of the tree tracing of Fig. 5. Tree-tracing of the genome-scale similarity across the S. cerevisiae (sub-)genomes of various Saccharomyces strains, as determined using Alpaca. The frequency at which a genome from the reference data set of 157 S. cerevisiae strains from Gallone et al [62] was identified as most similar for a sub-region of the CBS 1483 genome is depicted. The reference dataset is represented as a population tree, upon which only lineages with similarity are indicated with a thickness proportional to the frequency at which they were found as most similar. In addition to the genomes of CBS 1513, CBS 1483, CBS 7539 and CBS 1171, the treetracing figures of S. pastorianus Group 1 strains CBS 1503 and CBS 1538, of S. pastorianus Group 2 strains CBS 2156, WS34/70 and Hei-A, and of S. cerevisiae strains CBS 1463, A81062, CBS 6308, CBS 1487, CEN.PK113-7D and S288C are shown.
Additional file 8: Figure S7. The ploidy estimates for each chromosome of the S. pastorianus isolate of the Heineken A-yeast® lineage based on alignment of short-read data to the chromosome-level assembly of CBS 1483.
Additional file 9: Table S1. Whole genome sequencing data generated in this study, under NCB Accession PRJNA522669.
Additional file 10: Table S2. Homozygous sub-regions of the S. cerevisiae sub-genome used for the SNP-based phylogenetic analysis.