Genome sequencing
The three Komagataella strains - (1) K. pastoris (NRRL Y-1603 or ATCC 28485), (2) K. phaffii ‘WT’ (NRRL Y-11430 or ATCC 76273) and (3) K. phaffii GS115 (ATCC 20864) were grown overnight in YPD (BD Difco, Cat. # 242820). DNA was extracted using the YeaStar Genomic DNA Kit (Zymo Research, Cat. # D2002) and RNA using the YeaStar™ RNA Kit (Zymo Research, Cat. # R1002). The extracted DNA was sequenced on the Pacific Biosciences (PacBio) single molecule real-time (SMRT) platform. Genomic DNA was also sequenced on Illumina HiSeq2000 from both fragment and jumping libraries. Illumina fragment libraries were generated as previously described [46] with the following modifications. For each sample, 100 ng of genomic DNA was sheared to 200 bp in size using a Covaris LE220 instrument (Covaris, MA) with the following parameters: temperature: 7–9 °C; duty cycle: 20 %; intensity: 5; cycles per burst: 200; time: 90 s; shearing tubes: Crimp Cap microTUBES with AFA fibers Covaris, MA). DNA fragments were end repaired, 3′ adenylated, ligated with indexed Illumina sequencing adapter, and PCR enriched, as previously described [47]. The resulting Illumina fragment sequencing libraries were normalized and were size selected to contain inserts of 180 bp ±3 % in length using a Pippen Prep system (Sage Science, MA) following the manufacturer’s recommendations. In jumping libraries, JUMP processing deletes the DNA in between the sections of interest that are far apart and combines them in order to be sequenced. Initial genomic DNA was sheared to get the sample to 5 kb in 150 ul. A caliper quality check was performed after end repair to insure proper shearing, and a critical circularization step was performed for 16 h. A second shear was performed to lower the size of the DNA to 500 bp after an exonuclease cleanup. Immobilization, a second end repair, an A-base addition, and PCR was performed for 18 cycles, which were all followed by washes. Adaptor ligation with Illumina paired end adapters also was performed before PCR to ensure the samples can be pooled before being sequenced. The multiple wash steps ensures clean PCR product is being loaded on sequencers.
Illumina sequencing libraries were quantified using quantitative PCR (KAPA Biosystems, MA) following the manufacturer’s recommendations. Libraries were normalized to 2 nM and denatured using 0.1 N NaOH.
Sequencing Flowcell cluster amplification was performed according to the manufacturer’s recommendations using the V3 TruSeq PE Cluster Kit and V3 TruSeq Flowcells (Illumina, CA). Flowcells were sequenced with 101 base paired end reads on an Illumina HiSeq2000 instrument, using V3 TruSeq Sequencing by synthesis kits and analyzed with the Illumina RTA v1.12 pipeline (Illumina, CA).
Genome assembly
Pacbio reads were assembled using a hierarchical genome-assembly process (HGAP) [48] and the Illumina paired-end reads were aligned [49] to the assemblies for error correction and assembly improvement using Pilon [50]. Assemblies were refined by manual curation. Pairwise alignments were carried out using BLAST (version 2.2.27). For each genome assembly, contigs were examined and removed if redundant (i.e. aligning to any other contig in the same assembly with >90 % identity). All contigs containing rDNA repeats were excluded from the above step. Large contigs were manually connected to construct telomere-to-telomere sequences and checked for consistency with the previously reported genome [12]. One gap in the wildtype K. phaffii assembly was closed by using corresponding and overlapping sequence from K. phaffii GS115 to bridge the missing segment. The validity of this bridging process was supported by manual examination of PacBio and Illumina sequences and raw sequencing reads that documented the manual junctions. The genomic sequencing data and assembled and annotated genomes are deposited at NCBI under bioproject accession numbers PRJNA304627 (K. pastoris), PRJNA304977 (K. phaffii wildtype), and PRJNA304986 (K. phaffii GS115).
Transcriptome analysis
The three Komagatella strains were grown in shake flask (30 °C, 250 rpm, μavg = 0.26) using complex glycerol-containing media (BMGY, Teknova, Cat. # B8000) to an OD600 of 2.0 (low density) or to an OD600 of 20 (high density). Following initial biomass accumulation, the cells were harvested by centrifugation and re-suspended in either glycerol- (BMGY), glucose- (YPD, BD Difco, Cat. # 242820) or methanol-containing media (BMMY, Teknova, Cat. # B8100). Samples were collected before changing media (0 h) and after resuspension into fresh media (6, 24 and 48 h). RNA was extracted from three independent cultivations for each time point sampled using the RNAeasy Kit (Qiagen, Cat. # 74104) and analyzed to ensure that RNA Integrity Number (RIN) score was >7. RNA sequencing libraries were constructed using the Truseq mRNA stranded HT kit (Illumina, Cat. # RS-122-2103) and sequenced on the Illumina NextSeq platform to generate 75-nucleotide paired-end reads at a read depth of at least 3 million reads per sample.
To assess the technical quality of RNA-seq reads for each condition sampled, each raw data set was down-sampled to 1 M paired-end reads and aligned to the assembly using BWA 0.7.5a. Then, Bedtools (version 2.17.0) was used to overlap the resulting alignments to the annotations to count the reads falling into genes, coding regions, intronic regions, 5′ or 3′ UTRs, flanking 3 kb genic regions and intergenic regions. Other basic statistics, including mapping rate, unique mapping rate, multiple mapping rate, number of perfect match reads, number of alignments with 1 or 2 mismatches and ratio of sense vs. anti-sense reads were also collected for each sample (see Additional file 3, Table S2 with quality control data for each RNA-seq sample). The complete data set set for each condition and time point was used for all analyses reported. RNA-seq data are deposited at NCBI under the bioproject accession number PRJNA304627.
Copy number analysis
DNA sequencing reads were down-sampled to ~2 M reads for both K. pastoris and wildtype K. phaffii WT samples; HMMcopy (version 0.1.1) was used to evaluate the copy number in 1000 bp windows. Mappability and GC content tracks were generated as control following the HMMcopy documentation. BWA (version 0.7.10) was used to map the reads to the reference (default options were used).
Genome annotation
For initial annotation purposes, the standard fungal annotation pipeline used by the Broad Institute Genome Sequencing Platform [22] was deployed on these genomes. Given the low proportion of spliced genes (<5 %), protein-coding gene predictions were more accurately made using a prokaryotic ab
initio tool, thus Prodigal [51] was employed. The rRNA loci were predicted by RNAmmer (version 1.2) [52] and tRNA by tRNAscan-SE (version 1.12) [53]. RNA was also sequenced on the Illumina platform (see above) and a subset of the resulting RNA-seq reads were assembled using Trinity (version r20140717) [54] and aligned to the genomes. PASA [22] was then run using the Trinity assembly alignments to update the prodigal genesets with splicing and UTR information. The genes were then filtered to remove all genes smaller than 300 nucleotides without evidence of overlap to either Hmmer, GeneWise (version 2.2.0) or RNA-seq data. Genes were then repeat-filtered using an in-house repeat-filtering pipeline (TPSI: e-value of 1e-10 and a minimum of 30 % Query Coverage; RepBase; repeat Hmmer domains; and Multihits: > = 8 times without non-repeat domains, minLength = 100, and percentId > = 90 %). Quality control was performed and genes were repaired so that none contained partial codons, in-frame stops, or unknown bases. Finally, protein-coding genes demonstrating 70 % overlap with non-coding genes (rRNA or tRNA annotations) were filtered out. Gene product was assigned based on BLAST best hit (E-value < 1e-10, version 2.2.25) against three databases in the following order of precedence: i) Swissprot (release 2011_03; at > =60 % identity and > =60 % query coverage, <30 % length difference), ii) TIGRfam (TIGRfam13), iii) KEGG (version 65; at > =60 % identity and > =60 % query coverage, and must have KO#). Genes without a match had their product defined as “hypothetical proteins”. Following initial gene identification, unmapped and intergenic RNA-seq reads from all culture conditions were subjected to a second Trinity assembly in an effort to identify conditionally dependent novel transcripts. The resulting transcripts from this second trinity run were included in subsequent annotation and gene expression analyses.
The genome annotations for each of the three strains were linked to the two other publicly-banked complete K. phaffii genomes with annotations [12, 14]. The refseq proteins from GS115_644223 and all proteins from CBS7435_981350 (both banked with NCBI) were used as separate BLAST (version 2.2.27) targets for all proteins from our three genomes. The best BLAST hit, as defined by highest bit score, per protein are reported for each genome (Additional file 4, Table S3).
Orthology assignment and gene naming
Sequence clustering at the mRNA level was used to identify orthologs between strains. For each pairwise strain comparison, mRNA sequences were pooled into a FASTA file and used as input for the application CD-HIT-EST (version 4.6.1) [55]. For comparison of orthology in wildtype K. phaffii versus K. phaffii GS115, a clustering threshold of 95 % identity was used. For comparison of orthology in K. pastoris versus K. phaffii, a lower clustering threshold of 80 % identity was used due to divergence between strains. Orthology percent identity analysis was derived from CD-HIT clustering results using custom scripts.
This automated analysis resulted in identification of 4,601 orthologous pairs between K. pastoris and K. phaffii. Manual analysis of the clustering using a custom script, TIBCO Spotfire and Integrated Genomics Viewer (version 2.3.72) resulted in annotation of an additional 48 ortholog pairs. Of the 595 remaining sequence clusters with unclear orthology, 33 contain only Trinity transcript groups of varying complexity. 129 have complex relationships, such as 2:1 (91 examples) or 2:2 (38 examples), typically caused by gene prediction artifacts during annotation of adjacent genes, including fragmented gene prediction or incomplete UTR annotation. 7 genes found in K. phaffii are not found in K. pastoris, those located on the linear plasmid. 398 clusters contain a single sequence that cannot be further associated by clustering at lower threshold and may represent species-specific genes. To address the possibility that these genes result from data contamination from other organisms, these 398 proteins were used to query Swissprot with BLASTP (2.2.27), but we observe no high-identity matches to any known proteins. The resulting alignments are consistent with species-specific genes, usually of fungal origin, rather than contamination with genetic material from other species.
For wildtype K. phaffii and K. phaffii GS115, 4,996 ortholog pairs were identified from the 5,169 sequence clusters by the automated analysis described above. (Note: the 5,169 sequence clusters do not correspond to genes, but are collections of related sequences.) Of the remaining 163 sequence clusters with unclear orthology, 30 contain only Trinity transcript groups of varying complexity and 46 clusters have complex relationships, such as 2:1 (25 examples), 2:2 (20 examples), or 3:2 (1 example). Forty-seven sequence clusters are found in wildtype, but not GS115, including the seven genes encoded on the linear plasmid; 40 clusters are found in GS115 but not wildtype. 50 of these 80 sequence clusters form 25 pairs of orthologs when a lower identity threshold is used during clustering with CD-HIT (85 % instead of 95 %). To address the remaining 30 sequences, TBLASTN (version 2.2.27) analysis was performed using proteins found in one strain to query the genome of the other strain. These analyses suggest that 1 wildtype genes and 13 GS115 genes are present in the other strain despite the lack of gene prediction during annotation. 1 additional wildtype gene (GQ67_00697) that is situated in a complex locus was unannotated in GS115, though underlying nucleotide sequences in the strains align with high identity. Alignment between the predicted protein from wildtype gene GQ67_04936 and the DNA sequence from GS115 indicates a frame shift; this gene may by inactivated in GS115. GS115 genes GQ68_05325 and GQ68_05326 are located near the terminus of Chr 4 in that strain, but a neighboring gene (GQ68_05329) has an ortholog ~200 Kb from the end of Chr 4 in wildtype indicating a small gene rearrangement between the strains.
Additional analysis with Kraken [56] (version 0.10.6) and the MiniKraken database, consisting of bacterial, archaeal and viral genomes, demonstrated a low degree of sequence contamination in our genomic reads at 0.19 % in K. phaffi WT, 0.37 % in K. phaffi GS115 and 3.23 % in K. pastoris, with the majority of contaminating sequences matching bacterial taxonomies. BLASTX (2.2.27) alignment of potentially contaminating reads to the predicted proteins from all 3 of our genomes yielded shortened or low identity results, indicating that these reads are non-identical to any predicted proteins in the genome annotations. These results indicate that sequence contamination does not contribute to gene prediction and annotation, nor to the sequence sets that lack clear orthology between strains.
Komagataella genes were associated with S. cerevisiae genes using BLAST-based approaches (version 2.2.27). The results were parsed and best hits were compared using a combination of custom scripts, Tibco Spotfire (version 6.5.3.12) and MySQL. Komagataella orthologs with consistent reciprocal best hits to S. cerevisiae genes were named according to the S. cerevisiae convention. Three thousand five hundred sixty-six Komagataella ortholog groups were thus named with S. cerevisiae gene names. An additional 30 genes associated either with flocculation [25], or central carbon metabolism [14, 26] (including the methanol utilization (MUT) pathway) were manually assigned, though these genes may not correspond to S. cerevisiae (see Additional file 5: Table S4 for a complete list of named orthologs).
Phylogenetic analysis
Multiple sequence alignment was carried out using alignments derived from ten randomly selected proteins found in all three Komagataella strains and with apparent 1:1 orthology relationships in any pairwise comparison between species. After gaps were removed, the phylogeny was constructed using a concatenated alignment and reliability was assessed using bootstrapping. Phylogenetic trees were calculated with neighbor-joining, distance-based, maximum likelihood and maximum parsimony methods using the Phylip (version 3.6.96) package of programs. The highest confidence clades having 100 % bootstrap support in all methods were highlighted.
Codon usage
Codon usage was determined using ANACONDA (version 2.0.1.15). The coding sequences were extracted from the genome annotations and used as input for ANACONDA. ANACONDA determines codon usage as Relative Synonymous Codon Usage (RSCU).
Gene expression analysis
RNA-seq analysis was performed using RSEM (version 1.2.15) with bowtie2 (version 2.2.3) and a transcriptome alignment target containing a combination of original annotated transcripts plus transcripts derived from de novo Trinity assembly. Gene and isoform FPKM and count data were processed for analysis with custom scripts and Tibco Spotfire (version 6.5.3.12). Count data for differential expression testing was done using DESeq (version 1.10.1) and R (version 2.15.3). Unsupervised hierarchical clustering (Ward’s Method) was used to examine data relationships for the three biological replicates performed for each cultivation condition. In all but 3 cases, consistent clustering of biological replicates were observed. For those 3 cases, (wildtype K. phaffii, 48 h Glucose Replicate 1; wildtype K. phaffii, 24 h Glycerol Replicate 1; and K. phaffii GS115, 48 h Glucose Replicate 3) the inconsistent replicate was excluded from downstream analyses. A table of raw expression data (log2 FPKM and integer count) for all genes and replicates included in the analyses of each of the three genomes is provided (Additional file 13: Table S12).
For a subset of cultivation conditions, RNA-seq was performed using RNA collected from a higher density cultivation (see above). Unsupervised hierarchical clustering suggested expression data were highly correlated between similar cultivation conditions sampled at two different densities. To confirm this correlation, a custom R script called bivariatetrelliscompact. R was prepared to calculate Pearson correlation coefficients between vectors of gene expression data for each strain, culture, and density condition. The input vectors were averages of 3 biological replicates from the initial Tophat-based processing of the data. The correlation coefficients obtained range between 0.969 and 0.995 (Additional file 2: Figure S14).
Alternative splicing analysis
Alternatively spliced isoforms that result in alterations to protein coding sequences were identified in the initial gene annotation using Bedtools (version 2.20.1); the expression values associated with these isoforms was extracted from the RSEM output (see above). Results were used for downstream expression analysis of alternatively spliced genes in various cultivation conditions. In order to address the scale of uncaptured alternative spliced isoforms, reads from two diverse low-density cultivations (Glycerol, 0 h and Methanol, 48 h) were aligned to the corresponded genomes using Tophat (version 2.0.12), allowing for novel exon junction discovery. Novel junctions that overlapped coding sequences and had >5 distinct supporting reads were identified using custom scripts and Bedtools (version 2.20.1). We defined this additional resulting exon-junction set as containing “potential CDS changing junctions” (Additional file 7: Table S6).
Linear plasmid mismatch rate analysis
PacBio reads from each strain were aligned to the linear plasmid found in wildtype K. phaffii with bwa-sw (version 0.7.10). Illumina reads were aligned with bwa-mem (version 0.7.10) and mismatch rates per 1000 bases per 1000 reads were calculated with custom scripts.
Analysis of expression data sampled from batch cultivation using Self Organizing Maps (SOMs)
Gene expression data processing was guided by previous methods [38]. A low-expression, low-variance filter was implemented to exclude genes that may not be expressed. Genes with average log2FPKM <1 and variance <0.5 across the 10 cultivation condition averages were excluded. No condition-specific fold change filter was implemented. The GenePattern module PreprocessDataset was used to row normalize the data by setting averages to 0 and variances to 1. The preprocessed data were then used as input to the GenePattern module SOMClustering with a cluster range of 2–50. Elbow analysis was performed to identify the optimal number of clusters that minimizes degeneracy, where the variance captured by additional clusters was less than 0.01. The corresponding odf file was selected for additional analysis. Representative profiles for each map were generated by averaging expression data at each time point within a given map.
Secretome identification and analysis
Potential secreted proteins were identified using Signalp (version 4.1) and custom processing scripts. Genes with a predicted signal peptide and having clear orthologs in all strains were used to create a gene set for single sample Gene Set Enrichment Analysis (ssGSEA) [42]. ssGSEA was performed using R (version 3.2.2) and scripts freely downloaded from the Broad GenePattern server (http://genepattern.broadinstitute.org/). The resulting ssGSEA projections were normalized using PreprocessDataset.
Mutational variant calling
Variant calling was carried out using the GATK Best Practices workflow [57]. Alignments were performed using Burrows-Wheeler Aligner (BWA-MEM, version 0.7.5a) [49]. K. phaffii GS115 Illumina sequencing reads were mapped to the wildtype K. phaffii PacBio genome assembly and wildtype K. phaffii Illumina sequencing reads were mapped to the K. phaffii GS115 PacBio genome assembly. PCR duplicates were removed using Picard (version 1.94) MarkDuplicates after sorting the sequences using SortSam. Samtools (version 0.1.19) was used for the first round of SNP and Indel calling. These high quality Indels and SNPs were then selected as the input for GATK Best Practice Indel local realignment and base quality recalibration steps (version 3.1.1). Variation calling output by GATK was annotated to identify non-synonymous SNPs using Snpeff (version 2.0.5d, http://snpeff.sourceforge.net) (Additional file 12: Table S11). This larger set of variations was further refined to a list of high confidence variations by identifying reciprocal genotype calls in both strains.
The potential function of these mutations was characterized in more detail using a two-fold approach. First, the variants were annotated with respect to protein domain using SMART (http://smart.embl-heidelberg.de/) and Pfam (http://pfam.xfam.org/). Second, conservation-based evaluation of the impact of the observed amino acid substitutions were scored using SIFT (http://sift.jcvi.org/). The reference orthologous protein sequences for use in SIFT analysis were obtained from the Fungal Orthogroups Repository (http://www.broadinstitute.org/regev/orthogroups/).