Genome-wide metabolic (re-) annotation of Kluyveromyces lactis
© Dias et al.; licensee BioMed Central Ltd. 2012
Received: 15 February 2012
Accepted: 6 August 2012
Published: 1 October 2012
Even before having its genome sequence published in 2004, Kluyveromyces lactis had long been considered a model organism for studies in genetics and physiology. Research on Kluyveromyces lactis is quite advanced and this yeast species is one of the few with which it is possible to perform formal genetic analysis. Nevertheless, until now, no complete metabolic functional annotation has been performed for the proteins encoded in the Kluyveromyces lactis genome.
In this work, a new metabolic genome-wide functional re-annotation of the proteins encoded in the Kluyveromyces lactis genome was performed, resulting in the annotation of 1759 genes with metabolic functions, and the development of a methodology supported by merlin (software developed in-house). The new annotation includes novelties, such as the assignment of transporter superfamily numbers to genes identified as transporter proteins. Thus, the genes annotated with metabolic functions could encode exclusively enzymes (1410 genes), exclusively transporter proteins (301 genes) or have both metabolic activities (48 genes). The new annotation produced by this work largely surpassed the currently available Kluyveromyces lactis annotations. A comparison with KEGG’s annotation revealed a match with 844 (~90%) of the genes annotated by KEGG, while adding 850 new gene annotations. Moreover, there are 32 genes with annotations different from KEGG.
The methodology developed throughout this work can be used to re-annotate any yeast or, with a little tweak of the reference organism, the proteins encoded in any sequenced genome. The new annotation provided by this study offers basic knowledge which might be useful for the scientific community working on this model yeast, because new functions have been identified for the so-called metabolic genes. Furthermore, it served as the basis for the reconstruction of a compartmentalized, genome-scale metabolic model of Kluyveromyces lactis, which is currently being finished.
Keywords: Genome annotation, Kluyveromyces lactis, Metabolic functions, Transport systems, Merlin
The yeast Kluyveromyces lactis (K. lactis) has long been considered a model organism for studies in genetics and physiology. As pointed out by Fukuhara in 2006, interest in this organism began in academia, mainly due to its ability to metabolize the beta-glycoside lactose and other properties such as its GRAS (generally regarded as safe) status. Biotechnological applications started to be investigated later and, as described in the report by van Ooyen et al. in 2006, recombinant protein expression has probably been the most widely explored application of K. lactis. There are reports that at least two of these proteins, namely prochymosin and lactase (or beta-galactosidase), reached industrial production [3, 4].
A common approach used by the scientific community active on K. lactis is to either literally work in parallel to or at least in comparison with Saccharomyces cerevisiae (S. cerevisiae). The Baker’s yeast is not only the best described Eukaryote (it was the first Eukaryote ever to have its genome completely sequenced), but it is also the most employed organism in industry, at least in terms of production volumes.
Energy metabolism is the physiological aspect that mostly distinguishes both species. While the Crabtree-positive yeast S. cerevisiae has a strong tendency to ferment, even under aerobic conditions, K. lactis is considered Crabtree-negative and preferably uses respiration for energy generation, unless oxygen becomes limiting[6, 7]. Another crucial difference between the two yeasts is that K. lactis, in contrast to S. cerevisiae, is not capable of growing under complete anaerobiosis.
Research on K. lactis (a.k.a. milk yeast) is quite advanced and includes aspects such as the glucose sensing and repression cascade[9, 10], the molecular basis for the Crabtree-negative characteristic of this yeast, the improvement of secretory pathways for heterologous protein expression[12, 13], the engineering of post-translational modifications with the aim of avoiding hypermannosilation of heterologous proteins, the oxidative stress response[15, 16], the molecular basis for the incapacity of growing anaerobically[8, 17], the description of its transcriptional regulators, and an exhaustive study of its cell wall. Remarkably, many of the physiological differences between K. lactis and S. cerevisiae seem related to the whole-genome duplication event, which affected S. cerevisiae, but not K. lactis.
One of the key aspects of research on K. lactis is the fact that most of the work performed in the past decades has been based on a single strain, namely CBS 2359 (a.k.a. NRRL Y-1140). This has facilitated enormously the interpretation of results and the interaction among laboratories throughout the world. Another important factor is that, in spite of all historical changes in terms of taxonomic methods, mainly the recent adoption of criteria purely based on gene sequences, K. lactis remains K. lactis, even after a recent redefinition of the Kluyveromyces and related genera[21, 22].
K. lactis is one of the few yeast species with which it is possible to perform formal genetic analysis. Additionally, due to some recent advances[19, 23, 24], molecular tools have been developed, facilitating the generation of mutants, a task which can now be considered as simple to perform with this yeast as it is with S. cerevisiae. Also, its full genome sequence was made available some years ago, allowing for the improvement of our understanding on eukaryotic genome evolution by comparing the genomes of different yeast species. Within this context, a number of works have been published on particular aspects of yeast genomes[26–33].
There are several reasons to re-annotate a genome, such as: new genes or protein functions being discovered, a research group trying to determine the reproducibility of an existing annotation, or just because the information associated to a specific organism is known to be out-dated. Thus, the re-annotation of a genome, especially for genes classified as hypothetical proteins, is very important for assuring an up-to-date gene annotation and not compromising future similarity alignments for newly sequenced genes.
Functional annotation can be defined as the inference and assignment of functions to genes or proteins. Such information is often obtained by similarity to formerly characterized sequences, found in several online or local databases. Likewise, the re-annotation process can be depicted as the annotation of a previously annotated gene or full genome[34, 35].
Although uncommon, there are some examples of genome-wide re-annotations, such as those of Campylobacter jejuni NCTC11168, Mycobacterium tuberculosis H37Rv, and Arabidopsis thaliana. All of the above annotations assigned new functions to genes that had been previously identified as “hypothetical proteins” and corrected some of the previous annotations.
A genome-wide metabolic functional annotation is a thorough effort which has the objective of trying to determine and label the genes involved in the metabolism of the organism of interest, skipping the regulatory and other genes annotation. Therefore, only the genes that encode enzymes or transporter proteins will be assigned with a function and included in this re-annotation.
The Kluyveromyces lactis genome does not have an official genome-wide functional metabolic or other annotation in the GenBank and Reference Sequences (RefSeq) databases (http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genome&Cmd=ShowDetailView&TermToSearch=17850). The annotation available in the GenBank files (ftp://ftp.ncbi.nih.gov/genbank/genomes/Fungi/Kluyveromyces_lactis_NRRL_Y-1140_uid12377 any *.gbk) only characterizes the gene products by applying the same code used for the gene identification, followed by a “p” instead of a “g”; for instance, /locus_tag="KLLA0A00132g" was assigned with /product="KLLA0A00132p". On the other hand, the RefSeq database (ftp://ftp.ncbi.nih.gov/genomes/Fungi/Kluyveromyces_lactis_NRRL_Y-1140_uid12377/ any *.gbk) assigns all proteins as hypothetical proteins. Nevertheless, all genes have descriptions in the GenBank “/notes” field. For example, the KLLA0A08492g gene is described as encoding a "conserved hypothetical protein", the KLLA0A08536g gene has "some similarities with uniprot|P25587 Saccharomyces cerevisiae YCL005W" and the KLLA0A08624g gene is "highly similar to uniprot|Q75ET0 Ashbya gossypii AAL002W AAL002Wp and similar to YCL001W uniprot|P25560 Saccharomyces cerevisiae YCL001W RER1 Protein…". Other genes have more explicit annotations, for instance gene KLLA0A00891g is described as "uniprot|P53768 Kluyveromyces lactis KLLA0A00891g HAP2 Transcriptional activator HAP2", KLLA0F13530g is a "uniprot|P49385 Kluyveromyces lactis ADH4 Alcohol dehydrogenase IV, mitochondrial precursor" and KLLA0D00231g is described as "uniprot|Q9Y844 Kluyveromyces lactis mal22 Maltase" (in agreement with the new annotation). However, these descriptions are not considered annotations, because relevant information, such as the gene product and, when available, the Enzyme Commission (EC) number, is not provided in most cases.
Furthermore, when available, such information should be delivered in the correct GenBank field (“/product” and “/EC number” instead of the “/notes” field) for easier manipulation using bioinformatics tools and user appraisal. Other databases such as KEGG (http://www.genome.jp/kegg/kegg2.html) perform metabolic annotations, with fairly acceptable results, though failing in some annotations and missing several genes with metabolic functions. The Universal Protein Resource (UniProt, http://www.ebi.ac.uk/UniProt/), in turn, is composed of two databases, Swiss-Prot and TrEMBL, which are curated and non-curated, respectively. The curated database provides information that was manually annotated and reviewed, even if it was obtained electronically. This database contains some information about the microorganism studied in this work, though it is somewhat scarce.
Hence, in this work we propose a genome-wide metabolic (re-)annotation of the proteins encoded in the Kluyveromyces lactis complete sequenced genome, identifying the genes involved in metabolites conversion and carriage throughout the cell, which is imperative for the reconstruction of a robust genome-scale metabolic model.
Genome-scale reconstructed metabolic models
Full genome sequences have been used, among many other applications, to reconstruct metabolic networks of different microorganisms such as Escherichia coli or Saccharomyces cerevisiae. This allows for the establishment of the so-called genome-scale metabolic models, which are developed bottom-up from the genome up to the reactions catalysed by the enzymes encoded in such set of genes. It is an iterative process that culminates in a reaction set that is used to simulate in silico the phenotype of the studied organism, under several environmental or genetic conditions. The use of such models has resulted in insight gaining and hypothesis testing, such as the enhancement of sesquiterpene production in Saccharomyces cerevisiae, the improvement of the production of succinic acid in Escherichia coli or finding new targets in drug research.
For the reconstruction of a robust genome-scale model, it is mandatory to have a proper annotation of the genome. For a metabolic model, all genes with metabolic roles, such as enzymes and transporters, have to be identified. The reconstruction of a metabolic model is a laborious and extensive process that has been described by Thiele and Palsson in 2010 as a 96 steps protocol, which takes a long time to be completed, depending on data availability. Such work also describes the first step “1| Obtain genome annotation” as a critical step, thus the importance of a robust annotation for the reconstruction process.
Although the genome of K. lactis has been publicly available for some years, a complete functional annotation was not made available to the public yet. In 2009, Souciet et al. re-annotated the genome of K. lactis, together with the sequencing and annotation of other yeast genomes, with the aim of performing comparative genomics. However, such annotation did not propose a functional annotation for each K. lactis gene. Here we present a work which identifies genes with metabolic functions and assigns functions to those genes, such as EC numbers, Transporter Classification Superfamily (TCS) numbers and Transporter Classification (TC) numbers. Whenever a complete EC number (‘class’.’subclass’.’sub-subclass’.’enzyme serial number’) was not available, a partial EC number was assigned to such enzymes (‘class’.’subclass’.’sub-subclass.-, ‘class’.’subclass’.-.- and ‘class’.-.-.-).
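The partial EC number convention described above can be illustrated with a short sketch. The function name `pad_ec_number` is hypothetical and is not part of merlin; it only shows how unknown trailing levels are written as dashes.

```python
def pad_ec_number(levels):
    """Build an EC number string from the known leading levels.

    `levels` holds between one and four known components, in order:
    class, subclass, sub-subclass, enzyme serial number.
    Unknown trailing levels are written as '-'.
    """
    if not 1 <= len(levels) <= 4:
        raise ValueError("an EC number has between one and four known levels")
    padded = [str(level) for level in levels] + ["-"] * (4 - len(levels))
    return ".".join(padded)


# Illustrative values only (not taken from the annotation itself).
print(pad_ec_number([1, 1, 1, 1]))  # complete EC number
print(pad_ec_number([1, 1, 1]))     # enzyme serial number unknown
print(pad_ec_number([1]))           # only the class is known
```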
The re-annotation of the proteins encoded in the K. lactis CBS 2359 metabolic genome was performed in a semi-automatic manner by combining the use of the software merlin, developed in-house and available for download (at http://www.merlin-sysbio.org), and manual inspection. The annotated genome of this organism brings some new insights on its capabilities and allowed the reconstruction of the Kluyveromyces lactis genome-scale metabolic model (currently being finalized). merlin’s dynamic annotation tool was used to perform first an automatic re-annotation of the complete genome followed by a manual curation of the enzymatic annotation. merlin’s transporter annotation tool was used to identify genes that encode transporter proteins, as well as the metabolites transported by such systems. In the end, a new, re-annotated, GenBank file was created by merlin for each K. lactis chromosome.
We believe that this re-annotation not only served as the basis for the assembly of a genome-scale metabolic model for K. lactis, but also provides relevant biological information for the scientific community dealing with this organism and yeasts in general.
Several online databases were used throughout this work. A brief description of each one is available below:
The first Basic Local Alignment Search Tool (BLAST) similarity search performed with merlin used All non-redundant sequences (including GenBank coding sequences translations, RefSeq Proteins, Brookhaven Protein Data Bank (PDB), SwissProt, Protein Information Resource (PIR), Protein Research Foundation (PRF) databases) (nrDB) available in the National Center for Biotechnology Information (NCBI) databases to find any protein sequence similar to translated K. lactis genes.
A second BLAST search used NCBI’s yeast database (yeastDB), which is a single curated set of Saccharomyces cerevisiae protein sequences available at the NCBI's RefSeq database.
The Entrez Protein (http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein) database is a collection of sequences from several sources, including GenBank CDS translations, RefSeq Proteins, SwissProt, PIR, PRF, and PDB. Entrez Protein provided all information that merlin retrieved for each Kluyveromyces lactis homologue gene.
The UniProtKB/Swiss-Prot (http://www.UniProt.org/) database is a manually curated protein sequences database which provides annotations with minimal redundancy and high level of integration with other databases. Thus, UniProtKB/Swiss-Prot was selected as a reference resource during the Kluyveromyces lactis genomic re-annotation.
The Saccharomyces Genome Database (SGD, http://www.yeastgenome.org/) project collects information and maintains a database of the molecular biology of the yeast Saccharomyces cerevisiae. This database includes a variety of genomic and biological information and is maintained and updated by curators. The SGD was selected as the second reference database for this project.
The Comprehensive Enzyme Information System BRaunschweig ENzyme DAtabase (BRENDA, http://www.brenda-enzymes.info/) provides enzyme functional data obtained directly from literature by professional curators. This database was used to confirm the information gathered in the previous two databases, thus being the third reference database selected for this work.
The Transporter Classification Database (TCDB, http://www.tcdb.org/) details a comprehensive classification system, approved by the International Union of Biochemistry and Molecular Biology (IUBMB), for membrane transporter proteins known as the Transporter Classification (TC) system. The TC system is analogous to the Enzyme Commission system for classification of enzymes, except that it incorporates both functional and phylogenetic information. This database was selected to annotate transporter proteins.
MEtabolic models reconstruction using genome-scaLe INformation (merlin)
In equation 1, the frequency score is related to the number of times a given function (EC number) appears in the set of homologues, and the taxonomy score is related to the taxonomic proximity between the studied organism and the organisms in which those functions had been identified. The user can choose to give more relevance to the frequency score or to the taxonomy score, just by altering the alpha value in merlin’s interface (see Additional file 1: Figure S1 of the supplemental material). If the user considers the frequency more relevant than the taxonomy of the homologue genes, the alpha value should be set between 0.5 and 1. If taxonomy is preferred over frequency, the value should be between 0 and 0.5. In this work, the α value was set to 0.2, so that the yeasts’ annotations could be given more relevance than other organisms’ annotations.
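The role of the α value can be sketched in a few lines of code. The linear weighting used here, α × frequency score + (1 − α) × taxonomy score, is an assumption consistent with the description above rather than a transcription of merlin's equation, and the function names, candidate labels and score values are purely illustrative.

```python
def annotation_score(frequency_score, taxonomy_score, alpha):
    """Assumed weighting: alpha * frequency + (1 - alpha) * taxonomy.

    Both input scores are taken to lie between 0 and 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * frequency_score + (1.0 - alpha) * taxonomy_score


def best_candidate(candidates, alpha):
    """Return the candidate function with the highest combined score."""
    return max(candidates, key=lambda name: annotation_score(*candidates[name], alpha))


# Hypothetical candidates: (frequency score, taxonomy score).
candidates = {
    "EC_A": (0.9, 0.3),  # frequent among homologues, mostly distant organisms
    "EC_B": (0.4, 0.8),  # rarer, but found in closely related organisms
}

print(best_candidate(candidates, alpha=0.2))  # taxonomy weighted more heavily
print(best_candidate(candidates, alpha=0.8))  # frequency weighted more heavily
```

With a low α, as used in this work, the candidate supported by taxonomically close organisms is preferred even when it is less frequent among the homologues.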
However, in this work merlin's automatic annotation was fully reviewed to maximize the re-annotation confidence.
Moreover, merlin’s interface was used throughout the (re)annotation process to assign functions and protein names to each metabolic gene. merlin’s interface is particularly user friendly, providing “drop down boxes” (see Additional file 1: Figure S1 of the supplemental material) for the annotation of each gene. merlin allows exporting the annotation as an Excel file or in the GenBank format, during or after the end of the annotation process.
Identification of genes that encode enzymes
To retrieve enzymatic information, merlin performs remote BLAST similarity searches against the NCBI databases. When the purpose of such searches is to retrieve metabolic information for a genome re-annotation, the raw output of a BLAST similarity search can be too minimalistic and very confusing. Anyone who has tried one of the many BLAST search tools available on the internet (such as http://blast.ncbi.nlm.nih.gov/Blast.cgi or http://www.UniProt.org/blast/) knows that the output of a BLAST search is not very helpful for the collection of metabolic data (see Additional file 1: Figure S2 of the supplemental material): the user has to go over all identified homologue genes, follow several links to retrieve the enzymatic information, and compile such information for all genes of the studied genome. To avoid such massive effort, merlin was used to perform the remote similarity alignments between the user's set of genes (or the full genome, as was the case) and the previously selected remote NCBI database, as well as to retrieve and classify each homologue’s annotation, providing comprehensible information.
Then merlin performed the remote BLAST similarity search, configuring the algorithm with the parameters also depicted in the first step of the figure. At the time of the similarity search (January 2010) the nrDB was a collection of 10,140,583 sequences and the yeastDB encompassed 6298 sequences.
The program used to perform the remote BLAST search was blastp (version 2.2.22+ at the time of the search). The e-value is used to create a significance threshold for returning results. A lower e-value threshold results in a shorter list of higher-quality homologues; thus, the maximum e-value threshold was set to 1E-30.
The matrices referred to in Figure 1 are parameters of the BLAST algorithm and are used to evaluate the quality of a pairwise sequence alignment by assigning scores for the alignment of any possible pair of residues. BLOSUM 62 was used as the default matrix for the similarity search algorithm configuration and was changed to PAM30 for the shorter sequences that could not be aligned with the first matrix. merlin takes approximately 24 h to automatically assign a functional annotation to every protein encoded in a given genome, depending on the NCBI servers’ availability and the genome size.
For each Kluyveromyces lactis gene, the top 100 most similar homologues were retrieved and the information displayed in Figure 1 – Step 2 was collected. If fewer than 100 homologues were available, only those were processed. Afterwards, merlin accessed the Entrez Protein webservice to download and save several data for each homologue acquired in the previous step. Such data are listed in Figure 1 – Step 3.
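The hit selection just described (an e-value threshold of 1E-30 and at most 100 homologues per gene) can be sketched as follows. This is not merlin's code: the `Hit` record and `select_homologues` function are hypothetical, and ranking the hits by e-value is an assumption about what "most similar" means here.

```python
from typing import NamedTuple

MAX_E_VALUE = 1e-30   # significance threshold stated in the text
MAX_HOMOLOGUES = 100  # number of top hits kept per gene, as stated in the text


class Hit(NamedTuple):
    accession: str
    e_value: float


def select_homologues(hits, max_e_value=MAX_E_VALUE, limit=MAX_HOMOLOGUES):
    """Keep hits below the e-value threshold, best (lowest e-value) first."""
    kept = [hit for hit in hits if hit.e_value < max_e_value]
    kept.sort(key=lambda hit: hit.e_value)
    return kept[:limit]


# Hypothetical hits for a single gene.
hits = [Hit("hit_a", 1e-50), Hit("hit_b", 1e-10), Hit("hit_c", 1e-40)]
for hit in select_homologues(hits):
    print(hit.accession, hit.e_value)
```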
Using internal heuristics, described in and briefly represented above in equation 1, merlin automatically selected a candidate annotation for each protein encoding gene of the studied genome based on confidence scores. The similarity result (gene product, EC number) with the highest confidence score was selected by merlin to automatically annotate each protein encoding gene of the studied genome. Moreover, merlin reduced the curation efforts, as it allows the user to browse through all similarity search results and change the automatic annotations provided by the software.
When the first automatic annotation results were analysed, a pattern emerged. The homologues’ taxonomic distribution was, as will be shown in the Results section, biased. Indeed, whenever a Saccharomyces cerevisiae homologue was available, merlin would consistently select the baker’s yeast gene annotation to annotate the Kluyveromyces lactis gene. Thus, the baker’s yeast was selected as a reference organism for the EC numbers annotation because the two microorganisms share the phylogenetic lineage all the way to the taxonomic family level and S. cerevisiae is the best studied, annotated and curated Fungus. Hence, two projects were initiated with merlin, allowing the software tool to use all data available in the NCBI database (nrDB) to annotate the Kluyveromyces lactis genome in the first project, while for the latter project only data from the NCBI’s yeastDB were used. Each K. lactis gene assigned by merlin with enzymatic functions on either the first or the second similarity search was labelled as an enzyme encoding gene candidate (EEGC).
The second set encompassed those genes which were identified as EEGC’s on the first BLAST search (nrDB assigned) but not on the second similarity alignment. Such set presented a high number of genes that, although being automatically annotated with metabolic functions, were later discarded by the annotation pipeline depicted in Figure 3 (false positives).
The third set of genes was the most troublesome. It was the group of genes assigned with different enzymes on each merlin project (distinct). Such collection was carefully reviewed, with the purpose of selecting the correct gene function without reservations.
The last set (yeastDB assigned) encompassed milk yeast EEGC’s which were not automatically annotated as enzymes by merlin in the first alignment, but when the search was performed against NCBI’s yeastDB, at least one Saccharomyces cerevisiae metabolic homologue was identified for each K. lactis gene. merlin did not assign any annotation on the first similarity search probably because each of those K. lactis EEGC had more than 100 homologues in organisms other than S. cerevisiae on such alignment.
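The partition of the EEGC’s according to the outcome of the two merlin projects can be sketched as a simple comparison. The function name is hypothetical, and the label "identical" for genes that received the same enzymes in both projects is an assumption; the other three labels follow the text.

```python
def classify_eegc(nr_ec_numbers, yeast_ec_numbers):
    """Classify a gene from the EC numbers assigned in each merlin project.

    Each argument is a collection of EC numbers (empty when the project
    assigned no enzymatic function). Returns None when the gene is not an
    enzyme encoding gene candidate in either project.
    """
    nr = frozenset(nr_ec_numbers)
    yeast = frozenset(yeast_ec_numbers)
    if not nr and not yeast:
        return None
    if nr and not yeast:
        return "nrDB assigned"
    if yeast and not nr:
        return "yeastDB assigned"
    if nr == yeast:
        return "identical"  # assumed label for matching assignments
    return "distinct"


# Illustrative EC numbers only.
print(classify_eegc({"1.1.1.1"}, set()))
print(classify_eegc({"1.1.1.1"}, {"1.1.1.2"}))
```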
The EEGC’s were manually verified by following several confirmation steps as depicted in the functional annotation pipeline (Figure 3). The described methodology can be recurrently executed, re-annotating a given genome whenever the user wants to, taking advantage of the up-to-date information available in NCBI remote BLAST databases.
Despite using merlin, all of the Kluyveromyces lactis functional EEGC’s automatic assignments were reviewed according to the schema depicted in Figure 3, so that the minimum number of false positives would be included in this annotation. For that purpose, the main criteria were, in first priority, the existence of information in curated databases for the K. lactis genes and, in second priority, the existence of curated S. cerevisiae homologues. Only when none of the previous information was available was the search extended to curated homologues of other organisms.
Initially, for each EEGC, a query was performed in UniProt, using the gene locus identifier (locus tag), to assess the existence of a reviewed annotated record for such gene. If UniProt had already identified such gene’s product on a reviewed record, or any literature was available and confirmed the proposed gene annotation, the assignment was accepted and the gene was annotated (after EC number confirmation in BRENDA – Figure 3-C).
On the other hand, if UniProt had no reviewed match for such gene, then a S. cerevisiae gene was sought in the BLAST hits (Figure 3-B) kept by merlin for such gene. So, if a baker’s yeast homologue was available, its identifier (YXX####x) was searched in both UniProt and SGD databases. After the analysis of the UniProtKB/Swiss-Prot and the SGD entries two situations could arise (Figure 3-B1): the records could be either identical or distinct. When identical, the gene was annotated; otherwise, the records would be thoroughly examined and the SGD entries would always be favoured. As explained above, both UniProtKB/Swiss-Prot and SGD are manually curated databases, thus both results are reliable. Nevertheless, the SGD is favoured when a conflict arises between both databases because it is specific for Saccharomyces cerevisiae, and consequently the curators of this database are specialized in the analysis of the baker’s yeast genome. Hence, if the similarity between the K. lactis and the S. cerevisiae gene sequences is acceptable (e-value < 1E-30) the K. lactis gene is considered homologous to the baker’s yeast one and the first is assigned with the same function as the latter.
For the EEGC’s that did not have any S. cerevisiae homologue (Figure 3-B2), a specific similarity search was performed in the NCBI BLAST web interface, restraining the possible outcomes to Swiss-Prot reviewed records and the organism to the 4932 taxID (Saccharomyces cerevisiae). This step was performed because merlin’s scorer was configured to calculate the function scores using the first 100 homologues retrieved from the BLAST similarity search. However, the S. cerevisiae homologue could rank below the first 100 hits. When performing this specific homology search, the number of hits is considerably reduced, thus the acceptable e-value threshold is relaxed to e < 1E-10. If there was an entry that complied with the previous conditions, the gene was annotated; otherwise, the BLAST similarity search was unrestricted, organism wise. Again, if there was an entry that complied with the previous conditions, the gene was annotated as homologue of the first hit; otherwise it was discarded.
Whatever the source of the candidate enzyme assigned to a given gene, such information was revised in BRENDA to verify the function about to be annotated to such gene (Figure 3-C). Some of the enzymes encoded in the genome were assigned with partial EC numbers by the studied databases. BRENDA was also used to try to identify complete EC numbers for such genes, by searching for the names of those gene products in that database.
Finally, the information collected in the previous steps is assigned to the EEGC, as depicted in Figure 3-D, rendering the EEGC a metabolic gene or discarding such gene as metabolic.
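The order of the manual curation decisions can be summarised in a small sketch. The curation itself was manual, so this code only restates the priorities described above; the `Evidence` record, its field names and the returned labels are hypothetical, the EC numbers are placeholders, and the BRENDA confirmation step is left out. The text does not say what happens when only one of the UniProt or SGD records exists for a homologue, so the sketch assumes both are present in that branch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Evidence:
    """Candidate functions found for one EEGC (None when nothing was found)."""
    klactis_reviewed: Optional[str] = None       # reviewed UniProt record or literature for the K. lactis gene
    scer_uniprot: Optional[str] = None           # UniProtKB/Swiss-Prot entry of the S. cerevisiae homologue
    scer_sgd: Optional[str] = None               # SGD entry of the same homologue
    scer_restricted_blast: Optional[str] = None  # reviewed Swiss-Prot hit, taxID 4932, e < 1E-10
    unrestricted_blast: Optional[str] = None     # reviewed Swiss-Prot hit, any organism, e < 1E-10


def curate(evidence: Evidence) -> Tuple[Optional[str], str]:
    """Return (assigned function, reason), following the priorities in the text."""
    if evidence.klactis_reviewed:
        return evidence.klactis_reviewed, "K. lactis reviewed record"
    if evidence.scer_uniprot and evidence.scer_sgd:
        if evidence.scer_uniprot == evidence.scer_sgd:
            return evidence.scer_sgd, "UniProt and SGD agree"
        return evidence.scer_sgd, "SGD favoured over UniProt"
    if evidence.scer_restricted_blast:
        return evidence.scer_restricted_blast, "restricted BLAST (S. cerevisiae)"
    if evidence.unrestricted_blast:
        return evidence.unrestricted_blast, "unrestricted BLAST (first hit)"
    return None, "discarded"


print(curate(Evidence(scer_uniprot="2.7.1.1", scer_sgd="2.7.1.2")))
print(curate(Evidence()))
```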
Classification of manual curation results
When using the annotation pipeline to analyse the EEGC’s, a limited number of logical jumps were detected. Therefore, an alpha-numeric cross classification system was developed to log and identify the gene classification patterns, encompassing the origin of the entry chosen in the final annotation (nrDB or yeastDB) and the database(s) that provided the information that motivated the choice made. A detailed description of such classification is available in Additional file 2 of the supplemental material.
Identification of genes that encode transporter proteins
Only four Kluyveromyces lactis genes are available in TCDB as transporter protein encoding genes (see Additional file 3: Table S1 of the supplemental material). Therefore, it was necessary to implement a methodology to further identify transporter proteins using homology analysis.
Although merlin uses remote BLAST similarity searches to classify gene products, the transporter information is obtained by performing local Smith-Waterman (SW) similarity alignments with the TCDB, to identify the TCS (Transporter Classification Superfamily) number of the genes that encode transporter proteins. This methodology was also developed in-house and will be included in merlin’s 2.0 version. An article with the detailed description of this methodology (Genome-wide semi-automated annotation of transporter systems, Dias et al., 2012) has been recently submitted.
Unlike enzymes, transporter proteins cannot be directly classified from homology. Enzymes are represented by EC numbers that classify the catalysed reactions and a gene can be annotated with several EC numbers. TC numbers are associated to proteins that transport a specific range of substrates and are often associated to a single gene. For example, a gene that encodes a carrier that is able to transport a range of substrates is assigned with a single TC number and not a range of TC numbers, as is the case with EC numbers. TC numbers are grouped in TC families. For example, the 2.A.1.1 – The Sugar Porter (SP) Family encompasses transport proteins that transport sugars. Likewise, TC families are grouped in TCS. For example, the 2.A.1 – The Major Facilitator Superfamily (MFS) includes the 2.A.1.1 – The Sugar Porter (SP) Family, the 2.A.1.2 – The Drug:H+ Antiporter Family and several other families. Therefore, for the classification of the genes that encode transporter proteins, the approach was somewhat different and is concisely described next.
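The grouping of TC numbers into families and superfamilies can be made concrete with a short sketch. It follows the usage in this text, where the first three components (for example 2.A.1) identify the superfamily and the first four (for example 2.A.1.1) identify the family; the function name `tc_levels` is hypothetical and the five-component inputs are illustrative.

```python
def tc_levels(tc_number):
    """Split a TC number into the superfamily and family it belongs to."""
    parts = tc_number.strip().rstrip(".").split(".")
    if len(parts) < 4:
        raise ValueError("expected at least four components, e.g. 2.A.1.1")
    return {
        "superfamily": ".".join(parts[:3]),  # TCS number, e.g. 2.A.1
        "family": ".".join(parts[:4]),       # TC family number, e.g. 2.A.1.1
    }


# Two transporters from different families of the same superfamily.
print(tc_levels("2.A.1.1.1"))
print(tc_levels("2.A.1.2.1"))
```

Going up one level in this way, from the family to the superfamily, places both example transporters in the same group, which is the less restrictive classification adopted for the annotation.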
The process of performing genome-wide similarity searches using the SW algorithm, despite being more accurate than BLAST, can be very time-consuming, as such alignments are very demanding. Therefore, the number of K. lactis genes aligned against TCDB was reduced via the TransMembrane prediction using Hidden Markov Models (TMHMM) software. TMHMM is a prediction algorithm that identifies the number of transmembrane helices in a protein using hidden Markov models.
Thus, all genes that had one or more transmembrane helices were considered transporter protein encoding gene candidates (TPGC) and were aligned to the TCDB. The similarity threshold, when performing the SW similarity searches, was 10%, because the transporter database was very small (6100 records at the time of the alignment – September 2011). Moreover, merlin uses internal heuristics to lower the threshold, inversely to the number of transmembrane helices of the gene.
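The pre-filtering step can be sketched as a simple selection over predicted helix counts. The input dictionary, the gene labels and the function name are hypothetical; parsing of actual TMHMM output is not shown, and the heuristics that lower the similarity threshold are not reproduced because they are not detailed here.

```python
def select_tpgc(helix_counts, min_helices=1):
    """Return the genes with at least `min_helices` predicted transmembrane helices.

    `helix_counts` maps a gene identifier to the number of helices
    predicted for its protein (for example by TMHMM).
    """
    return sorted(gene for gene, count in helix_counts.items() if count >= min_helices)


# Hypothetical predictions for three genes.
helix_counts = {"gene_1": 0, "gene_2": 12, "gene_3": 1}
print(select_tpgc(helix_counts))  # only these genes would be aligned against TCDB
```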
A TPGC can have similarities to different families and super-families of the same TC class that can nevertheless have similar functions. Thus, the TC family numbers, as well as the metabolites, of the TCDB genes similar to each TPGC were classified with the same algorithm used by merlin to classify the EC numbers of each EEGC. Such algorithm classified the TC family numbers and metabolites associated to each TPGC, using the taxonomy of each of the TCDB homologue genes and the frequency of the TC family numbers or metabolites, within all similar genes. In the end of this process, each gene identified as a TPGC was either discarded (not considered a transporter protein) or effectively annotated as a transporter protein encoding gene. In the latter case, a TCS number, as well as the metabolites transported by such protein, were assigned to each transporter protein encoding gene. Since it was considered that the transporter family number could be too restrictive, it was decided to go up a level and the TCS number was chosen instead.
Results and discussion
The annotation pipeline for genes that encode enzymes (described in Figure 3 of the Methods section) reviewed a total of 1699 EEGCs, and the transporter annotation function within merlin provided 349 genes. However, 48 genes identified as transporter system encoding genes were also annotated with EC numbers by the annotation pipeline. Hence, such genes were annotated with both transport (TCS or TC numbers) and reaction facilitation (EC numbers) activities.
The annotation pipeline ruled out 241 K. lactis EEGCs as non-metabolic genes, because the implemented routine suggested that the homologues of such EEGCs were either wrongly assigned as similar to K. lactis or incorrectly annotated. The other 1458 genes were confirmed and annotated as metabolic genes.
As depicted in Figure 4, those 1458 genes were annotated with at least one EC number, and a further 301 genes were annotated exclusively as transporters, being assigned TCS or TC numbers. Summing up, 1759 genes were classified as metabolic genes, of which 1410 are exclusively enzymatic, 301 are exclusively transporter proteins and 48 have both functions.
The final annotation of each EEGC is available in Additional file 3: Table S2 of the supplemental material.
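The gene counts reported above can be cross-checked with a few lines of arithmetic. The variable names are illustrative; all figures are taken from the text, including the 2000 candidate genes and the 87.95% quoted in the conclusions.

```python
eegc_reviewed = 1699        # enzyme encoding gene candidates reviewed
eegc_ruled_out = 241        # candidates ruled out as non-metabolic
transporter_genes = 349     # genes provided by the transporter annotation
both_functions = 48         # genes with both EC and TC(S) numbers

enzymatic_confirmed = eegc_reviewed - eegc_ruled_out                 # 1458
exclusively_enzymatic = enzymatic_confirmed - both_functions         # 1410
exclusively_transport = transporter_genes - both_functions           # 301
total_metabolic = (exclusively_enzymatic + exclusively_transport
                   + both_functions)                                 # 1759

# Distinct candidates studied: the two candidate sets minus their overlap.
candidates_studied = eegc_reviewed + transporter_genes - both_functions  # 2000
share_metabolic = 100 * total_metabolic / candidates_studied             # 87.95
```

The figures are mutually consistent, provided that the 48 genes with both functions are counted once when the two candidate sets are merged.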
The Kluyveromyces lactis genome had been sequenced by the Génolevures consortium; however, the genes identified by the consortium were not assigned with EC or TC numbers. Also, despite holding the genome sequencing data, GenBank does not provide any functional annotation. Thus, the new annotation provided by this work was compared with the data available in KEGG, UniProt and to a lesser extent with BRENDA and TCDB.
[Table: Comparison of the results reached in this work and previous annotations available. Recoverable cells: BRENDA EC #; # of genes; 34* (38)**]
Comparison with KEGG
The comparison between the new annotation and KEGG’s annotation is depicted in Additional file 3: Table S3 of the supplemental material. The new annotation matched 844 (~90%) of the genes annotated by KEGG, adding 850 new gene annotations. Moreover, there are 32 genes with annotations different from KEGG.
Also, 19 genes were assigned more enzymes in the present annotation than in the KEGG annotation. For instance, KEGG annotates the KLLA0B02717g gene with the EC number 220.127.116.11. However, our new annotation assigns 6 EC numbers (18.104.22.168, 22.214.171.124, 126.96.36.199, 188.8.131.52, 184.108.40.206, 220.127.116.11) to this gene; thus, KEGG’s annotation is not incorrect but is a subset of the present study’s annotation. On the other hand, 9 genes were assigned more enzymes in KEGG than in the present annotation. Finally, the annotation pipeline ruled out 29 genes annotated as metabolic in KEGG, for several reasons. For instance, KEGG assigns the EC number 18.104.22.168 to KLLA0C11341g, whereas the new annotation identified this gene as an “Accessory subunit of DNA polymerase zeta” with no catalytic activity. KEGG assigns the EC number 22.214.171.124 to KLLA0C08041g; however, the new annotation identified that gene as a general negative regulator of transcription. These two, along with the 27 other ruled out genes, are described in Additional file 3: Table S4 of the supplemental material.
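The kinds of outcome described in this comparison can be expressed as set relations between the EC numbers assigned to a gene by each annotation. The sketch below is only an illustration: the category labels are ours, the EC numbers in the usage lines are arbitrary placeholders, and the authors' actual comparison procedure is not described at this level of detail.

```python
from typing import Optional, Set


def compare_annotation(new: Set[str], ref: Optional[Set[str]]) -> str:
    """Classify one gene by comparing its EC numbers in two annotations."""
    if not ref:                      # gene absent from the reference annotation
        return "new" if new else "unannotated"
    if not new:                      # annotated in the reference, ruled out here
        return "ruled_out"
    if new == ref:
        return "match"
    if new > ref:                    # reference is a proper subset of the new set
        return "more_in_new"
    if new < ref:                    # new set is a proper subset of the reference
        return "more_in_reference"
    return "different"


# Placeholder EC numbers, for illustration only.
superset_case = compare_annotation({"1.1.1.1", "1.1.1.2"}, {"1.1.1.1"})
ruled_out_case = compare_annotation(set(), {"1.1.1.1"})
```

The KLLA0B02717g case discussed above corresponds to the "more_in_new" outcome, since the single KEGG EC number is contained in the six assigned by the new annotation.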
[Table: New annotation versus KEGG annotation. Recoverable cells: S. cerevisiae homologue; glutamate synthase [NADH]]
Since KEGG’s annotation does not provide any transporter information, whenever a gene encoded a protein with both EC and TC(S) numbers, the transport system was ignored in the assessment, which helped to raise the number of matches between both annotations.
[Table: Summary of genes not available on KEGG’s annotation but annotated in this work. Recoverable column headers: Genes not annotated in KEGG; Number of genes. Recoverable row labels: EC numbers + TC(S) numbers; TC + TCS]
Comparison with UniProt
All 354 genes annotated with enzymatic functions by UniProt were included in the present annotation by the annotation pipeline, as described in Additional file 3: Table S5 of the supplemental material. For 48 of those genes more information was collected, either by adding more enzymatic functions (e.g. KLLA0E01959g was annotated with 126.96.36.199 by UniProt and with 188.8.131.52 and 184.108.40.206 in the new annotation) or by providing a complete EC number where UniProt has only a partial one (e.g. KLLA0B01265g is annotated with 3.2.2.- in UniProt and with 220.127.116.11 in the new annotation).
Comparison with BRENDA
The assessment of BRENDA’s annotation differed somewhat from that of the other annotations, as BRENDA does not provide gene information. Hence, the EC numbers provided by BRENDA were sought in the new annotation to confirm whether at least one gene encoded each enzyme. The new annotation included 34 of the 38 EC numbers associated with K. lactis in BRENDA, as depicted in Additional file 3: Table S6 of the supplemental material. The 4 remaining EC numbers were not found in the new annotation. One of them (18.104.22.168) was associated with K. lactis only because it has an annotation declaring that there is “no activity in Kluyveromyces lactis”. Another was from a plasmid, and the other two were from vectors inserted in a K. lactis strain to test the viability of the organism as a recombinant protein producer.
Homologues taxonomic distribution
Translated genomes of different organisms were used as reference when performing the homology-based genomic annotation. Thus, an analysis of the phylogenetic distribution of those genes was performed. The approach developed for the transport systems annotation does not allow this analysis to be performed because the database was small and thus the available organisms span was reduced, rendering such analysis too biased.
[Figure: Percentage of K. lactis genes annotated as S. cerevisiae or other organisms homologues. Recoverable labels: K. lactis genes with S. cerevisiae metabolic homologues; K. lactis genes with other homologue organisms; TCS families annotation; K. lactis TC annotation]
Kluyveromyces lactis, unlike S. cerevisiae, did not undergo whole genome duplication; nevertheless, it is likely that at least part of the 66 genes with repeated metabolic functions in K. lactis are a result of other gene duplication events.
The 4 genes (see Additional file 3: Table S1 of the supplemental material) reported in the transporter classification database (TCDB) were not inferred from another organism, and were thus not included in the annotation by homology to other organisms.
An example of a homologue from an organism other than S. cerevisiae is the LAC4 gene (KLLA0B14883g), which encodes β-galactosidase (3.2.1.23; Escherichia coli homologue, see Additional file 3: Table S7 of the supplemental material). This enzyme gives K. lactis the ability to convert lactose into galactose and glucose, and hence to use lactose as sole carbon source.
The genes annotated by homology to organisms other than S. cerevisiae constitute less than 3% (43 genes) of the K. lactis genes annotated with metabolic functions. Additional file 3: Table S7 of the supplemental material lists the 25 organisms (other than S. cerevisiae) used for the new annotation of those 43 K. lactis genes, as well as the distinct EC numbers encoded by such genes. Five of the 25 aforementioned organisms belong to the Bacteria superkingdom. Although K. lactis is included in the Eukaryota superkingdom, along with the remaining 20 organisms, previous works have demonstrated the relevance of horizontal gene transfer from prokaryotic to fungal genomes [61, 62].
[Table 5: K. lactis genes which encode enzymes not available in the baker's yeast genome. Recoverable cells: K. lactis tag; pseudouridine kinase; Escherichia coli (strain K12); Escherichia coli O6; carbonyl reductase (NADPH), sorbose reductase]
As shown in Table 5, the Schizosaccharomyces pombe homologues lead the group of functions not available in S. cerevisiae, with five enzymes: D-amino-acid oxidase (1.4.3.3), pseudouridine kinase (2.7.1.83), membrane dipeptidase (3.4.13.19), hydroxyisourate hydrolase (3.5.2.17) and agmatinase (3.5.3.11), which hydrolyses agmatine to putrescine and urea. Also, Kluyveromyces marxianus provides the homologue of the gene encoding β-glucosidase (3.2.1.21), which releases β-D-glucose from polysaccharides containing glucose. The Mortierella isabellina genome has a gene that encodes the δ-12 fatty acid desaturase (126.96.36.199), which catalyses the desaturation of oleic acid to linoleic acid, and K. lactis has two homologues of this gene (KLLA0B00473g and KLLA0F07095g). Escherichia coli strains provide homologues for two K. lactis genes not present in the S. cerevisiae genome: cyclopropane fatty acid synthase (2.1.1.79) and the aforementioned β-galactosidase (3.2.1.23).
Annotation scheme and manual curation results
merlin’s automatic scored similarity results were manually curated by the authors, using the annotation pipeline described in the Methods section. The outcome of this classification is shown in Additional file 3: Table S8 of the supplemental material. It represents the results obtained using the cross classification developed and applied throughout this work. This table shows that most annotations were supported by all databases (SGD, UniProt and BRENDA), which means that the present annotation is robust and supported by information from several data sources.
Also, almost half (calculation details in Additional file 2) of the incorrect merlin automated gene annotations were reclassified using BRENDA. Most of the reclassifications dictated by BRENDA corresponded to partial EC numbers for which a complete EC number was now available in BRENDA.
BRENDA was also important for other reasons. For example, one K. lactis gene with a baker’s yeast homologue was assigned a completely different function in each genome. The K. lactis XYL1 gene (KLLA0E21627g – 220.127.116.117) is homologous to the S. cerevisiae GRE3 gene (YHR104W – 18.104.22.1686). However, in K. lactis it encodes an NADPH-dependent D-xylose reductase, whereas in S. cerevisiae it encodes an NADPH-dependent aldose reductase. This is a major difference because baker’s yeast, despite having xylose transporters, cannot use xylose as the sole carbon source. The XYL1 gene is identified in UniProt [Swiss-Prot: P49378] as a “NAD(P)H-dependent D-xylose reductase”; yet UniProt provides only a partial EC number (1.1.1.-) and KEGG annotates the gene as a hypothetical protein [KEGG: kla:KLLA0E21627g]. BRENDA was used to confirm EC number assignments, as it describes the reactions catalysed by those enzymes, allowing a more precise gene annotation.
Another carbon source that S. cerevisiae is unable to metabolise is lactose. In this case, however, the gene did not have a baker’s yeast homologue (it was an Escherichia coli homologue). That gene, well known to be encoded in K. lactis, is the previously mentioned LAC4 gene (β-galactosidase – 3.2.1.23).
Additional file 3: Table S9 of the supplemental material lists the seven genes for which literature was considered during the annotation process. The curation of those genes was based on the authors’ previous knowledge of specificities of Kluyveromyces lactis metabolism.
Assignment of enzyme commission numbers
More than 80% of the genes to which a metabolic function was assigned were classified with at least one EC number. Indeed, as shown in Table 6, 1325 (1107 + 218) genes were assigned only one EC number (monofunctional genes). Nevertheless, three other gene groups were identified while classifying the protein encoding genes, giving 4 distinct groups:
monofunctional genes
multifunctional genes
multiclass genes
genes with EC and TC(S) numbers
[Table 6: Enzyme encoding genes classification. Recoverable row label: with TC(S) number]
The multifunctional genes set includes enzyme encoding genes that were assigned with two or more EC numbers of the same class, according to the Enzyme Commission classification (e.g. KLLA0F20163g - 126.96.36.199, 188.8.131.52). The multiclass genes encompassed enzyme encoding genes assigned with EC numbers classified in more than one class. For the last subgroup, the approach was somewhat different. The proteins may not have various functions, but had at least one EC number and one TC(S) number assigned to them. Hence, despite the distinctive classification, the function of the protein may well be the same in both classification systems.
Regardless of the previous sorting, the genes were also divided into two major categories: those that encoded enzymes with complete EC numbers (e.g. 184.108.40.206) and those that encoded enzymes with partial EC numbers (e.g. 1.-.-.-). These two categories were then subdivided into the four sets presented above, as depicted in Table 6. Thus, any gene that encoded at least one enzyme with a partial EC number was clustered with the partial entries, even if it was simultaneously classified with TC(S) numbers.
Finally, the gene assignments were also cross-classified according to the EC class of the encoded proteins, those being Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases and Ligases.
The cross-classification by EC class of the enzyme encoding genes assigned to the multiclass group followed a simple rule. Each gene was assigned to the subgroup of whichever enzyme was annotated first, because that function was assumed to be the main function (e.g. gene KLLA0E15357g is associated with EC numbers 6.3.5.5 and 2.1.3.2; the gene was assigned to the Ligases multiclass group instead of the Transferases multiclass group because the ligase function is assumed to be more significant). The final result of all cross-classifications is presented in Table 6.
As depicted in Table 6, most of the identified complete monofunctional genes encode Transferases. On the other hand, most of the genes that encode enzymes for which only a partial EC number is available are Hydrolases. Table 6 also indicates that Oxidoreductases, Transferases and Hydrolases represent almost 85% of the identified enzyme encoding genes. Thus, Lyases, Isomerases and Ligases represent just a small share of this organism’s enzyme encoding genes.
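The grouping rules described in this section can be summarised in a short sketch. The function name and the returned labels are ours, and giving the EC and TC(S) group priority over the other three groups is an assumption, since the text does not state the precedence. The usage line takes the standard EC numbers of carbamoyl-phosphate synthase (6.3.5.5) and aspartate carbamoyltransferase (2.1.3.2), the two activities attributed to KLLA0E15357g.

```python
from typing import Dict, List

EC_CLASSES = {
    "1": "Oxidoreductases", "2": "Transferases", "3": "Hydrolases",
    "4": "Lyases", "5": "Isomerases", "6": "Ligases",
}


def classify_gene(ec_numbers: List[str], has_tc: bool = False) -> Dict[str, str]:
    """Assign a gene to a group, a completeness category and an EC class."""
    if not ec_numbers:
        raise ValueError("at least one EC number is required")
    classes = [ec.split(".")[0] for ec in ec_numbers]
    if has_tc:                          # assumption: this group takes priority
        group = "EC and TC(S)"
    elif len(ec_numbers) == 1:
        group = "monofunctional"
    elif len(set(classes)) == 1:        # several EC numbers, same EC class
        group = "multifunctional"
    else:                               # EC numbers from more than one class
        group = "multiclass"
    # A single partial EC number places the gene with the partial entries.
    completeness = "partial" if any("-" in ec for ec in ec_numbers) else "complete"
    # The first annotated enzyme decides the EC class of the gene.
    return {"group": group, "completeness": completeness,
            "ec_class": EC_CLASSES[classes[0]]}


ura2_like = classify_gene(["6.3.5.5", "2.1.3.2"])
```

Because the class is taken from the first EC number in the list, the order in which the enzymes are annotated matters; reversing the two numbers in the example would place the gene among the Transferases.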
Most enzyme encoding genes were assigned with just one EC number (1325 genes), which means that such genes are monofunctional. Still, 218 genes encoding monofunctional enzymes have only partial EC numbers assigned. Thus, either the catalysed reactions are not completely known (and therefore the enzymes may be either mono or multifunctional), or the catalysed reaction is well known but the EC number has not been assigned yet.
2.3.1.23 - acyl-CoA + 1-acyl-sn-glycero-3-phosphocholine = CoA + 1,2-diacyl-sn-glycero-3-phosphocholine
2.3.1.51 - acyl-CoA + 1-acyl-sn-glycerol 3-phosphate = CoA + 1,2-diacyl-sn-glycerol 3-phosphate
These enzymes are O-acyltransferases that mediate the incorporation of unsaturated acyl chains into the sn-2 position of phospholipids.
There were also 22 genes in the K. lactis genome that encoded multiclass enzymes due to their diversified catalytic activity. For example, as previously mentioned, the gene KLLA0E15357g (6.3.5.5, 2.1.3.2) encodes the homologue of the S. cerevisiae URA2 gene product. This protein catalyses the first two enzymatic steps in the de novo biosynthesis of pyrimidines. First, carbamoyl-phosphate synthase (6.3.5.5) hydrolyses L-glutamine to form carbamoyl phosphate. Next, aspartate carbamoyltransferase (2.1.3.2) combines the carbamoyl phosphate formed in the previous reaction with L-aspartate, generating N-carbamoyl-L-aspartate and releasing one phosphate molecule. Hence, the gene was classified in the Ligases sub-group.
Assignment of transporter classification numbers
Throughout this work, some proteins encoded in the milk yeast genome were classified with both EC and TC(S) numbers. In some cases, both classification systems assigned the same function to the protein. An example is the gene KLLA0F20658g, which encodes the sodium transport ATPase ENA1 (S. cerevisiae homologue). The protein was annotated with the EC number 22.214.171.124 (Na+-exporting ATPase) and, in the transporter classification database, as belonging to the 3.A.3 P-type ATPase (P-ATPase) Superfamily (3.A.3.#.#), which includes proteins that promote the exchange or efflux of cations such as sodium. The transported metabolites analysis provided by merlin (to be published together with the transport classification methodology in Genome-wide semi-automated annotation of transporter systems, Dias et al., 2012) confirms that this gene facilitates the efflux of sodium ions, among the transport of other cations.
The classification of two thirds of the transport systems available in K. lactis in the Porters sub-class suggests that this microorganism may be able to control the uptake and efflux of the nutrients, providing the organism with the ability to be selective about the carbon source it will use.
Additional file 3: Table S10 of the supplemental material also shows that at least 21 genes encoding broad sugar porters were identified, as well as several transport systems for alcohols, organic acids, nitrogen sources and amino acids. It is accepted that non-ionized organic acids can penetrate cell walls by passive diffusion. Thus, evidence of organic acid transport systems may be related to the transport of ionized organic acids and to the need to control the uptake or excretion of those compounds.
Furthermore, Kluyveromyces lactis can use several alcohols as carbon sources, as demonstrated in [66–71]. Some of those alcohols are known as sugar alcohols (polyols) and are transported by the sugar porter family transport systems (2.A.1.1.#) encoded in genes KLLA0E06755g and KLLA0E01783g. Glycerol transport systems were also identified during the course of this work, in four genes (KLLA0A03223g - 2.A.1.1.#, KLLA0F26246g - 2.A.1.1.#, KLLA0E19185g - 2.A.50.1.#, KLLA0E00617g - 1.A.8.5.#).
KEGG pathways annotation analysis
[Table: Number of enzymes in each Global pathway. Recoverable column headers: Identified by KEGG; Identified in new annotation; Unidentified in K. lactis. Recoverable row labels: 01100 Metabolic pathways; 01110 Biosynthesis of secondary metabolites; 01120 Microbial metabolism in diverse environments]
The new annotation provides new insights into the metabolic capabilities of K. lactis, as it brings new information to the KEGG pathways, identifying several new enzymes in 56 KEGG metabolic pathways. Indeed, only 45 of those pathways are recognised by KEGG as K. lactis pathways. Thus, the other 11 pathways should be studied further to assess whether the milk yeast uses them to metabolise compounds, offering investigators new research opportunities.
Nevertheless, the new annotation also identified new enzymes that are not allocated to any pathway and proteins associated only with TC numbers.
Analysis of the Annotation of the Kluyveromyces lactis central carbon metabolism
The central carbon metabolism is a collection of pathways mainly composed of three routes, namely the Embden-Meyerhof-Parnas (EMP) Pathway, the Pentose Phosphate Pathway and the TCA Cycle. The new annotation presented in this work was able to identify the genes involved in these pathways.
The EMP pathway converts glucose to pyruvate, generating small amounts of ATP and NADH in the process. Glucose is taken up by hexose transporters such as RAG1 (KLLA0D13310g), HGT1 (KLLA0A11110g), KHT1 or KHT2. In some strains RAG1 is the only low-affinity glucose transporter, whereas in other strains this function is shared by two genes (KHT1, KHT2). The strain studied throughout this work, Kluyveromyces lactis NRRL Y-1140 (CBS 2359), encodes the RAG1 gene.
The EMP pathway has only one hexokinase, RAG5 (KLLA0D11352g) which was identified in the new annotation. Breunig and Steensma (2003) confirm that it is the only hexokinase encoding gene, unlike in the case of S. cerevisiae, which has three hexokinases. RAG5 is an essential gene because its absence inhibits growth on glucose, fructose and higher sugars that produce these isomers. Glucose-6-phosphate isomerase RAG2 (KLLA0E23519g) was also identified in the new annotation, and, although K. lactis has only one phosphoglucose isomerase, RAG2 mutants grow well in glucose. Hence, RAG2 is not an essential gene.
Several ATP and NADH molecules are formed in the second half of glycolysis, which is known as the pay-off phase. NADPH dehydrogenases that are not present in baker’s yeast have been reported in the milk yeast genome. These enzymes, NDE1 (KLLA0E21891g) and NDE2 (KLLA0A08316g), were indeed annotated in the present work. Both re-oxidise NADH as well as NADPH. NDE1’s ability to bind NADPH was verified experimentally, whereas NDE2 was reported to have a less important role in NADPH re-oxidation. NDI1 (KLLA0C06336g) also encodes a mitochondrial internal NADH oxidoreductase, though this enzyme does not oxidise NADPH. None of NDE1, NDE2 and NDI1 is annotated in UniProt, and all three are incorrectly annotated in KEGG.
The re-oxidation of NAD(P)H by mitochondrial external dehydrogenases supports the high activity of the pentose phosphate pathway, and the ability of the K. lactis RAG2 mutants to grow on glucose.
In Crabtree negative yeasts, such as K. lactis, ethanol formation only sets in when the oxygen supply becomes limiting. According to Van Urk et al. (1989), Crabtree negative yeasts can prevent overflow metabolism by regulating glucose uptake, using the available symport mechanisms to control the amount of glucose entering the cells. The new annotation provided by this work shows that more than 65% of the identified transport systems were classified in 2.A – Porters (uniporters, symporters, antiporters), allowing K. lactis to regulate nutrient uptake and efflux. However, Breunig and Steensma state that the regulation of glucose uptake is not enough to explain the Crabtree negative phenotype. Only when the pyruvate dehydrogenase (Pdh) complex is down-regulated or blocked can pyruvate decarboxylase (Pdc) divert pyruvate to acetaldehyde, which is subsequently reduced to ethanol. The first step of alcoholic fermentation, which only occurs at low oxygen concentrations, is promoted by the pyruvate decarboxylase PDC1 (identified in gene KLLA0E16303g). Therefore, a null mutation in PDA1 (KLLA0F12001g), the gene that encodes the α subunit of the E1 component of the Pdh complex, can constrain growth on glucose, as PDA1 mutants show high ethanol formation. The β subunit was also identified in the new annotation (gene PDB1_KLULA – KLLA0F09603g), although it is not identified in UniProt. This phenotype suggests that high Pdh activity is the reason for the Crabtree negative phenotype exhibited by the wild type strain.
The lactose metabolism of Kluyveromyces lactis has been well studied, because it is a distinctive characteristic among yeasts. Lactose uptake is performed by the specific permease LAC12 (KLLA0B14861g), and lactose is hydrolysed into glucose and galactose by the β-galactosidase LAC4 (KLLA0B14883g). The lactose metabolism is induced by both lactose and galactose. Galactose is converted into galactose-1-phosphate by the galactokinase GAL1 (KLLA0F08393g). Then, the enzyme UDP-glucose-hexose-1-phosphate uridylyltransferase, encoded by GAL7 (KLLA0F08437g), takes UDP-glucose and α-D-galactose-1-phosphate to synthesize α-D-glucose-1-phosphate and UDP-galactose. The UDP-galactose formed by this reaction is converted back to UDP-glucose by the product of the bifunctional GAL10 gene. This gene encodes two enzymes: UDP-glucose 4-epimerase, which performs this conversion, and aldose 1-epimerase, which converts α-D-glucose into β-D-glucose. All of the genes described above were annotated in this work.
Assessing the agreement of the new annotation to a previous comparison of the Kluyveromyces lactis genome to the one of Saccharomyces cerevisiae
In 1998 Ozier-Kalogeropoulos et al. studied the Kluyveromyces lactis unsequenced genome, and identified 296 K. lactis genes with homology to the baker’s yeast. The exploration of the genome was random, thus several types of genes were identified.
All S. cerevisiae genes identified in that study were reviewed in UniProt (SGD does not provide an application programming interface to expedite the retrieval of results) to identify genes with metabolic (enzymatic or transport) functions. As depicted in Additional file 3: Table S12 of the supplemental material, 113 of those S. cerevisiae genes had metabolic functions. The 113 metabolic genes identified in that study, and the corresponding milk yeast homologues, were predicted by the new annotation, except for four baker’s yeast transport systems, which were not identified because the corresponding K. lactis homologues did not have any transmembrane domain.
Again, that work was in agreement with the results obtained with the approach undertaken throughout this study.
In conclusion, these examples illustrate that the new annotation not only confirms pre-sequencing knowledge but also adds new gene annotations to the information currently available in databases such as KEGG or UniProt.
Since the genome sequence of K. lactis was published in 2004, the proteins encoded in the Kluyveromyces lactis genome had never been thoroughly reviewed and annotated; or at least this information was not published, to our knowledge.
In this work, 2000 genes with potential to be assigned with metabolic functions within the proteins encoded in the Kluyveromyces lactis genome were studied. Most of those, specifically 87.95% (1759 genes), were indeed classified as metabolic genes. The metabolic genes could be exclusively enzymatic (1410 genes), transporter proteins (301 genes) or have both metabolic activities (48 genes). The new annotation proposed in this work could only be accomplished as merlin provided semi-automatic scored results. Such results were then reviewed in other databases such as UniProt or BRENDA to maximize the confidence in the results. The new annotation includes novelties, such as the assignment of transporter superfamily numbers to genes identified as transporter proteins. Moreover, it was demonstrated that Oxidoreductases, Transferases and Hydrolases represent almost 85% of the identified enzymes. When the new annotation is compared to the annotations currently available in some databases, it is shown to be broader and reliable, as it encompasses most of the metabolic information in such databases.
Furthermore, the new annotation of the K. lactis metabolic genome confirmed the predictions of pre-genome sequencing studies. One of those studies compared random sequences of the K. lactis genome to the S. cerevisiae sequenced genome. All metabolic genes in that study were identified in the new annotation.
Also, the central carbon pathways were revised in this work to assess the robustness of the new annotation. The new annotation was in agreement with several publications that study Kluyveromyces lactis’ phenotypical behaviour.
The new annotation provided by this study, available in Additional file 4 in GenBank format, yields basic knowledge which might be useful for the scientific community working on this model yeast, as new functions have been identified for the so-called metabolic genes.
The methodology used throughout this work can be used by other groups to annotate other organisms and build a robust genome-scale model.
Furthermore, the new annotation served as the basis for the reconstruction of a compartmentalized, genome-scale metabolic model of Kluyveromyces lactis, which is currently being finished.
Abbreviations
- BLAST : Basic local alignment search tool
- BRENDA : BRaunschweig ENzyme DAtabase
- EEGC : Enzyme encoding gene candidate
- IUBMB : International union of biochemistry and molecular biology
- merlin : MEtabolic models reconstruction using genome-scaLe information
- NCBI : National center for biotechnology information
- nrDB : All non-redundant sequences
- PDB : Brookhaven protein data
- PIR : Protein information resource
- PRF : Protein research foundation
- SGD : Saccharomyces genome database
- TCDB : Transporter classification database
- TCS : Transporter classification superfamily
- TPGC : Transporter protein encoding gene candidates
- yeastDB :
- UniProt : Universal protein resource.
Acknowledgements and funding
This work was partially supported by the MIT-Portugal Program in Bioengineering (MIT-Pt/BS-BB/0082/2008) and a PhD grant (SFRH / BD / 47307 / 2008) from Portuguese FCT (Fundação para a Ciência e Tecnologia).
References
- Schaffrath R, Breunig KD: Genetics and molecular physiology of the yeast Kluyveromyces lactis. Fungal Genet Biol. 2000, 30: 173-190. 10.1006/fgbi.2000.1221.
- Fukuhara H: Kluyveromyces lactis - a retrospective. FEMS Yeast Res. 2006, 6: 323-324. 10.1111/j.1567-1364.2005.00012.x.
- van Ooyen AJJ, Dekker P, Huang M, Olsthoorn MMA, Jacobs DI, Colussi PA, Taron CH: Heterologous protein production in the yeast Kluyveromyces lactis. FEMS Yeast Res. 2006, 6: 381-392. 10.1111/j.1567-1364.2006.00049.x.
- Becerra M, Prado SD, Siso MI, Cerdán ME: New secretory strategies for Kluyveromyces lactis beta-galactosidase. Protein Eng. 2001, 14: 379-386. 10.1093/protein/14.5.379.
- Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H, Oliver SG: Life with 6000 Genes. Science. 1996, 274: 546-567. 10.1126/science.274.5287.546.
- De Deken RH: The Crabtree effect: a regulatory system in yeast. J Gen Microbiol. 1966, 44: 149-156.
- Merico A, Galafassi S, Piskur J, Compagno C: The oxygen level determines the fermentation pattern in Kluyveromyces lactis. FEMS Yeast Res. 2009, 9: 749-756. 10.1111/j.1567-1364.2009.00528.x.
- Snoek ISI, Steensma HY: Why does Kluyveromyces lactis not grow under anaerobic conditions? Comparison of essential anaerobic genes of Saccharomyces cerevisiae with the Kluyveromyces lactis genome. FEMS Yeast Res. 2006, 6: 393-403. 10.1111/j.1567-1364.2005.00007.x.
- Hnatova M, Wesolowski-Louvel M, Dieppois G, Deffaud J, Lemaire M: Characterization of KlGRR1 and SMS1 genes, two new elements of the glucose signaling pathway of Kluyveromyces lactis. Eukaryotic Cell. 2008, 7: 1299-1308. 10.1128/EC.00454-07.
- Micolonghi C, Wésolowski-Louvel M, Bianchi MM: The Rag4 glucose sensor is involved in the hypoxic induction of KlPDC1 gene expression in the yeast Kluyveromyces lactis. Eukaryotic Cell. 2011, 10: 146-148. 10.1128/EC.00251-10.
- Bao W-G, Guiard B, Fang Z-A, Donnini C, Gervais M, Lopes Passos FM, Ferrero I, Fukuhara H, Bolotin-Fukuhara M: Oxygen-dependent transcriptional regulator hap1p limits glucose uptake by repressing the expression of the major glucose transporter gene RAG1 in Kluyveromyces lactis. Eukaryotic Cell. 2008, 7: 1895-1905. 10.1128/EC.00018-08.
- Raimondi S, Zanni E, Talora C, Rossi M, Palleschi C, Uccelletti D: SOD1, a new Kluyveromyces lactis helper gene for heterologous protein secretion. Appl Environ Microbiol. 2008, 74: 7130-7137. 10.1128/AEM.00955-08.
- Ganatra MB, Vainauskas S, Hong JM, Taylor TE, Denson J-PM, Esposito D, Read JD, Schmeisser H, Zoon KC, Hartley JL, Taron CH: A set of aspartyl protease-deficient strains for improved expression of heterologous proteins in Kluyveromyces lactis. FEMS Yeast Res. 2011, 11: 168-178. 10.1111/j.1567-1364.2010.00703.x.
- Liu B, Gong X, Chang S, Yang Y, Song M, Duan D, Wang L, Ma Q, Wu J: Disruption of the OCH1 and MNN1 genes decrease N-glycosylation on glycoprotein expressed in Kluyveromyces lactis. J Biotechnol. 2009, 143: 95-102. 10.1016/j.jbiotec.2009.06.016.
- Gonzalez-Siso MI, Garcia-Leiro A, Tarrio N, Esperanza Cerdan M: Sugar metabolism, redox balance and oxidative stress response in the respiratory yeast Kluyveromyces lactis. Microbial Cell Factories. 2009, 8: 46. 10.1186/1475-2859-8-46.
- Garcia-Leiro A, Cerdan ME, Gonzalez-Siso MI: Proteomic Analysis of the Oxidative Stress Response in Kluyveromyces lactis and Effect of Glutathione Reductase Depletion. J Proteome Res. 2010, 9: 2358-2376. 10.1021/pr901086w.
- Fang Z-A, Wang G-H, Chen A-L, Li Y-F, Liu J-P, Li Y-Y, Bolotin-Fukuhara M, Bao W-G: Gene responses to oxygen availability in Kluyveromyces lactis: an insight on the evolution of the oxygen-responding system in yeast. PLoS One. 2009, 4: e7561. 10.1371/journal.pone.0007561.
- Bussereau F, Casaregola S, Lafay JF, Bolotin-Fukuhara M: The Kluyveromyces lactis repertoire of transcriptional regulators. FEMS Yeast Res. 2006, 6: 325-335. 10.1111/j.1567-1364.2006.00028.x.
- Backhaus K, Heilmann CJ, Sorgo AG, Purschke G, de Koster CG, Klis FM, Heinisch JJ: A systematic study of the cell wall composition of Kluyveromyces lactis. Yeast. 2010, 27: 647-660. 10.1002/yea.1781.
- Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997, 387: 708-713. 10.1038/42711.
- Kurtzman CP: Phylogenetic circumscription of Saccharomyces, Kluyveromyces and other members of the Saccharomycetaceae, and the proposal of the new genera Lachancea, Nakaseomyces, Naumovia, Vanderwaltozyma and Zygotorulaspora. FEMS Yeast Res. 2003, 4: 233-245. 10.1016/S1567-1356(03)00175-2.
- Lachance M-A: Current status of Kluyveromyces systematics. FEMS Yeast Res. 2007, 7: 642-645. 10.1111/j.1567-1364.2006.00197.x.
- Steensma HY, Ter Linde JJM: Plasmids with the Cre-recombinase and the dominant nat marker, suitable for use in prototrophic strains of Saccharomyces cerevisiae and Kluyveromyces lactis. Yeast. 2001, 18: 469-472. 10.1002/yea.696.
- Kooistra R, Hooykaas PJJ, Steensma HY: Efficient gene targeting in Kluyveromyces lactis. Yeast. 2004, 21: 781-792. 10.1002/yea.1131.View ArticlePubMedGoogle Scholar
- Dujon B, Sherman DJ, Fischer G, Durrens P, Casaregola S, Lafontaine I, de Montigny J, Marck C, Neuveglise C, Talla E: Others: genome evolution in yeasts. Nature. 2004, 430: 35-44. 10.1038/nature02579.View ArticlePubMedGoogle Scholar
- Dujon B: Hemiascomycetous yeasts at the forefront of comparative genomics. Curr Opin Genet Dev. 2005, 15: 614-620. 10.1016/j.gde.2005.09.005.View ArticlePubMedGoogle Scholar
- Richard G-F, Dujon B: Molecular evolution of minisatellites in hemiascomycetous yeasts. Mol Biol Evol. 2006, 23: 189-202.View ArticlePubMedGoogle Scholar
- De Hertogh B, Hancy F, Goffeau A, Baret PV: Emergence of species-specific transporters during evolution of the hemiascomycete phylum. Genet. 2006, 172: 771-781.View ArticleGoogle Scholar
- Gbelska Y, Krijger JJ, Breunig KD: Evolution of gene families: the multidrug resistance transporter genes in five related yeast species. Fems Yeast Res. 2006, 6: 345-355. 10.1111/j.1567-1364.2006.00058.x.View ArticlePubMedGoogle Scholar
- Wong S, Wolfe KH, et al: Duplication of genes and genomes in yeasts. Univ Dublin Trinity Coll, Dept Genet, Smurfit Inst, Dublin 2, Ireland. 2006, Pringer-Verlag Berlin, Heidelberger Platz 3, Berlin, Germany, 79-99. 15Google Scholar
- Bolotin-fukuhara M, Casaregola S, Aigle M: Genome evolution: Lessons from Genolevures. 2006, D-14197 Berlin, Germany: Springer-Verlag Berlin, Heidelberger Platz 3, 165-196. Univ Bordeaux 2, CNRS, IBGC, Rue Camille St Saens, F-233077 Bordeaux, FranceGoogle Scholar
- Seret M-L, Diffels JF, Goffeau A, Baret PV: Combined phylogeny and neighborhood analysis of the evolution of the ABC transporters conferring multiple drug resistance in hemiascomycete yeasts. BMC Genomics. 2009, 10: 459-10.1186/1471-2164-10-459.PubMed CentralView ArticlePubMedGoogle Scholar
- Souciet J-L, Dujon B, Gaillardin C, Johnston M, Baret PV, Cliften P, Sherman DJ, Weissenbach J, Westhof E, Wincker P, Jubin C, Poulain J, Barbe VV, Segurens B, Artiguenave FF, Anthouard VV, Vacherie B, Val M-E, Fulton RS, Minx P, Wilson R, Durrens P, Jean GG, Marck C, Martin T, Nikolski M, Rolland T, Seret M-L, Casaregola S, Despons L, et al: Comparative genomics of protoploid Saccharomycetaceae. Genome Res. 2009, 19: 1696-1709.PubMed CentralView ArticlePubMedGoogle Scholar
- Ouzounis CA, Karp PD: The past, present and future of genome-wide re-annotation. Genome Biol. 2002, 3: 2001-CommentView ArticleGoogle Scholar
- Gundogdu O, Bentley SD, Holden MT, Parkhill J, Dorrell N, Wren BW: Re-annotation and re-analysis of the Campylobacter jejuni NCTC11168 genome sequence. BMC Genomics. 2007, 8: 162-10.1186/1471-2164-8-162.PubMed CentralView ArticlePubMedGoogle Scholar
- Camus JC, Pryor MJ, Medigue C, Cole ST: Re-annotation of the genome sequence of Mycobacterium tuberculosis H37Rv. Microbiol-Sgm. 2002, 148: 2967-2973.View ArticleGoogle Scholar
- Haas BJ, Wortman JR, Ronning CM, Hannick LI, Smith RK, Maiti R, Chan AP, Yu C, Farzad M, Wu D, White O, Town CD: Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release. BMC Biol. 2005, 3: 7-10.1186/1741-7007-3-7.PubMed CentralView ArticlePubMedGoogle Scholar
- Benson DA: GenBank. Nucleic Acids Res. 2000, 28: 15-18. 10.1093/nar/28.1.15.PubMed CentralView ArticlePubMedGoogle Scholar
- The NCBI Handbook: National Library of Medicine (US). 2002, USA: National Center for Biotechnology InformationGoogle Scholar
- Barrett AJ, Canter CR, Liebecq C, Moss GP, Saenger W, Sharon N, Tipton KF, Vnetianer P, Vliegenthart VFG: Enzyme Nomenclature. 1992, San Diego: Academic Press, 862-Google Scholar
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27.PubMed CentralView ArticlePubMedGoogle Scholar
- Apweiler R, Martin MJ, O’Donovan C, et al: Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011, 39: D214-D219.View ArticleGoogle Scholar
- Edwards JS, Palsson BO: The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proc Natl Acad Sci U S A. 2000, 97: 5528-5533. 10.1073/pnas.97.10.5528.PubMed CentralView ArticlePubMedGoogle Scholar
- Förster J, Famili I, Fu P, Palsson BØ, Nielsen J: Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. 2003, 13: 244-253. 10.1101/gr.234503.PubMed CentralView ArticlePubMedGoogle Scholar
- Rocha I, Förster J, Nielsen J: Design and application of genome-scale reconstructed metabolic models. Methods in Mol Biol (Clifton, N.J.). 2008, 416: 409-431. 10.1007/978-1-59745-321-9_29.View ArticleGoogle Scholar
- Asadollahi MA, Maury J, Patil KR, Schalk M, Clark A, Nielsen J: Enhancing sesquiterpene production in Saccharomyces cerevisiae through in silico driven metabolic engineering. Metab Eng. 2009, 11: 328-334. 10.1016/j.ymben.2009.07.001.View ArticlePubMedGoogle Scholar
- Lee SJ, Lee DY, Kim TY, Kim BH, Lee JW, Lee SY: Metabolic engineering of Escherichia coli for enhanced production of succinic acid, based on genome comparison and in silico gene knockout simulation. Appl Environ Microbiol. 2005, 71: 7880-7887. 10.1128/AEM.71.12.7880-7887.2005.PubMed CentralView ArticlePubMedGoogle Scholar
- Terstappen GC, Reggiani A: In silico research in drug discovery. Trends In Pharmacological Scie. 2001, 22: 23-26.View ArticleGoogle Scholar
- Thiele I, Palsson BØ: A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc. 2010, 5: 93-121.PubMed CentralView ArticlePubMedGoogle Scholar
- Saier MH: A functional-phylogenetic classification system for transmembrane solute transporters. Microbiol Mol Biol Rev: Mmbr. 2000, 64: 354-411. 10.1128/MMBR.64.2.354-411.2000.PubMed CentralView ArticlePubMedGoogle Scholar
- Dias O, Rocha M, Ferreira EC, Rocha I: Merlin: Metabolic models reconstruction using genome-scale information. Proceedings of the 11th International Symposium on Computer Applications in Biotechnology (CAB 2010). Edited by: Banga JR, Bogaerts P, Impe JFM, Dochain D, Smets I. 2010, Leuven, Belgium: Oude Valk College, 120-125.Google Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.View ArticlePubMedGoogle Scholar
- Cherry J: SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998, 26: 73-79. 10.1093/nar/26.1.73.PubMed CentralView ArticlePubMedGoogle Scholar
- Scheer M, Grote A, Chang A, Schomburg I, Munaretto C, Rother M, Söhngen C, Stelzer M, Thiele J, Schomburg D: BRENDA, the enzyme information system in 2011. Nucleic Acids Res. 2011, 39: D670-D676. 10.1093/nar/gkq1089.PubMed CentralView ArticlePubMedGoogle Scholar
- Saier MH, Tran CV, Barabote RD: TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic Acids Res. 2006, 34: D181-D186. 10.1093/nar/gkj001.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195-197. 10.1016/0022-2836(81)90087-5.View ArticlePubMedGoogle Scholar
- Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305: 567-580. 10.1006/jmbi.2000.4315.View ArticlePubMedGoogle Scholar
- Jablonowski D, Schaffrath R: Zymocin, a composite chitinase and tRNase killer toxin from yeast. Biochem Soc Trans. 2007, 35: 1533-1537. 10.1042/BST0351533.View ArticlePubMedGoogle Scholar
- Vohra A, Satyanarayana T: Phytases: Microbial sources, production, purification, and potential biotechnological applications. Crit Rev Biotechnol. 2003, 23: 29-60. 10.1080/713609297.View ArticlePubMedGoogle Scholar
- Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004, 428: 617-624. 10.1038/nature02424.View ArticlePubMedGoogle Scholar
- Rolland T, Neuvéglise C, Sacerdot C, Dujon B: Insertion of horizontally transferred genes within conserved syntenic regions of yeast genomes. PLoS One. 2009, 4: e6515-10.1371/journal.pone.0006515.PubMed CentralView ArticlePubMedGoogle Scholar
- Marcet-Houben M, Gabaldón T: Acquisition of prokaryotic genes by fungal genomes. Trends in Genet: TIG. 2010, 26: 5-8. 10.1016/j.tig.2009.11.007.View ArticlePubMedGoogle Scholar
- Billard P, Ménart S, Fleer R, Bolotin-Fukuhara M: Isolation and characterization of the gene encoding xylose reductase from Kluyveromyces lactis. Gene. 1995, 162: 93-97. 10.1016/0378-1119(95)00294-G.View ArticlePubMedGoogle Scholar
- Law CJ, Maloney PC, Wang D-N: Ins and outs of major facilitator superfamily antiporters. Annu Rev Microbiol. 2008, 62: 289-305. 10.1146/annurev.micro.61.080706.093329.PubMed CentralView ArticlePubMedGoogle Scholar
- Sable HZ: Letter: Transport of organic acids across cell membrane. N Engl J Med. 1974, 291: 582-PubMedGoogle Scholar
- Breunig KD, Dahlems U, Das S, Hollenberg CP: Analysis of a eukaryotic beta-galactosidase gene: the N-terminal end of the yeast Kluyveromyces lactis protein shows homology to the Escherichia coli lacZ gene product. Nucleic Acids Res. 1984, 12: 2327-2341. 10.1093/nar/12.5.2327.PubMed CentralView ArticlePubMedGoogle Scholar
- Dickson RC, Barr K: Characterization of lactose transport in Kluyveromyces lactis. J Bacteriol. 1983, 154: 1245-1251.PubMed CentralPubMedGoogle Scholar
- Entiani K-D, Barnett JA: Some genetical and biochemical attempts to elucidate the energetics of sugar uptake and explain the Kluyver effect in the yeast Kluyveromyces lactis. Curr Genet. 1983, 7: 323-325. 10.1007/BF00376078.View ArticlePubMedGoogle Scholar
- Lodi T, Saliola M, Donnini C, Goffrini P: Three target genes for the transcriptional activator Cat8p of Kluyveromyces lactis: Acetyl coenzyme A synthetase genes KlACS1 and KlACS2 and lactate permease gene KlJEN1. J Bacteriol. 2001, 183: 5257-5261. 10.1128/JB.183.18.5257-5261.2001.PubMed CentralView ArticlePubMedGoogle Scholar
- Lopez ML, Redruello B, Valdes E, Moreno F, Heinisch JJ, Rodicio R: Isocitrate lyase of the yeast Kluyveromyces lactis is subject to glucose repression but not to catabolite inactivation. Curr Genet. 2004, 44: 305-316. 10.1007/s00294-003-0453-9.View ArticlePubMedGoogle Scholar
- Zeeman AM, Kuyper M, Pronk JT, van Dijken JP, de Steensma HY: Regulation of pyruvate metabolism in chemostat cultures of Kluyveromyces lactis CBS 2359. Yeast. 2000, 16: 611-620. 10.1002/(SICI)1097-0061(200005)16:7<611::AID-YEA558>3.0.CO;2-Z.View ArticlePubMedGoogle Scholar
- Breunig KD, Steensma HY: Kluyveromyces lactis: genetics, physiology, and application. 2003, Functional Genetics of Industrial Yeasts, Yeasts, 171-205.Google Scholar
- Tarrío N, Díaz Prado S, Cerdán ME, González Siso MI: The nuclear genes encoding the internal (KlNDI1) and external (KlNDE1) alternative NAD(P)H:ubiquinone oxidoreductases of mitochondria from Kluyveromyces lactis. Biochim Biophys Acta. 2005, 1707: 199-210. 199.View ArticlePubMedGoogle Scholar
- Tarrío N, Becerra M, Cerdán ME, González Siso MI: Reoxidation of cytosolic NADPH in Kluyveromyces lactis. FEMS Yeast Res. 2006, 6: 371-380. 10.1111/j.1567-1364.2005.00021.x.View ArticlePubMedGoogle Scholar
- van Urk H, Postma E, Scheffers WA, van Dijken JP: Glucose transport in crabtree-positive and crabtree-negative yeasts. J Gen Microbiol. 1989, 135: 2399-2406.PubMedGoogle Scholar
- Zeeman AM, Luttik MA, Thiele C, van Dijken JP, Pronk JT, Steensma HY: Inactivation of the Kluyveromyces lactis KlPDA1 gene leads to loss of pyruvate dehydrogenase activity, impairs growth on glucose and triggers aerobic alcoholic fermentation. Microbiol (Reading, England). 1998, 144 (Pt 1): 3437-3446.View ArticleGoogle Scholar
- Ozier-Kalogeropoulos O, Malpertuy A, Boyer J, Tekaia F, Dujon B: Random exploration of the Kluyveromyces lactis genome and comparison with that of Saccharomyces cerevisiae. Nucleic Acids Res. 1998, 26: 5511-5524. 10.1093/nar/26.23.5511.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.