A nuclear magnetic resonance based approach to accurate functional annotation of putative enzymes in the methanogen Methanosarcina acetivorans
© Chen et al; licensee BioMed Central Ltd. 2011
Published: 15 June 2011
Correct annotation of function is essential if one is to take full advantage of the vast amounts of genomic sequence data. The accuracy of sequence-based functional annotations is often variable, particularly if the sequence homology to a known function is low. Indeed, recent work has shown that even proteins with very high sequence identity can have different folds and functions, and therefore caution is needed in assigning functions by sequence homology in the absence of experimental validation. Experimental methods are therefore needed to efficiently evaluate annotations in a way that complements current high throughput technologies. Here, we describe the use of nuclear magnetic resonance (NMR)-based ligand screening as a tool for testing functional assignments of putative enzymes that may be of variable reliability.
The target genes for this study are putative enzymes from the methanogenic archaeon Methanosarcina acetivorans (MA) that have been selected after manual genome re-annotation and demonstrate detectable in vivo expression at the level of the transcriptome. The experimental approach begins with heterologous E. coli expression and purification of individual MA gene products. An NMR-based ligand screen of the purified protein then identifies possible substrates or products from a library of candidate compounds chosen from the putative pathway and other related pathways. These data are used to determine if the current sequence-based annotation is likely to be correct. For a number of case studies, additional experiments (such as in vivo genetic complementation) were performed to determine function so that the reliability of the NMR screen could be independently assessed.
In all examples studied, the NMR screen was indicative of whether the functional annotation was correct. Thus, the case studies described demonstrate that NMR-based ligand screening is an effective and rapid tool for confirming or negating the annotated gene function of putative enzymes. In particular, no protein-specific assay needs to be developed, which makes the approach broadly applicable for validating putative functions using an automated pipeline strategy.
Protein functions are annotated in genomic databases using automated routines that search for sequence homology to a gene product with an established function. The accuracy of these sequence-based annotations is often variable, particularly if the sequence identity to a known function is low. Indeed, recent work has shown that even proteins with very high sequence identity can have different folds and functions [1–3], and therefore caution is needed in assigning functions simply by sequence homology in the absence of experimental validation. Traditional experimental approaches to determine function, such as enzyme assays, are slow and painstaking and have not kept pace with the ever-growing body of genome sequence data, which contains many genes of unconfirmed or undetermined function. Clearly, more efficient methods for accurate, experiment-based annotation and validation of function are needed.
One area where there is a strong demand for functional annotation is the large number of putative enzymes identified from structural genomics and other efforts (e.g. [4–7]). Methods for rapidly establishing small molecule substrate or product specificity of putative enzymes are likely to be extremely useful on two levels. Firstly, they would allow efficient testing of functional assignments that may be of variable reliability. Secondly, such approaches may be extended to the characterization of partially assigned enzymatic functions like those annotated from structural genomics efforts. This article discusses a method that is applicable to the first level of testing current sequence-based annotations of enzymatic function. An NMR-based approach is described for identifying potential substrates or products of enzymes in vitro.
For the goal here of developing rapid approaches for annotating putative enzymes, we adopted a ligand-based NMR screening strategy. In our hands, the most consistent results were obtained using the waterLOGSY (water-ligand observed via gradient spectroscopy) pulse sequence. This method was originally developed for ligand screening of drug targets and is amenable to a pipeline approach. The NMR experiment is based on magnetization transfer between ligand and water molecules. In the presence of a protein that binds to the ligand, there are two competing flows of magnetization: 1) from water to the free ligand and 2) from bound water (via the protein) to the bound ligand. These two flows lead to opposite signs of the NOEs (nuclear Overhauser enhancements) between water and the ligand. The stronger magnetization flow determines the sign of the waterLOGSY peak. Compounds that bind the protein give positive peaks, whereas peaks from non-binding compounds are negative in the waterLOGSY spectrum. Since exchangeable protons (e.g. hydroxyl or amino group protons) also appear as positive peaks in waterLOGSY spectra, these need to be identified and distinguished from the peaks due to protein binding. This is readily achieved by recording a reference spectrum of the sample in which the water signal is saturated. Through chemical exchange, the labile OH and NH protons are also saturated and their peak intensities are greatly decreased, allowing straightforward distinction of peaks due to binding.
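The sign-based reading of a waterLOGSY spectrum described above can be summarized in a short Python sketch. It is only an illustration of the decision logic, not the processing used in this study: the function names and the boolean flag `attenuated_in_reference` (set by the analyst when a peak is strongly reduced in the water-saturated reference spectrum) are hypothetical.

```python
def classify_peak(logsy_intensity, attenuated_in_reference):
    """Label one ligand peak from its signed waterLOGSY intensity.

    attenuated_in_reference is a hypothetical flag: True when the peak is
    greatly decreased in the water-saturated reference spectrum, which marks
    it as an exchangeable (OH/NH) proton rather than evidence of binding.
    """
    if attenuated_in_reference:
        return "exchangeable"
    if logsy_intensity > 0:
        return "binding"
    if logsy_intensity < 0:
        return "non-binding"
    return "indeterminate"


def compound_binds(peaks):
    """Return True if any non-exchangeable peak of the compound is positive.

    peaks is a list of (logsy_intensity, attenuated_in_reference) tuples.
    """
    return any(classify_peak(i, a) == "binding" for i, a in peaks)


# Illustrative values only (not measured data).
print(compound_binds([(0.9, True), (-0.3, False)]))  # positive peak is exchangeable
print(compound_binds([(0.4, False), (0.9, True)]))   # one genuine positive peak
```

In this sketch a compound whose only positive peaks are exchangeable protons is not counted as a binder, which mirrors the role of the reference spectrum in the text.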
The case studies below illustrate how this method can be used to identify the chemical structures of potential substrates or products for putative enzyme proteins. The functional assignments were further supported by additional experiments (e.g. genetic complementation, NMR-based enzyme assays). In all of the examples studied, we find that the initial NMR screen is indicative of whether the functional annotation is correct.
Genes from the metabolically diverse methanogenic archaeon, Methanosarcina acetivorans, were chosen for this study [10–12]. Methane-producing organisms are of interest because they provide an efficient and cost-effective biofuel which is self-harvesting and can be distributed readily using existing infrastructure. As with other genomes, however, accurate functional annotation of methanogens lags significantly behind the large body of sequence data, representing a sizable gap in understanding of the biology of these organisms. This project was initiated by updating functional annotations for over 700 of the 4721 predicted genes in the MA genome. This was done by transferring many of the recently revised manual annotations in the closely related species M. burtonii to homologous genes in MA. In combination, a thorough literature search was conducted for published data that experimentally confirm the functionality of MA genes and closely related orthologs in other species. A complete list of revised MA annotations is provided in Additional file 1 (also available at http://ibbr.umd.edu/g2f) with summary statistics in Additional file 2.
By analogy with the M. burtonii re-annotation, confidence levels were given to each re-annotation based on current literature as follows: Level 1: An exact match in the literature with an experimentally defined function. Level 2: Gene product contains all domains needed for enzymatic function with ≥35% sequence identity to a gene product of known function. Level 3: Gene product contains all domains needed for enzymatic function but <35% sequence identity to a gene product of known function. Level 4: Gene product has no experimental match but some domain similarities to a known function are recognizable. Level 5: Has no experimental match or domain similarities (i.e. annotated as hypothetical). This provided a list from which targets with varying confidence levels were selected for experimental validation using our pipeline approach.
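The five confidence levels can be expressed as a small decision function. The Python sketch below is an illustration only: the function name and its arguments are hypothetical, and it assumes that a sequence identity of at least 35% maps to Level 2 and anything lower to Level 3.

```python
def confidence_level(exact_literature_match,
                     all_domains_present,
                     identity_percent=0.0,
                     some_domain_similarity=False):
    """Assign a re-annotation confidence level (1 = highest, 5 = hypothetical).

    All argument names are hypothetical; they stand for the criteria listed
    in the text, evaluated by a curator for one gene product.
    """
    if exact_literature_match:
        return 1
    if all_domains_present:
        # Assumption: the 35% identity cut-off is inclusive for Level 2.
        return 2 if identity_percent >= 35.0 else 3
    if some_domain_similarity:
        return 4
    return 5


# MA3706 shares 47% identity with Mj0226; assuming all required domains are
# present and no exact literature match, this sketch would give Level 2.
print(confidence_level(False, True, identity_percent=47.0))
```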
The two main selection criteria were 1) the protein should have a putative enzymatic activity on a small molecule substrate and 2) the protein should be non-membranous based on amino acid sequence analysis. Additional characteristics that were preferable but not absolutely required were that the gene product was expressed in vivo in MA based on published reports and that an E. coli homolog exists for potential genetic complementation studies. A total of 44 MA targets were cloned, of which 27 were found to express soluble protein in E. coli. We describe here a number of these as case studies to illustrate our generalized approach.
Thus the NMR and gene complementation data are both consistent with the isocitrate dehydrogenase function, but do not support the 3-isopropylmalate dehydrogenase or tartrate dehydrogenase annotations.
MA3706 is annotated as a putative Ham1 protein in the IMG database and our re-annotation process does not change this annotation. Ham1 proteins are nucleoside triphosphatases that are hypothesized to catalyze the hydrolysis of non-standard nucleoside triphosphates (NTPs) to nucleoside monophosphates as a mechanism for preventing their incorporation into DNA and RNA. In particular, they are thought to target the oxidatively modified inosine and xanthosine triphosphates. Our annotation of MA3706 is based on homology with Mj0226 from Methanococcus jannaschii [20, 21]. MA3706 and Mj0226 share 47% sequence identity and the latter has been shown to preferentially hydrolyze xanthosine triphosphate (XTP) and deoxyinosine triphosphate (dITP) over other canonical nucleoside triphosphates. We therefore tested whether MA3706 interacts with nucleotides in a similar way by screening a series of standard and modified NTPs from our small molecule library for binding with MA3706.
Thus the NMR ligand screening approach provides a very efficient means for identifying the nucleotide binding preferences of MA3706. Further, once binding specificity was established, the enzymatic activity was detected directly in the NMR sample without the need for involved assays.
Using the approach described, a number of other MA gene product annotations were also investigated. Table 1 summarizes the genes that were studied. The experimental data fall into three categories. In several examples (MA0940, MA2498, MA3520, MA3706) the data are consistent with the putative biochemical function and therefore provide increased confidence in the existing annotation. This sometimes occurs even when the sequence homology to an ortholog of known function is not very high (e.g. MA2498). Other examples such as MA4265 show that the existing annotations are only partially correct. Here, the experimental data suggest that the function assignment needs to be narrowed. A third category contains genes where the present functional assignment is not supported by our experimental screening procedure. For example, MA0154 is currently annotated as biotin synthase in the IMG database, but our NMR-based ligand screening of the gene product did not detect binding to the putative substrate, product, or any other compounds in the biotin pathway (data not shown). A report published subsequent to our testing showed that this gene is in fact involved in pyrrolysine biosynthesis. This highlights another general problem with regard to assignment of function: database entries are sometimes not updated after the initial annotation. Nevertheless, NMR screening was quickly able to detect that the ligand binding results were not consistent with the IMG annotation, indicating that this function assignment was likely to be incorrect.
Correct annotation of function is essential if one is to take full advantage of the vast amounts of genomic sequence data. Incorrect assignment of function is propagated by comparative annotation with mis-annotated genes and can potentially lead to mis-placed experimental efforts. Conversely, a corrected annotation in one organism can provide tremendous leverage in re-annotation of orthologs from a diverse phylogeny of organisms. Experimental methods are therefore needed to efficiently evaluate annotations in a way that complements current high throughput sequence homology-based techniques.
Table 1. Summary of MA genes studied
Experimental data consistent with IMG annotation?
Revised annotation is to pyrrolysine biosynthesis protein
Revised annotation is 3-octaprenyl-4-hydroxybenzoate carboxy-lyase
Revised annotation is alpha-ribazole-5’-phosphate phosphatase and experimental data is consistent with this
Revised annotation is fumarate hydratase/tartrate dehydratase but experimental data is only consistent with the former
No change in annotation
Ham1 protein/nucleotide triphosphatase
No change in annotation
Isocitrate/isopropylmalate dehydrogenase family protein
Revised annotation is isocitrate/isopropylmalate dehydrogenase but the experimental data is only consistent with the former
For the examples described here, where there is some pre-existing annotation of putative function, the ligand screening is targeted and generally involves fewer than 20 compounds per protein. Where even less is known about gene function (e.g. a “putative methyltransferase” annotation), a larger number of compounds will need to be screened. However, it is possible to develop a suite of compounds for screening in an automated fashion. We use a 24-sample robot for most screening applications, with automated sample change, shimming, acquisition and processing. Typically we use 1-5 compounds per protein sample depending on how many compounds need to be screened. Pooled compounds need to have at least one resolvable 1H NMR signal and be structurally as diverse as possible to minimize the chance of competition for binding. One limitation of this approach is that the most relevant compounds for testing may sometimes not be commercially available. Nevertheless, as demonstrated in this report, structural analogs can often be used to gain insights into the types of small molecules recognized by the gene product even when the exact substrate or product is not readily available (e.g. MA0940).
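The pooling constraint described above (up to five compounds per sample, each with at least one resolvable 1H signal) can be sketched as a simple greedy assignment. This Python example is an illustration under stated assumptions, not the procedure used in this study: each compound is represented by a single hypothetical marker chemical shift, the 0.05 ppm minimum separation is an arbitrary illustrative value, and structural diversity within a pool is not modeled.

```python
def pool_compounds(marker_shifts, max_per_pool=5, min_separation_ppm=0.05):
    """Greedily group compounds so that marker peaks in a pool stay resolved.

    marker_shifts maps a compound name to one hypothetical 1H chemical shift
    (ppm) chosen as its marker signal. A compound joins the first pool that
    has room and in which its marker is at least min_separation_ppm away
    from every marker already present; otherwise it starts a new pool.
    """
    pools = []
    for name, shift in marker_shifts.items():
        for pool in pools:
            resolved = all(
                abs(shift - marker_shifts[other]) >= min_separation_ppm
                for other in pool
            )
            if len(pool) < max_per_pool and resolved:
                pool.append(name)
                break
        else:
            pools.append([name])
    return pools


# Placeholder compounds with made-up marker shifts (not measured values).
library = {"cmpd_A": 8.52, "cmpd_B": 8.12, "cmpd_C": 8.50, "cmpd_D": 7.95}
print(pool_compounds(library))
```

Here cmpd_C is placed in a separate pool because its marker lies within 0.05 ppm of that of cmpd_A. A greedy assignment is not guaranteed to give the minimum number of samples, but it is easy to audit by eye.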
In principle, the approach described here is applicable to putative enzymes with completely undefined substrate specificity. Further studies coupling NMR-screening with other methods such as mass spectrometry-based metabolite profiling will be needed to determine functions for the large numbers of such putative enzymes that are currently poorly defined.
Target genes were PCR-amplified from isolated MA genomic DNA using primer sets listed in Additional file 3. Invitrogen’s Platinum® Pfx DNA Polymerase protocol was followed. Several of the PCR products were treated with Taq DNA polymerase (PE Applied Biosystems) and 2.5 μmol dATP (Roche) for cloning into the Invitrogen pCR4-TOPO vector. TOPO ligations were transformed into DH5α competent E. coli cells (Invitrogen), selected on LB-ampicillin, and sequenced. These clones were then used for subsequent cloning into the pET-21a vector (Novagen). For other clones, the PCR products were cloned directly into the pET vector and sequenced. All constructs contained a C- or N-terminal His6-tag, introduced via the PCR primers.
MA proteins were over-expressed in E. coli BL21(DE3) Rosetta cells (Invitrogen) transformed with the plasmid constructs containing the target MA genes. Optimal temperatures for expression were determined using small-scale (10 mL) trial LB cultures at 16°C, 25°C or 37°C in the presence of antibiotic. If soluble expression could not be obtained at any of these temperatures, expression at 10-13°C with ArcticExpress cells (Stratagene) for 24 hours was also attempted. In a number of cases this produced soluble protein.
For NMR studies, a 1 L culture was incubated at 37°C until the OD600 reached 0.4-0.8. The temperature was then adjusted to the pre-determined optimum and expression was induced with 1 mM IPTG. Typical expression times ranged from 5-24 hours. Cells were harvested by centrifugation (3500g, 30 min) and the pellet was re-suspended in binding buffer (10 mM imidazole, 300 mM sodium chloride, 50 mM sodium phosphate, pH 8.0). The cells were then lysed by sonication and centrifuged (35000g, 1 hr). The supernatant was loaded on a Ni-NTA-Agarose column (Qiagen) and purified with an imidazole gradient using standard procedures. Pure fractions were combined and dialysed against a standard buffer for NMR samples (50 mM sodium phosphate, 100 mM sodium chloride, pH 7.0).
NMR experiments were acquired at 5°C and 25°C on a Bruker DMX-600 spectrometer equipped with either a Z-axis gradient cryoprobe or a conventional 3-axis gradient probe. The typical protein concentration used for NMR experiments was in the 10-50 μM range. Initial test compound concentrations were set at ten times the protein concentration. This allowed detection of binding in the 0.1 μM to hundreds of μM range. If higher compound-to-protein ratios (e.g. 100:1) are used, then only the tightest micromolar binders are detected. Thus the stringency of the experiment can be controlled by the compound-to-protein ratio. Test compounds were generally prepared as 50 or 100 mM stock solutions in d6-DMSO or water and diluted appropriately into NMR samples. Compounds were obtained from Sigma-Aldrich.
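The dilution from stock solution to NMR sample follows directly from the stated concentrations. The Python sketch below works through that arithmetic; the function name is hypothetical and the 500 μL sample volume is an assumed illustrative value that is not given in the text.

```python
def stock_volume_ul(protein_uM, compound_to_protein_ratio, stock_mM,
                    sample_volume_ul):
    """Volume of compound stock (μL) needed to reach the target ratio.

    Uses C1 * V1 = C2 * V2 with the stock converted from mM to μM.
    The small volume change caused by adding the stock is ignored.
    """
    target_uM = protein_uM * compound_to_protein_ratio
    return target_uM * sample_volume_ul / (stock_mM * 1000.0)


# 50 μM protein at a 10:1 compound-to-protein ratio (500 μM compound),
# from a 100 mM stock, in an assumed 500 μL sample.
print(stock_volume_ul(50.0, 10.0, 100.0, 500.0))  # 2.5
```

With these numbers the d6-DMSO carried into the sample stays at about 0.5% by volume.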
One-dimensional 1H NMR waterLOGSY experiments were acquired using established protocols. A reference experiment was collected followed by the waterLOGSY magnetization transfer spectrum. Typical acquisition parameters for the waterLOGSY spectra were 256-512 transients with a mixing time of 1.5 s and a 2 s relaxation delay. Using these parameters each experiment took 15-30 minutes to acquire. NMR spectra were processed using Bruker Topspin software (version 1.3) and analyzed by electronically overlaying reference and waterLOGSY spectra in dual display mode. A Bruker NMRCase sample changer robot controlled by ICON-NMR software was used for automated sample change, shimming, data acquisition (reference and waterLOGSY), and processing.
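The quoted experiment times are consistent with the acquisition parameters, as the following Python sketch shows. It is a rough estimate only: the function name is hypothetical, and the per-scan overhead (FID acquisition time, pulses, dummy scans), which is not given in the text, is exposed as an assumed `other_s` parameter defaulting to zero.

```python
def experiment_minutes(transients, mixing_s=1.5, relaxation_s=2.0, other_s=0.0):
    """Approximate duration of one waterLOGSY experiment in minutes.

    Each transient is taken to last mixing time + relaxation delay, plus an
    optional per-scan overhead (other_s, an assumption not stated in the text).
    """
    return transients * (mixing_s + relaxation_s + other_s) / 60.0


for n in (256, 512):
    print(n, round(experiment_minutes(n), 1))
```

For 256 and 512 transients this gives roughly 15 and 30 minutes, matching the range reported above.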
Genetic complementation of E. coli mutants by the MA genes was performed with E. coli deletion mutants generated in the BW25113 background. The specific mutant strains used in this study were: JW1122 (icd), JW5807 (leuB), and JW1789 (yeaU). All mutant strains and the isogenic wild-type control strain were lysogenized with lambda-DE3 (lambda-DE3 lysogenization kit; Novagen). Each strain was transformed with a non-recombinant control pET-21a plasmid (Novagen) and a recombinant pET-21a plasmid carrying a cloned MA gene. Transformants were selected on medium containing 100 µg/mL ampicillin; the medium also contained 50 µg/mL kanamycin to maintain selection of the deletion mutant allele.
Complementation was demonstrated on 1.5% agar plates by streaking selected transformants from LB medium to M9 minimal agar media containing 100 µg/mL ampicillin in the presence or absence of 0.1 mM IPTG, and grown at 37°C. Glucose was used as the carbon source with the exception of the experiment shown in Figure 2c, where the wild-type control and yeaU mutants were plated on 2 g/L D-malate as the sole carbon source.
Growth curves were produced by growing strains in liquid cultures in 48-well culture plates that were incubated at a constant temperature of 37°C. The medium consisted of M9 minimal media with glucose as the carbon source, containing 100 µg/mL ampicillin in the presence or absence of 0.1 mM IPTG. The liquid cultures were inoculated to an OD600 of ~0.05 from an inoculum culture grown overnight in LB with the appropriate antibiotics. Cells were collected by centrifugation, washed, and re-suspended in M9 media prior to inoculation. A Synergy HT Multi-Detection Microplate Reader (BioTek) was used to measure the OD600 of all wells at 30-minute intervals.
The binding interaction between ITP and MA3706 was quantified using a Microcal VP Titration Calorimeter. The protein was dialyzed into buffer containing 50 mM sodium chloride, 100 mM sodium phosphate (pH 7.0). ITP (Sigma) was dissolved in the protein dialysis buffer at a concentration of 1.0 mM. Five microliters of ITP were injected into a 50 μM solution of MA3706 every 5 min until MA3706 was saturated.
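The progress of the titration can be estimated from the stated concentrations and injection volume. The Python sketch below is a simplified illustration: the function name is hypothetical, the calorimeter cell volume is not given in the text (roughly 1.4 mL is used here as an illustrative default), and dilution and displacement of the cell contents during the titration are ignored.

```python
def molar_ratio_after(n_injections, injection_ul=5.0, ligand_mM=1.0,
                      protein_uM=50.0, cell_volume_ul=1400.0):
    """Cumulative ligand:protein molar ratio after n injections.

    cell_volume_ul is an assumed value, not taken from the text.
    """
    ligand_nmol = n_injections * injection_ul * ligand_mM   # μL x mM = nmol
    protein_nmol = protein_uM * cell_volume_ul / 1000.0     # μM x μL = pmol
    return ligand_nmol / protein_nmol


# Under the assumed cell volume, 14 injections reach a 1:1 ratio of ITP to MA3706.
print(molar_ratio_after(14))  # 1.0
```

Under these assumptions each injection adds 5 nmol of ITP to about 70 nmol of protein, so a titration carried through to saturation would be expected to need noticeably more than 14 injections.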
JO set up NMR experiments and the small molecule library, interpreted NMR and ITC data, participated in the manual re-annotation, and drafted the manuscript. YC expressed and purified proteins, helped to acquire NMR data, and performed the calorimetry experiments. EA cloned MA genes and participated in the manual re-annotation. LB assisted in setting up the genetic complementation experiments. ZK designed strategies for cloning MA genes, assisted in the manual re-annotation, and participated in drafting the manuscript. ZL cloned MA genes and assisted in the manual re-annotation. BJN supervised the genetic complementation experiments and participated in drafting the manuscript. LS performed the genetic complementation study. KS coordinated cloning and manual re-annotation, and participated in drafting the manuscript.
This research is supported by the Office of Science (BER), U.S. Department of Energy, Grant Number DE-FG02-07ER64502, and an equipment grant from the W. M. Keck Foundation. We also wish to thank Dr. Dennis Maeder (SAIC-Frederick Inc., NCI) for providing spread-sheets lining up orthologs between M. acetivorans and M. burtonii.
This article has been published as part of BMC Genomics Volume 12 Supplement 1, 2011: Validation methods for functional genome annotation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.