Filtering "genic" open reading frames from genomic DNA samples for advanced annotation
© D'Angelo et al; licensee BioMed Central Ltd. 2011
Published: 15 June 2011
In order to carry out experimental gene annotation, DNA encoding open reading frames (ORFs) derived from real genes (termed "genic") in the correct frame is required. When genes are correctly assigned, isolation of genic DNA for functional annotation can be carried out by PCR. However, not all genes are correctly assigned, and even when correctly assigned, gene products are often incorrectly folded when expressed in heterologous hosts. This is a problem that can sometimes be overcome by the expression of protein fragments encoding domains, rather than full-length proteins. One possible method to isolate DNA encoding such domains would be to "filter" complex DNA (cDNA libraries, genomic and metagenomic DNA) for gene fragments that confer a selectable phenotype relying on correct folding, with all such domains present in a complex DNA sample termed the “domainome”.
In this paper we discuss the preparation of diverse genic ORF libraries from randomly fragmented genomic DNA using ß-lactamase to filter out the open reading frames. By cloning DNA fragments between leader sequences and the mature ß-lactamase gene, colonies can be selected for resistance to ampicillin, conferred by correctly folded ß-lactamase. Our experiments demonstrate that the majority of surviving colonies contain genic open reading frames, suggesting that ß-lactamase is acting as a selectable folding reporter. Furthermore, different leaders (Sec, TAT and SRP), normally translocating different protein classes, filter different genic fragment subsets, indicating that their use increases the fraction of the “domainome” that is accessible.
The availability of ORF libraries, obtained with the filtering method described here, combined with screening methods such as phage display and protein-protein interaction studies, or with protein structure determination projects, can lead to the identification and structural determination of functional genic ORFs. ORF libraries represent, moreover, a useful tool to proceed towards high-throughput functional annotation of newly sequenced genomes.
Advances in sequencing technologies have led to an explosion of large-scale sequencing projects: as of January 2011, 1331 bacterial genomes had been successfully sequenced, with another 4424 genomes either unfinished or in assembly phases (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Since sequencing is no longer an issue, the real challenge is understanding how DNA sequence leads to the specific phenotype of an organism. A key step in this process is the annotation of genes encoding proteins that contribute to particular functions. Since the completion of the first bacterial genome in 1995, this process has usually been carried out automatically using ab initio, homology, or combination approaches. Most gene structures are based on computational predictions, and annotation is based on homology. However, automated annotation can be incorrect when sequence similarity is not associated with functional similarity, or when reference databases contain incorrect annotations, a problem that is estimated to afflict up to 49% of genes in public databases [3–6]. Gene function is originally assigned where there is homology with related genes whose activity has been determined experimentally. However, second and third generation annotations, as well as associated errors, are prevalent.
Experimental information related to protein function is necessary, but far more difficult to obtain at a whole genome level. Genes must be informatically identified before they can be cloned, expressed, and tested for function, and the study of a whole recombinant proteome relies on the cloning of all Open Reading Frames (ORFs) in a genome. These “ORFeome” collections, as they are termed, require huge efforts in terms of time and resources. Once ORFs are cloned, recombinatorial cloning systems allow relatively straightforward transfer between different vectors for genome scale projects. However, even after correct gene identification and cloning, challenges are still present in terms of full-length protein expression and purification, with as few as 30% of proteins expressed solubly in E. coli at sufficient levels to be experimentally useful [8, 9].
Functional annotation would be greatly facilitated if genomic DNA could be used directly without the need to create ORFeome resources. The fact that proteins generally contain multiple domains, each of which contributes to a distinct function, provides a potential mechanism by which this can be carried out. Once generated, a protein domain library will be useful for many purposes. Applications such as structural studies, antibody generation, protein/substrate binding analyses, domain shuffling for enzyme evolution and protein chips will all benefit from a library of well folded protein domains.
An analysis of the length distribution of protein domains reveals that most range from 50 to 200 aa, with a peak at around 100 aa. We speculate that fragmenting a whole (intronless) genome into DNA fragments of 200-800 bp should provide a broad representation of all the protein domains from a single species, a polypeptide population that has been termed the “domainome”. The availability of genome sequences, coupled with the ability to experimentally determine the function of a domain, rather than a full-length protein, could also provide a simpler method to annotate genes on the basis of specific function.
In order to develop a gene annotation method based on the function of individual domains on a genomic scale, a random approach is required in order to avoid bias towards what is already known. From this perspective, randomly fragmented genomic DNA represents a good DNA source for intronless organisms. Unfortunately, the use of randomly generated DNA fragments suffers from several problems: i) random fragmentation yields non-ORFs (fragments containing suppressible stop codons) and non-genic ORFs (alternative ORFs in a frame other than the original frame of the gene, as opposed to genic ORFs); non-genic ORFs translate into polypeptides with no biological meaning; ii) folding failure occurs even for correctly identified and cloned ORFs, thus impairing their function; iii) proteins or protein fragments which fold in different cellular compartments can be affected by recombinant expression in inappropriate redox, or chaperone, environments.
To address these issues, we [12–14], along with others [15–17], have demonstrated that folding reporters can be very useful tools. The principle under which they operate is that a poorly folded test protein can adversely affect the folding of a “reporter” protein to which it is fused, by trapping it in a non-functional or aggregated state. Moreover, when the folding reporter has an easily identifiable phenotype (e.g. antibiotic resistance, fluorescence, or color complementation), rescuing those clones expressing properly folded and soluble ORFs becomes a relatively straightforward process.
While it might be expected that any ORF (genic or non-genic) would confer the positively selectable phenotype (fluorescence, antibiotic resistance), we recently observed that when cDNA fragments are cloned upstream of the folding reporter, a selection for fragments from real genes tends to occur, and when a plasmid containing four known genes was fragmented and placed upstream of a folding reporter, over 80% of selected DNA fragments were genic ORFs.
Proper folding of a polypeptide depends both on the amino acid composition and on extrinsic factors, such as chaperones and redox conditions. Consequently, correct folding depends on the protein being expressed in the appropriate cellular compartment.
In bacteria, most proteins cross the inner membrane by means of the type II secretory system. This comprises three different pathways: the Sec-dependent pathway, the signal recognition particle (SRP) pathway, and the twin-arginine translocation (TAT) pathway. While the Sec pathway exports unfolded proteins, the SRP and TAT pathways export co- and post-translationally folded proteins, respectively. In this paper, we show how the use of three different β-lactamase filtering vectors, each exploiting one of the different export systems, allows broader representation of domains that can be filtered from a genome.
In order to demonstrate the feasibility of the filtering method applied at the whole genome level, we chose Clostridium thermocellum as a model organism. The general interest in this anaerobic bacterium stems from its extraordinary ability to metabolize plant cell wall polysaccharides by means of a large secreted multiprotein complex. Its components have a typical multidomain structure where each domain has a defined function (e.g. anchoring onto substrate, anchoring on bacterial membrane, adaptor domains, catalytic domains). Indeed, the fact that its genome has been recently sequenced (GenBank ID: CP000568.1), but not completely annotated, makes it a good candidate for domain-based functional annotation purposes.
In this paper we describe the preparation and characteristics of filtered domain libraries prepared from genomic DNA, using libraries created from the C. thermocellum genome as a model system.
Due to the relatively high amount of starting material required to generate a genomic library, C. thermocellum genomic DNA (gDNA) was used as a template for multiple displacement amplification (MDA) in order to obtain 10-20 µg of DNA for subsequent fragmentation. Fragmentation was carried out using nebulisation, and conditions were optimized to obtain a fragment distribution in the range of 200-800 bp. Such a length range was intended to be optimal for the statistical representation of all the domains in the genome. gDNA was cloned as blunt end fragments into three filtering vectors, in which an EcoRV cloning site is located between a Sec (POS), SRP (SOS) or TAT (TOS) leader sequence and the mature ß-lactamase. The three final vectors, carrying the chloramphenicol resistance gene as a selective marker, have the SV5 and 6xHis tags downstream of the cloned gDNA fragment (see Figure 1, panel A). The effect of growing clones containing the gDNA library on different concentrations of ampicillin is depicted in Figure 1, panel B: only gDNA fragments that are in frame with both the secretion leader and the β-lactamase gene (ORFs) can survive the selective pressure of higher ampicillin concentrations. Among surviving clones, those that encode well-folding fragments would be expected to survive at higher ampicillin concentrations, as they would be expected to allow greater amounts of functional β-lactamase to accumulate.
Fragment length was provisionally assessed by PCR analysis of random clones, and the average length was determined to be around 400 bp. Considering the starting diversity of the non-filtered libraries (4 × 10⁶ for SOS and 1 × 10⁷ for POS and TOS), a statistical coverage of 400 to 1000 fold of the 3.8 Mb genome was obtained. With such coverage, we expected that, after ORF filtration, the final library diversity would remain sufficient to represent all C. thermocellum genes.
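The in-frame requirement described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in this work: the function name and the `required_remainder` parameter are hypothetical, the insert length (modulo 3) that restores the β-lactamase frame depends on vector details not given here, and all three stop codons are treated as terminating, which ignores suppression of some stop codons by the host.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def keeps_reporter_in_frame(insert: str, required_remainder: int = 0) -> bool:
    """Return True if an insert could be read through into the reporter.

    insert: DNA sequence of the cloned fragment, read in the leader's frame.
    required_remainder: len(insert) % 3 needed to keep the downstream
        reporter in frame (ASSUMPTION: depends on the vector design).
    """
    seq = insert.upper()
    # The insert length must preserve the leader -> reporter reading frame.
    if len(seq) % 3 != required_remainder:
        return False
    # Walk complete codons from the first base and reject any stop codon.
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return all(codon not in STOP_CODONS for codon in codons)


if __name__ == "__main__":
    for fragment in ("ATGGCTGCA", "ATGTAAGCA", "ATGGCTGC"):
        print(fragment, keeps_reporter_in_frame(fragment))
```

A fragment that passes this check is only an ORF; whether it is genic, and whether it folds, is what the ampicillin selection is intended to address.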
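The coverage figures quoted above follow from simple arithmetic: number of clones multiplied by the mean insert length, divided by the genome size. A minimal sketch, using only the values given in the text (the function and variable names are illustrative):

```python
def fold_coverage(n_clones: float, mean_insert_bp: float, genome_bp: float) -> float:
    """Total cloned bases divided by genome length."""
    return n_clones * mean_insert_bp / genome_bp


GENOME_BP = 3.8e6        # C. thermocellum genome size (3.8 Mb)
MEAN_INSERT_BP = 400     # average insert length estimated by PCR

# Starting diversity of the non-filtered libraries.
library_sizes = {"SOS": 4e6, "POS": 1e7, "TOS": 1e7}

for name, n_clones in library_sizes.items():
    coverage = fold_coverage(n_clones, MEAN_INSERT_BP, GENOME_BP)
    print(f"{name}: ~{coverage:.0f}-fold")
```

This gives roughly 420-fold for the SOS library and roughly 1050-fold for the POS and TOS libraries, in line with the stated 400 to 1000 fold range.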
In order to confirm the hypothesis that some colonies can survive at higher ampicillin concentration because a greater amount of soluble and functional β-lactamase fusion protein is produced, we tested the β-lactamase activity of random clones grown on agar plates containing different concentrations of ampicillin.
As 454 sequencing normally introduces errors, it was impossible to accurately determine the percentages of genic or non-genic ORFs that were filtered. In order to overcome this, we increased the stringency of the sequence analysis by creating a data set of “perfect sequences” with no sequencing errors and a 100% match to the genome reference sequence (RefSeq). This procedure led to the identification of a data set of 10789 perfect sequences, with the same distribution pattern as the general Sec library (see Figure 5, panel B), thus indicating that no bias was introduced when increasing the stringency of the sequence analysis. Within the “perfect match” set of sequences, the reading frame could be correctly and unambiguously assigned (see Figure 5, panel C for more details): 73.5% of reads corresponded to genic open reading frames, a much higher percentage than the expected 17% (1 fragment out of 6 possible frames is expected to be in the same frame as the corresponding gene by chance). These data indicate that β-lactamase acts as a folding reporter, thus pushing the selection towards ORFs with biological meaning.
In a previous paper we fragmented a plasmid containing four genes (and 62 non-genic open reading frames greater than 150 bp), and used the ß-lactamase approach described here to “filter” out putative open reading frames. We discovered that all filtered clones were open reading frames and 84% of these were derived from real genes, as opposed to random open reading frames. In the work described here, we have extended these results to the analysis of a full genome. This was carried out in two parts. In the first part the genome coverage for each of the three libraries (Sec, SRP and TAT leaders) was assessed using the obtained sequences. The results in Figure 4 showed that almost all of the annotated genes in C. thermocellum were represented by at least one read in the library. Furthermore, the use of different leaders increased the genome coverage. Most genes (1938, or 61%) were found in all three libraries. The Sec leader provided the greatest number of different genes (2712 genes or 85% of the total), and the addition of the TAT and SRP leaders provided an additional 352 (11%) different genes, for a total of 96% of all genes, or almost complete genome coverage. The goal of the second analysis was to determine the percentage of open reading frames, and where they were derived from. Given the tendency of 454 sequencing to introduce errors, it was impossible to carry this out with the raw 454 sequences. This was overcome by compiling a set of “perfect sequences” that had a 100% match to the genome sequence at both ends, allowing us to determine the precise start and end of each clone. Perfect sequence sets were generated from the Sec library filtered to a survival level of 1%. These revealed that 76.4% were open reading frames with no stop codons, of which 96.2% were derived from genic ORFs, as opposed to spurious ORFs of no biological significance.
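The chance expectation and the resulting enrichment can be made explicit with a short calculation. The sketch below uses only the figures quoted in the text; the variable names are illustrative.

```python
from fractions import Fraction

# Two strands x three reading frames per strand = six possible frames,
# only one of which matches the frame of the gene.
N_FRAMES = 2 * 3
expected_genic = Fraction(1, N_FRAMES)

observed_genic = 0.735  # fraction of perfect-match reads that were genic ORFs

enrichment = observed_genic / float(expected_genic)

print(f"expected by chance: {float(expected_genic):.1%}")
print(f"observed:           {observed_genic:.1%}")
print(f"enrichment:         {enrichment:.1f}-fold")
```

Under this simple model the filtered library contains roughly four to five times more genic ORFs than random sampling of frames would give.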
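As a consistency check on these figures, the short sketch below multiplies the ORF fraction by the genic fraction among ORFs, and adds up the gene counts contributed by the different leaders. The total number of annotated genes is not stated here, so the value used below is back-calculated from the rounded 85% figure and should be read as an approximation, not as a reported result.

```python
# Fractions reported for the Sec "perfect sequence" set.
orf_fraction = 0.764      # reads that are ORFs with no stop codons
genic_given_orf = 0.962   # of those ORFs, fraction derived from genic ORFs

genic_overall = orf_fraction * genic_given_orf
print(f"genic ORFs among all perfect reads: {genic_overall:.1%}")

# Gene coverage across the three libraries.
sec_genes = 2712          # genes found with the Sec leader (85% of the total)
extra_genes = 352         # additional genes contributed by TAT and SRP
shared_genes = 1938       # genes found in all three libraries
covered_genes = sec_genes + extra_genes

# ASSUMPTION: total gene count inferred from a rounded percentage.
approx_total_genes = sec_genes / 0.85

print(f"genes covered by at least one library: {covered_genes}")
print(f"approx. coverage: {covered_genes / approx_total_genes:.0%}")
print(f"approx. share found in all three: {shared_genes / approx_total_genes:.0%}")
```

The product of the two fractions agrees with the 73.5% of genic reads reported for the perfect-match set, and the back-calculated percentages agree with the 96% and 61% figures quoted above.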
We hypothesize that real gene or mRNA fragments encode polypeptides that naturally evolved to fold correctly, thus driving the proper folding and activity of the folding reporter, while random ORFs generate peptides with no biological meaning that are more likely to negatively affect the folding, aggregation state or activity of the reporter.
In our sequencing experiments, we analyzed those clones corresponding to a survival of 1%. We reasoned that this would represent a balance between broad genome representation and the selection for clones encoding well folding domains. Figure 3 and additional file 1 show clearly that the greater the concentration of ampicillin used for filtering, the higher the activity of ß-lactamase, as determined by the nitrocefin colorimetric assay. Although we have not formally shown that domains fused to ß-lactamase with higher activity are better folded, similar experiments using GFP (green fluorescent protein) as a folding reporter [15, 25], in which proteins of interest are fused upstream of GFP and selected on the basis of clone fluorescence, have clearly demonstrated that clone fluorescence is directly proportional to the folding and solubility of the fused protein of interest when not fused to GFP. In these experiments, GFP can be considered to be analogous to the ß-lactamase, in that correct functioning of the reporter is dependent upon the folding and solubility state of the fused domain.
The technology described here is genome neutral and can be used to rapidly create a domainome library from any intronless genome, or collection of open reading frames of interest. The use of the random approach described here avoids the need for extensive analysis, primer synthesis or multiple PCRs, and creates a resource in which many different versions of each domain, differing by a few amino acids, are created. Once generated, it is expected that the protein fragments obtained by this approach will be useful for many purposes, including structural studies, antibody generation, protein/substrate binding analyses, domain shuffling for enzyme evolution and protein chips. Furthermore, once recloned into a phage display context, domainome libraries can be directly selected for gene fragments encoding domains with specific binding properties (e.g. to other proteins, domains, enzyme substrates) or enzyme activities, if appropriate activity based probes are available.
With this work we have demonstrated that domainome libraries can be easily generated by applying β-lactamase based filtering to randomly fragmented bacterial gDNA libraries. Once a library is generated, it can be used as a universal reagent to be screened for several activities. The identification of domains showing specific activity, instead of the testing of single genes, will allow functional annotation of the domains themselves: this annotation represents the first step towards the high throughput assignment of full length gene products to structural functions or to specific metabolic pathways.
Genomic DNA from Clostridium thermocellum (ATCC 27405) was kindly provided by Prof David Wu, Univ. Rochester. DNA was amplified by multiple displacement amplification (MDA) with a Repli-g screening kit (Qiagen) according to the manufacturer’s instructions. After 16 h amplification, DNA was fragmented using nitrogen gas based nebulisation for 1 min at 45 psi; average fragment size was around 500 bp (200-800 bp range).
Fragments were twice blunt-ended (Quick blunt kit, NEB) and gel-purified (Gel extraction kit, Qiagen) before cloning into the EcoRV cleaved POS, SOS, and TOS vectors. These vectors were obtained from the original pPAO phagemid vector by removing the g3p gene and either maintaining the Sec secretion leader (in the POS vector) or replacing it with SRP and TAT leaders (in SOS and TOS respectively) encoded by oligonucleotides.
After ligation, the three libraries were electroporated into E. coli DH5α F’ cells, plated on chloramphenicol agar plates and grown for 16 h at 37°C.
The fragmented genomic DNA libraries obtained in the three vectors were harvested from plates; 10 µL of each library were diluted in LB medium to OD600 0.5 and 100 µL (10⁸ cells, corresponding to 10-20 fold the starting library diversity) were plated on plates supplemented with chloramphenicol (34 µg/mL) alone, or containing ampicillin at different concentrations, ranging from 0.25 to 100 µg/mL. Plates were incubated for 20 h at 30°C.
Each library filtered to a survival rate of around 1% underwent deep sequencing. Filtered gDNA was removed as SfiI oriented fragments from the purified plasmid DNA. One µg of gel extracted DNA fragments for each library was used as starting material for the preparation of the 454 tagged libraries. DNA quality control was performed using the Qubit HS quantitation platform (Invitrogen); ligation of purified samples to specific adaptors and preparation of the single strand libraries (ssDNA) were performed following the manufacturer’s instructions (GS-FLX Titanium kit, Roche). The quality control on the ssDNA libraries was performed by capillary electrophoresis (Agilent Bioanalyzer 2100 with the RNA Pico 6000 LabChip kit; Agilent Technologies). The ssDNA libraries were then processed as required by the 454 sequencing protocol. Each enriched sample was separately loaded onto one-eighth of the PicoTiterPlate and sequenced.
Raw data were processed by a custom-made workflow procedure mainly based on Perl scripts. Briefly, sequences were mapped onto the C. thermocellum genome (Reference Sequence CP000568.1) using Gmap software and matching sequences were compared with annotated genes. Each gene was then identified by the number of mapping reads; a further reiterated procedure allowed us to analyze each library in terms of mapping, annotation and filtering features. Data are accessible through a website interface implemented in PHP and Java (http://www.interactomeataglance.org). See additional information for browsing directions.
Forty-eight clones from each library were picked from plates at different ampicillin concentrations (ranging from 0 to 100 µg/mL). After overnight growth in autoinduction media, culture supernatants were collected and tested for β-lactamase activity in a nitrocefin-based functional assay. Briefly, nitrocefin (EMD, Calbiochem) was diluted to 100 µg/mL in PBS and 50 µL of this working solution were added to 10 µL of culture supernatant. The assay was performed in the 96-well plate format and plates were read at 486 nm with a microplate reader (Infinite M200, Tecan) at different time-points (2 h, 6 h, 16 h). Signals were normalized to the positive control signal for each plate; the β-lactamase activity for each clone was reported as a percentage value (where the positive control has 100% activity).
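The comparison of mapped reads with annotated genes can be illustrated with a minimal frame check. This is a sketch under stated assumptions, not the Perl workflow used here: the `Interval` type and the `is_genic_orf` function are hypothetical, coordinates are taken as 0-based and half-open, parsing of the mapping output is not shown, and a fragment is called genic when its 5' end lies inside a gene on the same strand and falls on a codon boundary of that gene.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int    # 0-based, inclusive, on the forward strand
    end: int      # 0-based, exclusive, on the forward strand
    strand: str   # "+" or "-"


def is_genic_orf(fragment: Interval, gene: Interval) -> bool:
    """True if the fragment starts inside the gene, in the gene's frame."""
    if fragment.strand != gene.strand:
        return False
    if fragment.strand == "+":
        five_prime = fragment.start
        offset = five_prime - gene.start          # bases after the gene's first base
    else:
        five_prime = fragment.end - 1
        offset = (gene.end - 1) - five_prime      # counted along the reverse strand
    inside_gene = gene.start <= five_prime < gene.end
    return inside_gene and offset % 3 == 0


if __name__ == "__main__":
    gene = Interval(100, 400, "+")
    print(is_genic_orf(Interval(130, 330, "+"), gene))  # 5' end on a codon boundary
    print(is_genic_orf(Interval(131, 331, "+"), gene))  # shifted by one base
```

A fuller treatment would also handle fragments spanning gene boundaries and overlapping genes, which this sketch ignores.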
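The per-plate normalization can be sketched as follows. The function name and the absorbance readings are illustrative placeholders, not measured data, and no blank subtraction is applied because none is described above.

```python
def percent_activity(a486_by_clone: dict, positive_control_a486: float) -> dict:
    """Express each clone's A486 as a percentage of the plate's positive control."""
    if positive_control_a486 <= 0:
        raise ValueError("positive control signal must be positive")
    return {
        clone: 100.0 * a486 / positive_control_a486
        for clone, a486 in a486_by_clone.items()
    }


if __name__ == "__main__":
    # Hypothetical readings from one plate at a single time-point.
    plate = {"clone_A": 0.25, "clone_B": 0.5, "clone_C": 1.0}
    for clone, activity in percent_activity(plate, positive_control_a486=1.0).items():
        print(f"{clone}: {activity:.0f}% of positive control")
```

Normalizing within each plate, rather than across plates, keeps clones comparable even when the absolute signal of the positive control differs between plates or time-points.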
We are grateful to DOE, GTL for funding, JGI Los Alamos for sequencing, and Prof. David Wu for the C. thermocellum DNA. DS is grateful to Fondazione Cariplo Ricerca scientifica in ambito biomedico 2009, Regione Piemonte Piattaforma Biotecnologie Progetto IMMONC
This article has been published as part of BMC Genomics Volume 12 Supplement 1, 2011: Validation methods for functional genome annotation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.