Analysis of two large functionally uncharacterized regions in the Methanopyrus kandleri AV19 genome
© Jensen et al; licensee BioMed Central Ltd. 2003
Received: 20 December 2002
Accepted: 2 April 2003
Published: 2 April 2003
For most sequenced prokaryotic genomes, about a third of the protein coding genes annotated are "orphan proteins", that is, they lack homology to known proteins. These hypothetical genes are typically short and randomly scattered throughout the genome. This trend is seen for most of the bacterial and archaeal genomes published to date.
In contrast we have found that a large fraction of the genes coding for such orphan proteins in the Methanopyrus kandleri AV19 genome occur within two large regions. These genes have no known homologs except from other M. kandleri genes. However, analysis of their lengths, codon usage, and Ribosomal Binding Site (RBS) sequences shows that they are most likely true protein coding genes and not random open reading frames.
Although these regions can be considered as candidates for massive lateral gene transfer, our bioinformatics analysis suggests that this is not the case. We predict many of the organism specific proteins to be transmembrane and belong to protein families that are non-randomly distributed between the regions. Consistent with this, we suggest that the two regions are most likely unrelated, and that they may be integrated plasmids.
Typically, for a newly sequenced genome the number of unique genes is mentioned. This is usually claimed to be about one third of the annotated genes. However, this number is highly questionable as random open reading frames (ORFs) are often assigned as protein coding genes. Based on an analysis of protein length distributions, we have estimated the true number of genes in each of the completely sequenced prokaryotic genomes . Out of the estimated number genes in microbial genomes, M. kandleri contains the largest fraction of genes for which function cannot automatically be assigned based on sequence similarity (see supporting information at end of manuscript).
Pre-genome analysis of 16S ribosomal RNA had suggested M. kandleri to be placed close to the root of the Euryarchaeal tree. However, phylogenetic trees based on gene content, gene order, and ribosomal proteins places it toghether with the other archaeal methanogens [3, 4]. It thus comes as a surprise that its genome appears to contain large numbers of genes not present in the genomes of any of the other sequenced archaeal methanogens. Furthermore, the M. kandleri genome has been claimed to contain very few genes acquired through lateral gene transfer . It should however be noted that this claim was entirely based on an analysis of proteins with BLAST matches to sequences from other organisms. It is thus not possible to exclude that the two regions of unknown function could been transferred from other species that have not been characterized so far.
Results and discussion
To get an overview of the M. kandleri genome, we created circular visualizations known as genome atlases [5–7]. Figure 1 shows a customized atlas which summarizes the most interesting positional features of the M. kandleri genome: AT-content, predicted protein properties, and protein sequence similarity.
BLASTP searches of all predicted protein sequences in the M. kandleri genome were performed against a number of different databases. Comparison with SWISS-PROT was used for identifying proteins homologous with possible known function while a database of predicted archeal proteins was used for detecting conserved archeal proteins possibly missing in SWISS-PROT. The results of these two searches, as well as a search within the M. kandleri proteome, are shown in Figure 1.
Figure 1 reveals two large chromosomal regions containing mostly protein coding genes without significant sequence similarity to genes from other organisms. The smaller of the two regions, region I, is located at 1,063 kbp (kilobase pairs) to 1,182 kbp and the slightly larger region (1,231 kbp–1,390 k) will be referred to as region II. To rule out that the genes within these regions are simply missing from the databases mentioned above, a TBLASTN search was performed against all sequences in GenBank. This resulted in no significant matches to DNA from other organisms than M. handleri itself.
From the atlas visualization it is clear that region II (but not region I) has a much lower AT-content than the genome average. The average AT-content is only 35.1% compared to the already low genomic average of 38.8%. An atypical local base composition is often used as supporting evidence for lateral gene transfer, although it should never be used alone .
Orphan proteins – not just ORFs
However, region II has an extremely low AT-content which can give rise to long random ORFs. It is therefore not possible, based on the length distributions alone, to completely rule out that the genes annotated in this region could be random ORFs. To further investigate this, the codon usage of genes in the unknown regions was compared to that of genes from known parts of the genome (Figure 2(b)). From the plot it is concluded that the codon preferences are identical in the unknown regions and other parts of the genome, despite large differences in amino acid composition (see below). Given this, we find it highly unlikely that the annotated protein coding genes could be random ORFs. Furthermore, the similarity in codon usage also speak against the two regions having been acquired through lateral transfer of DNA. A more plausible explanation is that the two regions represent M. kandleri plasmids which have been integrated into the main chromosome.
Prediction of translation start
Discrimination between true protein coding genes and random ORFs is not the only problem in prokaryotic gene finding. Predicting the correct start codon is even more challenging as the ATG corresponding to the longest possible ORF is not necessarily the actual start codon used . In addition to analysis of codon usage, modelling of the ribosomal binding site can also help determine the correct translation start site. This has already been taken into account by the gene finder used for annotating the M. kandleri genome .
The consensus sequence for the ribosome binding site (RBS) in an organism can generally be deduced from the 3' sequence of the 16S ribosomal RNA. In the M. kandleri AV19 genome only one 16S rRNA sequence is annotated, ending with the sequence CACCTC-3'. However, a sequence comparison with 16S sequences from the Ribosomal Database Project  revealed that the most closely related sequences all end on CACCTCC-3'. As the first nucleotide after the annotated 16S rRNA in M. kandleri genome is indeed a cytosine, we suggest that the annotated 16S rRNA is missing one nucleotide. The corresponding RBS consensus sequence would thus be GGAGGTG, the reverse complement of the rRNA 3' end.
That GGGGG is the most overrepresented pattern just upstream of translation start would suggest that it is the preferred RBS sequence despite it not having perfect complementarity to the 16S rRNA 3' end. The mismatch between the 3'-end of the 16S rRNA and the most common RBS can easily be accommodated as it corresponds to the wobble base pairing between guanine and uracil, which is very common in RNA .
Protein families in M. kandleri
Distribution of M. kandleri specific protein families. Only the subset of the 45 protein families showing a preference for either region I or II is shown. The presence of specific protein families within each region suggests that the two regions serve different functions.
Properties of proteins from the unknown regions
The localization of different M. kandleri specific protein families in the two unknown regions (Table 1) together with the difference in AT-content (Figure 1) suggests that the two regions should be studied separately. To do so, linear zooms of the two regions were constructed (Figure 4).
Several striking subregions can be found within both of the unknown regions. Region I is dominated by proteins predicted to have N-terminal signal peptides, which strongly suggests them to be secreted (see Figure 4). However, this is largely due to two very large proteins from the MK-5 family located at 1,126 kbp to 1,136 kbp. In addition to this region, several other clusters of proteins predicted to have signal peptides are observed within both region I and region II.
In addition to clusters of secreted proteins, several groups of predicted transmembrane proteins are observed in both regions – in particular region I. Some degree of overlap is observed between proteins predicted to be transmembrane and those predicted to have signal peptides. The MK-9 protein family occurring exclusively within region I constitute one cluster of transmembrane proteins (1,138 k–1,149 k). Similarly, the five members of the MK-10 family in region II form a cluster of proteins predicted to be transmembrane (1,375 k–1,382 k). Considering the special membrane of M. kandleri which consists of a terpenoid lipid  and the extreme conditions under which it lives, the presence of special membrane proteins is hardly surprising.
Three of the MK-10 family proteins also contain low-complexity regions. In addition to these proteins, region II contains an abundance of other low-complexity proteins that are, however, not predicted to be transmembrane. Low-complexity regions often form unstructured non-globular domains [14, 15]. DNA binding proteins have previously been shown to contain more low-complexity regions than other proteins . It is thus tempting to speculate that some of the low-complexity proteins found in M. kandleri could be involved in stabilizing DNA at extreme temperatures.
Amino acid biases
Low-complexity regions are defined as regions with strong bias towards one or more amino acid residues. It thus possible to further characterize regions of low-complexity by studying which residues are overrepresented. Such biased regions will often lead to an unusual amino acid composition of the protein as such.
Also, Figure 5 shows an underrepresentation of methionine in proteins from both regions, as well as a strong bias against isoleucine and lysine in region II. In contrast to the positive biases, the negative ones appear to be due to a relatively weak bias in the majority of the proteins. The residues found to be underrepresented in region II have in common that they are encoded by AT-rich codons. This is consistent with the lower AT-content of region II compared to the rest of the genome.
This leads to an interesting "hen and egg" problem: is it the low AT-content of region II that is the cause of the biases in amino acid composition or vice versa? The lack of a positive bias for amino acid residues encoded by GC-rich codons would suggest the latter. This hypothesis is further supported by an analysis of AT-content of intergenic regions within region II to that of intergenic regions from other parts of the chromosome, which revealed no significant difference according to a Kolmogorov-Smirnov test .
A link between the amino acid compositions of proteins and their thermostability has previously been suggested by several groups [18–20]. There is little agreement between the compositional characteristics of thermostable proteins suggested, although asparagine and glutamine are generally found to be underrepresented while arginine is overrepresented. Although this is consistent with the amino acid biases observed in region II, the correlation is very weak as the strongest biases do not correlate with thermostability. We therefore see no reason to suspect proteins from either of the two unknown regions to have different thermostability than other M. kandleri proteins.
While claims of newly sequenced genomes containing large numbers of unique proteins are often encountered, the existence of these proteins is rarely backed up by anything but a gene finding method. While such spurious gene predictions are typically scattered throughout the genome, we have discovered two distinct, large regions in the genome of M. kandleri where only a small fraction of the annotated genes share significant similarity with known proteins from other organisms. Analysis of length distributions and codon usage strongly suggests the presence of a large number of unique protein coding genes within these regions and rejects the hypothesis that the regions could have been aquired through lateral gene transfer. Instead, we believe the most likely origin to be integration of plasmids. Extensive bioinformatics analysis of the proteins encoded by these regions suggests many of them to be transmembrane, but little else can be predicted about their functions. Additional experimental data is needed in order to learn more about these proteins.
The genome sequence and annotation of the M. kandleri AV19 genome was downloaded from GenBank  along with 13 other sequenced archaeal genomes (A. pernix, P. aerophilum, S. solfataricus, S. tokodaii, T. acidophilum, T. volcanium, A. fulgidus, M. thermoautotrophicum, M. jannaschii, M. acetivorans, M. mazei, P. abyssi, and P. horikoshii).
Sequence similarity searches
All sequence similarity searches were performed using gapped BLAST with low complexity filter enabled . The conceptual translations of all annotated protein coding genes in the M. kandleri genome were searched against four different sequence databases: the M. kandleri proteome itself, SWISS-PROT , GenBank , and a set of all annotated protein coding genes from the 13 other sequenced archaeal genomes. In the BLAST results from the search against the M. kandleri proteome, the self match of each protein was discarded.
Prediction of protein properties
Protein properties were predicted from sequence using a wide array of prediction methods. Low-complexity regions in the sequences were identified using SEG, which is the program used by BLAST to mask such regions . SignalP was used to predict signal peptides using the model trained on eukaryotic proteins . Transmembrane helices were predicted using the TMHMM method . These three protein features were the only that were found to correlate significantly with the unknown regions according to Kolmogorov-Smirnov tests .
The other features tested for correlations were: grand average hydropathity , instability index , predicted glycosylation sites , predicted phosphorylation sites , PEST regions , and secondary protein structure predicted by PSIPRED .
The matches found by the BLAST searches described above are visualized as three circles (Figure 1) representing each of the databases against which BLAST searches were performed. For every protein the negative logarithm of E-value of the most significant match was calculated (imposing a maximum score of 15 for highly significant matches). These values were mapped to the chromosomal location of the corresponding genes color coded so that regions with many significant matches are colored whereas regions with few matches are gray.
The predicted protein properties were plotted by making similar mappings of the predicted fractions of transmembrane and low-complexity residues in each protein. Signal peptide predictions were represented by their mean S-score . All three sets of values were color coded using a scheme so that only regions containing unusually many secreted, transmembrane, and/or low-complexity proteins will be visible.
Finally, we include a circle showing the AT-content (which is closely correlated to many DNA structural properties ). A double sided color scheme was used which highlights regions of unusually high or low AT-content compared to the average for the genome.
Length distribution plots
Protein length distributions for proteins from different regions of the M. kandleri were plottes as density estimates. Rather than using simple histograms, Gaussian kernel density estimates were calculated for log-transformed protein lengths. The widths of the Gaussian kernels were estimated based on the number of data points and their spread . The log-length density estimates were subsequently transformed back to yield ordinary length distributions.
Calculation of codon usage and AT-content
For genes residing in the each of the unknown regions and for genes from known regions, the codon usage was calculated by first counting the frequency of each of the 61 coding triplets. The triplets encoding each amino acid were then normalized to a sum of one to cancel effects due to the amino acid composition. We refer to these normalized frequencies as the codon usage. Applying the codon usage estimated from all annotated protein coding regions as weighting factors, the average codon AT-content for each amino acid was calculated as a weighted average of the AT-content of all codons encoding the amino acid in question.
RBS pattern search
The sequences located at positions -30 to -1 relative to translation start were extracted for all annotated protein coding genes. These sequence were used as positive examples in the subsequent search. The sequences for positions -60 to -31, again relative to translation start, were extracted for use as negative examples. All DNA words up to a length of 10 bp were tested for significant overrepresentation in the positive examples relative to the negative examples using a hypergeometric test as described by Jensen and Knudsen .
Fraction of proteins assignable to cellular role categories. The estimated number of genes in each genome is compared to the number of genes for which a cellular role could be assigned using EUCLID.
No. genes estimated
No. genes assigned
This work was supported by grants from the Danish National Research Foundation and the Danish Natural Science Research Council. Marie Skovgaard is funded by EU Cell Factory Project, Screen, QLK3-CT-2000-00649.
- Skovgaard M, Jensen L, Brunak S, Ussery D, Krogh A: On the total number of genes and their length distribution in complete micro-bial geuomes. Trends in Genetics. 2001, 17: 425-428. 10.1016/S0168-9525(01)02372-1.View ArticlePubMedGoogle Scholar
- Rogozin I, Makarova K, Murvai J, Czabarka E, Wolf Y, Tatusov R, Szekely L, Koonin E: Connected gene neighborhoods in prokarytic genomes. Nucl Acids Res. 2002, 30: 2212-2223. 10.1093/nar/30.10.2212.PubMed CentralView ArticlePubMedGoogle Scholar
- Slesarev A, Mezhevaya K, Makarova K, Polushin N, Shcherbinina O, Shakhova V, Belova G, Aravind L, Natale D, Rogozin I: The complete genome of hyperthermophile Methanopyrus kandleri AV19 and monophyly of archaeal methanogens. Proc Natl Acad Sci USA. 2002, 99: 4644-4649. 10.1073/pnas.032671499.PubMed CentralView ArticlePubMedGoogle Scholar
- Sebaihia M, Bentley S, Thomson N, Holden M, Parkhill J: Tales of the unexpected. Trends in Microbiology. 2002, 10: 261-262. 10.1016/S0966-842X(02)02379-X.View ArticlePubMedGoogle Scholar
- Jensen L, Friis C, Ussery D: Three views of microbial genomes. Res Microbiol. 1999, 150: 773-777. 10.1016/S0923-2508(99)00116-3.View ArticlePubMedGoogle Scholar
- Pedersen A, Jensen L, Stasrfeldt H, Brunak S, Ussery D: A DNA structural atlas of E. coli. J Mol Biol. 2000, 299: 907-930. 10.1006/jmbi.2000.3787.View ArticlePubMedGoogle Scholar
- Skovgaard M, Jensen L, Friis C, Stærfeldt HH, Worning P, Brunak S, Ussery D: The atlas visualisation of genome-wide information. In, Methods in Microbiology. Edited by: Wren B, Dorrell N. 2002, Academic Press, London, UK, 33: 49-63.Google Scholar
- Ragan M: On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett. 2001, 201: 187-191. 10.1016/S0378-1097(01)00262-2.View ArticlePubMedGoogle Scholar
- Hannenhalli S, Hayes W, Hatzigeorgiou A, Fickett J: Bacterial start prediction. Nucl Acids Res. 1999, 27: 3577-3582. 10.1093/nar/27.17.3577.PubMed CentralView ArticlePubMedGoogle Scholar
- Besemer J, Lomsadze A, Borodovsky M: GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. implications for finding sequence motifs in regulatory regions. Nucl Acids Res. 2001, 29: 2607-2618. 10.1093/nar/29.12.2607.PubMed CentralView ArticlePubMedGoogle Scholar
- Maidak B, Cole J, Lilburn T, Parker C, Saxman P, Farris R, Garrity G, Olsen G, Schmidt T, Tiedje J: The RDP-II (Ribosomal Database Project). Nucl Acids Res. 2001, 29: 173-174. 10.1093/nar/29.1.173.PubMed CentralView ArticlePubMedGoogle Scholar
- Gautheret D, Konings D, Gutell R: G: U base pairing motifs in ribosomal RNA. RNA. 1995, 1: 807-814.PubMed CentralPubMedGoogle Scholar
- Hafenbradl D, Keller M, Thiericke R, Stetter K: A novel unsaturated archaeal ether core lipid from the hyperthermophile Methanopyrus kandleri. Syst Appi Microbiol. 1993, 16: 165-169.View ArticleGoogle Scholar
- Wright P, Dyson H: Intrinsically unstructured proteins: Reassessing the protein structure – function paradigm. J Mol Biol. 1999, 293: 321-331. 10.1006/jmbi.1999.3110.View ArticlePubMedGoogle Scholar
- Dunker A, Obradovic Z: The protein trinity – linking function and disorder. Nature Biotechnology. 2001, 19: 805-806. 10.1038/nbt0901-805.View ArticlePubMedGoogle Scholar
- Wise M: Ojpy: a software tool or low complexity proteins and protein domains. Bioinformatics. 2001, 17: S288-S295. 10.1093/bioinformatics/17.3.288.View ArticlePubMedGoogle Scholar
- Young I: Proof without prejudice: use of the Kolmogorov-Smirnov test for the analysis of histograms from flow systems and other sources. J Histochem Cytochem. 1977, 25: 935-941.View ArticlePubMedGoogle Scholar
- Haney P, Badger J, Buldak G, Reich C, Woese C, Olsen G: Thermal adaption analyzed by comparison of protein sequences from mesophilic and extremely thermophilic Methanococcus species. Proc Nail Acad Sci USA. 1999, 96: 3578-3583. 10.1073/pnas.96.7.3578.View ArticleGoogle Scholar
- Kreil D, Ouzounis C: Identification of thermophilic species by the amino acid composition deduced from their genomes. Nucl Acids Res. 2001, 29: 1608-1615. 10.1093/nar/29.7.1608.PubMed CentralView ArticlePubMedGoogle Scholar
- Kumar S, Nussinov R: How do thermophilic proteins deal with heat?. Cell Mol Life Sci. 2001, 58: 1216-1233.View ArticlePubMedGoogle Scholar
- Benson D, Karsch-Mizrachi I, Lipman D, Ostell J, Rapp B, Wheeler D: GenBank. Nucl Acids Res. 2002, 30: 17-20. 10.1093/nar/30.1.17.PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucl Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl Acids Res. 2000, 28: 45-48. 10.1093/nar/28.1.45.PubMed CentralView ArticlePubMedGoogle Scholar
- Wootton J, Federhen S: Statistics of local complexity in amino-acid-sequences and sequence data bases. Comput Chem. 1993, 17: 149-163. 10.1016/0097-8485(93)85006-X.View ArticleGoogle Scholar
- Nielsen H, Brunak S, von Heijne G: Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng. 1999, 12: 3-9. 10.1093/protein/12.1.3.View ArticlePubMedGoogle Scholar
- Krogh A, Larsson B, von Heijne G, Sonnhammer E: Predicting transmembrane protein topology with a hidden markov model: application to complete genomes. J Mol Biol. 2001, 305: 567-580. 10.1006/jmbi.2000.4315.View ArticlePubMedGoogle Scholar
- Kyte J, Doolittle R: A simple method for displaying. J Mol Biol. 1982, 157: 105-132.View ArticlePubMedGoogle Scholar
- Guruprasad K, Reddy B, Pandit M: Correlation between stability of a protein and its di-peptide composition: A novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. 1990, 4: 155-161.View ArticlePubMedGoogle Scholar
- Hansen J, Lund O, Tolstrup N, Gooley A, Williams K, Brunak S: tOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconj J. 1998, 15: 115-130. 10.1023/A:1006960004440.View ArticlePubMedGoogle Scholar
- Blom N, Gammeltoft S, Brunak S: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol. 1999, 294: 1351-1362. 10.1006/jmbi.1999.3310.View ArticlePubMedGoogle Scholar
- Rechsteiner M, Rogers S: PEST sequences and regulation by pro-teolysis. Trends Biochem Sci. 1996, 21: 267-271. 10.1016/0968-0004(96)10031-1.View ArticlePubMedGoogle Scholar
- Jones D: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292: 195-202. 10.1006/jmbi.1999.3091.View ArticlePubMedGoogle Scholar
- Silverman B: Density Estimation for Statistics and Data Analysis. Chapman & Hall, London. 1986, Chap 3-Google Scholar
- Jensen L, Knudsen S: Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics. 2000, 16: 326-333. 10.1093/bioinformatics/16.4.326.View ArticlePubMedGoogle Scholar
- Tamames J, Ouzounis C, Casari G, Sander C, Valencia A: EUCLID: automatic classification of proteins in functional classes by their database annotations. Bioinformatics. 1998, 14: 542-543. 10.1093/bioinformatics/14.6.542.View ArticlePubMedGoogle Scholar
- Andrade M, Brown N, Leroy C, Hoersch S, de Daruvar A, Reich C, Franchini A, Tamames J, Valencia A, Ouzounis C: Automated genome sequence analysis and annotation. Bioinformatics. 1999, 15: 391-412. 10.1093/bioinformatics/15.5.391.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.