Analysis of Nanoarchaeum equitans genome and proteome composition: indications for hyperthermophilic and parasitic adaptation
© Das et al. 2006
Received: 30 May 2006
Accepted: 25 July 2006
Published: 25 July 2006
Nanoarchaeum equitans, the only known hyperthermophilic archaeon exhibiting a parasitic lifestyle, has raised new questions about the evolution of the Archaea and provides a model of choice for studying the genome landmarks correlated with thermo-parasitic adaptation. In this context, we have analyzed the genome and proteome composition of N. equitans and compared it with those of mesophiles, other hyperthermophiles and obligatory host-associated organisms.
Analysis of nucleotide, codon and amino acid usage patterns in N. equitans indicates the presence of distinct selective constraints, probably due to its adaptation to a thermo-parasitic lifestyle. Among the conspicuous characteristics of its hyperthermophilic adaptation are overrepresentation of purine bases in protein coding sequences, higher GC-content in tRNA/rRNA sequences, distinct synonymous codon usage, enhanced usage of aromatic and positively charged residues, and decreased frequencies of polar uncharged residues, as compared to those in mesophilic organisms. Positively charged amino acid residues are relatively abundant in the encoded gene-products of N. equitans and other hyperthermophiles, which is reflected in their isoelectric point distribution. Pairwise comparison of 105 orthologous protein sequences shows a strong bias towards replacement of uncharged polar residues of mesophilic proteins by Lys/Arg, Tyr and some hydrophobic residues in their Nanoarchaeal orthologs. The traits potentially attributable to the symbiotic/parasitic lifestyle of the organism include apparently weak translational selection in synonymous codon usage and a marked heterogeneity in membrane-associated proteins, which may be important for N. equitans to interact with the host and hence may help the organism to adapt to the strictly host-associated lifestyle. Despite being strictly host-dependent, N. equitans follows the cost minimization hypothesis.
The present study reveals that the genome and proteome composition of N. equitans are marked with the signatures of dual adaptation - one to high temperature and the other to obligatory parasitism. While the analysis of nucleotide/amino acid preferences in N. equitans offers an insight into the molecular strategies taken by the archaeon for thermo-parasitic adaptation, the comparative study of the compositional characteristics of mesophiles, hyperthermophiles and obligatory host-associated organisms demonstrates the generality of such strategies in the microbial world.
The hyperthermophilic archaeon Nanoarchaeum equitans is characterized by several intriguing features. It is the only known parasitic archaeon, and for survival it must be in contact with the crenarchaeon host Ignicoccus. Its genome size is only 490 kb, representing the smallest microbial genome known to date, and yet it has the highest coding density, encoding 536 genes. Phylogenetic analyses suggested that this microbe is probably a derived but genomically stable parasite that diverged anciently from the archaeal lineage.
The genes for several vital metabolic pathways appear to be missing in Nanoarchaeum. There are two plausible explanations. N. equitans might represent an ancient species and hence possess a small genome. Alternatively, it might have gone through a process of genome reduction as a strategy of adaptation to the obligatory parasitic lifestyle, as observed in many other parasitic/symbiotic organisms. However, in obligatory intracellular bacteria, many genes involved in DNA recombination and repair are usually lost along with the biosynthetic and metabolic genes [3, 4], whereas N. equitans possesses most of the DNA repair enzymes and the complete genetic machinery necessary for transcription, translation and DNA replication. The complexity of its information processing systems and the simplicity of its metabolic apparatus, therefore, suggest the presence of an unanticipated world of organisms yet to be characterized.
Most of the obligatory symbiotic/parasitic bacteria are characterized by the presence of only one or two rRNA operons, a small number of genes for tRNA isoacceptors, a slow growth rate and an overall AT-richness [5–7]. It has been proposed that at the beginning of the symbiotic or parasitic integration, the loss of the genes involved in DNA repair favored the bias toward A+T content of such genomes [3, 4, 8]. However, the presence of a full set of archaeal DNA repair and recombination enzymes in N. equitans contradicts the established hypothesis regarding AT-richness of its genome. Evidence for apparently little or no translational selection in synonymous codon usage has been reported for most of the species with reduced genomes examined so far, such as Borrelia burgdorferi, Buchnera aphidicola, Helicobacter pylori, Bartonella, Wigglesworthia and Tropheryma whipplei. Some of these bacteria exhibit strong base compositional asymmetries between leading and lagging strands of replication. It was, therefore, of interest to investigate whether N. equitans, an archaeon, is also characterized by any of these typical traits of reduced genomes.
The compositions of the N. equitans genome and proteome are expected to bear the signatures not only of parasitism, but also of hyperthermophilicity. Comparative analysis of complete genomes of several hyperthermophilic archaea and bacteria revealed that organisms adapted to high temperature require a coordinated set of evolutionary changes towards stability of mRNA and of codon-anticodon interactions, and towards increased thermostability of encoded proteins through van der Waals interactions, a larger number of residues in the alpha-helical conformation, enhanced secondary structure propensity, higher core hydrophobicity, additional networks of hydrogen bonds, increased ionic interactions, increased packing density, decreased length of surface loops, etc. With increase in growth temperature, microbial organisms tend to acquire base A and lose base C while keeping the contents of bases T and G relatively constant, and there is a clear link between a particular pattern of codon usage and elevated growth temperature.
The current report presents an extensive study on the genome and proteome composition of N. equitans, along with a comparative analysis of the compositional characteristics of other mesophiles, hyperthermophiles and obligatory host-associated organisms. Only A+T-rich organisms were selected for analysis, so that the inter-species differences in nucleotide/amino acid usage patterns due to mutational bias could be minimized and any difference in such usage patterns between the mesophilic and hyperthermophilic organisms could be fairly attributed to their adaptation to the growth temperature. The study provides a more detailed view of the genome-wide strategies employed by N. equitans for adaptation to the hyperthermophilic environment and parasitic lifestyle.
General features of the N. equitans genome and 14 other microbial genomes under study
Columns: Accession No. (GenBank); Optimal growth temp. (°C); ORFs under study; GC-content (%). Table body not recovered.
While the predicted ORFs of hyperthermophiles are characterized by overrepresentation of purine content, the structural RNA genes of N. equitans and other hyperthermophiles exhibit much higher GC-content than those of the mesophiles (Table 1). The GC-content of tRNA/rRNA genes exhibits a strong positive correlation (r = 0.98 and 0.96 at p < 10⁻⁴ for rRNA and tRNA respectively) with the optimal growth temperature (OGT) (Fig. 1d). A similar observation has been reported earlier in thermophilic prokaryotes [26, 27]. The higher GC-content of non-coding RNA sequences in N. equitans and other hyperthermophiles could be a strategy to facilitate the intramolecular stabilization of RNA secondary structure at elevated temperature. This notion is in agreement with earlier reports [28, 29] demonstrating a significant correlation between the growth temperature and the GC-content of 16S rRNA, which is shown to be strongest in the double-stranded stem regions of the rRNA.
Overrepresentation of positively charged residues in the gene-products of N. equitans and other hyperthermophiles is also apparent from Fig. 3c, which shows the predicted isoelectric point (pI) distribution of the proteins of the fifteen organisms under study. A bimodal distribution of isoelectric points is observed, with an acidic peak in the pI range of 5.0–5.5 and a basic peak at ~9.5. For the mesophiles, the acidic peak is much larger than the basic peak, while the reverse is the case for hyperthermophiles. For N. equitans, the number of basic proteins is appreciably higher even than that observed with other hyperthermophiles (Fig. 3c). These results suggest that the hyperthermophilic proteomes are characterized by a relative predominance of basic proteins. In addition to the relative abundance of positively charged residues, the frequency distribution of the encoded proteins of N. equitans and other hyperthermophiles is also shifted significantly (p < 10⁻⁷) towards higher aromaticity, as compared to that of the mesophiles (Fig. 3d).
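The charge and aromaticity comparisons described above can be illustrated with a short sketch. It is not the procedure used in the study: it assumes that positively charged residues are Lys and Arg, negatively charged residues are Asp and Glu, and aromatic residues are Phe, Trp and Tyr. The function names `charge_and_aromaticity` and `percent_pn_above_one` are hypothetical.

```python
def charge_and_aromaticity(protein):
    """Return charge and aromaticity indices for one protein sequence.

    Assumptions (not taken from the paper): positive = K, R;
    negative = D, E; aromatic = F, W, Y.
    """
    protein = protein.upper()
    n = len(protein)
    if n == 0:
        raise ValueError("empty protein sequence")
    pos = sum(protein.count(a) for a in "KR")
    neg = sum(protein.count(a) for a in "DE")
    aro = sum(protein.count(a) for a in "FWY")
    # P/N ratio: positively over negatively charged residues.
    pn_ratio = pos / neg if neg else float("inf")
    return {
        "positive_pct": 100.0 * pos / n,
        "negative_pct": 100.0 * neg / n,
        "pn_ratio": pn_ratio,
        "aromaticity": aro / n,
    }


def percent_pn_above_one(proteins):
    """Percentage of proteins in a proteome with P/N ratio > 1."""
    flags = [charge_and_aromaticity(p)["pn_ratio"] > 1 for p in proteins]
    return 100.0 * sum(flags) / len(flags)
```

Applied to each proteome in turn, `percent_pn_above_one` gives one value per organism that could then be compared with the optimal growth temperature.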
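Differences between various indices of N. equitans proteins and their mesophilic orthologs
Surviving column header: N. equitans proteins. Surviving row labels: Average Hydrophobicity (Sweet & Eisenberg); Average Hydrophobicity (Kyte & Doolittle); Positively Charged Residues (%); Negatively Charged Residues (%); Polar Uncharged Residues (%). Surviving significance values, in extraction order (row assignment not recoverable): 1.7 × 10⁻¹¹; 4.2 × 10⁻⁹; 2.1 × 10⁻³; 3.0 × 10⁻⁶; 1.4 × 10⁻⁵; 1.1 × 10⁻¹³. Remaining table body not recovered.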
Trends in amino acid replacements in N. equitans proteins and their mesophilic orthologs
Columns: N. equitans amino acid; Mesophilic homologs amino acid. Table body not recovered.
Major trends in synonymous codon and amino acid usage, as obtained from COA of RSCU and amino acid usage in genes/gene-products encoded by N. equitans
Columns, given once for each of the two analyses: Variability explained (%); Source of variation; Correlation coefficient* (r-value). Surviving panel header: Amino acid usage. Table body not recovered.
Comparison of amino acid usage between two clusters of probable membrane associated proteins and between potential highly and lowly expressed genes
Columns: Upper cluster genes; Lower cluster genes; Highly expressed genes; Lowly expressed genes. Table body not recovered.
Another important source of intra-proteomic variation in amino acid usage is gene expressivity, as indicated by the presence of potential highly expressed genes near the positive extreme of axis 1 and at the negative extreme of axis 3 (Fig. 5b). Both axes exhibit significant correlation with CAI values of the genes (Table 4). The significant positive correlation of MMW with axis 3 and the negative correlation between axis 1 and aromaticity (Table 4) suggest that the potential highly expressed genes in N. equitans have a tendency to avoid the heavier residues, including the aromatic ones (Table 5). N. equitans is a strictly host-adapted microorganism, which can exploit the cellular machinery of host organisms for its own survival. It is therefore interesting to note that it follows the cost minimization hypothesis, which claims that highly expressed genes tend to use small and energetically less expensive amino acids in their encoded proteins.
To understand the sources of intragenomic variation in codon usage of N. equitans, we applied a COA on RSCU of its 487 ORFs. Axis 1 exhibits significant positive correlation with the CAI values of the genes and also exhibits slight but significant positive correlation with A3S (Table 4). Most of the potential highly expressed genes, including ribosomal proteins, are clustered at the positive extreme of axis 1 (Fig. 6b). In COA on absolute frequencies of synonymous codon usage also, the first major axis (representing 11.5% of total variation) exhibits significant correlation with CAI values (r = -0.52, p < 0.0001), and the putative highly expressed genes are clustered on the negative side of that axis. These observations suggest that the major trend in synonymous codon usage is gene expressivity. Usage of 16 codons increases significantly (p < 0.05) in potential highly expressed genes, most of which prefer to use A-ending or C-ending synonymous codons [see Additional file 2]. However, the frequencies of G-ending or U-ending codons either remain almost constant between potential highly and lowly expressed genes (except the AGG codon for Arg, UGU for Cys and GGU for Gly), or show a marked fall in potential highly expressed genes. Preference for C-ending codons in highly expressed genes against the genome-wide mutational bias is probably a consequence of translational selection. No known parameter of codon usage, base or amino acid composition is found to have significant correlation with the position of sequences on axis 2 (Table 4). Interestingly, however, there is a divergence in the distribution of a few genes along axis 2 near the negative extreme of axis 1. Careful examination reveals that this is only due to the differential usage of four rare synonymous codons (CGN of Arg) in N. equitans.
Variation in synonymous codon usage in N. equitans and seven other obligatory host-associated microbial organisms
Columns: No. of tRNA genes; No. of rRNA operons; ORFs under study; Variation explained by COA on RSCU (%). Surviving row label: Ureaplasma parvum serovar 3. Table body not recovered.
The present analysis indicates that the dual adaptation of N. equitans to high temperature and to an obligate parasitism has imposed selective constraints on nucleotide usage at synonymous and nonsynonymous codon positions, modulating thereby its genome/proteome composition. Thermal adaptation involves overrepresentation of purine bases in protein coding sequences, higher GC-content of the structural RNA genes, enhanced usage of positively charged residues, higher frequencies of aromatic residues, decrease in polar uncharged residues in the encoded protein etc., while parasitic adaptation is reflected in the extreme genome reduction, presence of weak translational selection for synonymous codon usage, limited number of tRNAs and rRNAs, large heterogeneity in membrane associated proteins and so on.
One of the most exciting observations is the significant increase in the usage of positively charged amino acid residues in encoded proteins of N. equitans and other hyperthermophiles compared to that in mesophilic organisms. A strong positive correlation between the optimal growth temperature of the organisms and the percentage of proteins with P/N ratio > 1 in the respective proteomes (Fig. 3b), the relatively basic nature of the proteomes of N. equitans and other hyperthermophiles, as depicted by the isoelectric point distribution (Fig. 3c), and the bias in the replacement of uncharged polar residues of mesophilic proteins by positively charged residues (mainly Lys) in the N. equitans orthologs (Table 3) all point towards a strong preference for positively charged amino acids in the gene-products of hyperthermophiles. In parallel, there has also been an increase in aromatic residues (especially Tyr) in encoded proteins of N. equitans and other hyperthermophiles (Fig. 3d; Table 3). Greater involvement of positively charged residues at or near protein surfaces may increase the probability of salt bridge formation with negatively charged residues, while a simultaneous increase in aromatic residues may strengthen the cation-π interaction. When a cationic side chain comes near an aromatic side chain within a protein, the geometry is known to be biased towards one that would experience a favorable cation-π interaction. It was suggested earlier that both salt-bridge and cation-π interactions may play important roles in thermostability. Tyrosine, by itself, may also contribute to protein thermostability. Selective increase in Lys/Arg and aromatic residues in N. equitans may, therefore, be a strategy of survival at high temperature. A recent atomistic simulation study has revealed that, among the charged residues, Lys has a much greater number of accessible rotamers than Arg and may entropically stabilize the folded states of proteins.
Marked reduction in the frequencies of uncharged polar residues may also contribute to thermostability by avoiding the deamidation and backbone cleavages involving Asn and Gln, which can be catalyzed by Ser and Thr [45, 46]. According to the Sweet and Eisenberg scale, there is a significant increase in average hydrophobicity in the encoded proteins of N. equitans, as compared to their mesophilic orthologs. This may also be a part of the measures taken for environmental adaptation. The Kyte-Doolittle scale, however, did not indicate any significant difference in average hydrophobicity between these two groups of proteins. It was suggested earlier by Haney et al. that some of the established hydrophobicity scales are strongly correlated to the differences between the proteins of mesophiles and thermophiles, whereas others are not. Replacement of the uncharged polar residues of mesophilic proteins by more hydrophobic residues in N. equitans orthologs may lead to an increase in the extent of the hydrophobic core and hence to a decrease in the solvent accessible surface area of the protein. Therefore, the stability of N. equitans proteins at extremely high temperatures is apparently provided by significant modifications in their sequences toward enrichment of certain residues.
It is interesting to note that, unlike in mesophilic organisms, in N. equitans and other hyperthermophiles there is a markedly differential selection for nucleotide usage in protein coding and structural RNA sequences. Both nonsynonymous and synonymous codon positions of their coding sequences are purine rich, and this has two probable consequences. Firstly, due to the purine richness of protein-coding sequences, the organisms may minimize unnecessary RNA-RNA interactions and prevent double-stranded RNA formation within the molecule. Secondly, the higher purine content at nonsynonymous codon positions correlates well with the increased frequencies of certain residues in the encoded proteins. These may help the organism to adapt to high temperature. In contrast, the structural RNA sequences (tRNAs and rRNAs) of these organisms are characterized by significantly higher GC-content and hence by an increased number of hydrogen bonds, which may facilitate intramolecular stabilization at elevated temperature [29, 48]. Thus, in N. equitans, selection for the prevalence of purine bases in ORFs and the GC-richness of non-coding RNA sequences may also be the consequence of its hyperthermophilic adaptation.
Recent studies on several microbial genomes indicate a close connection between synonymous codon usage bias, tRNA abundance, number of rRNA operons, optimal generation time and genome size [6, 7, 49]. Most of the species with reduced genomes are host-associated microorganisms, characterized by the presence of only one or two rRNA operons, a small number of tRNA genes, long generation time and overall AT-richness. Evidence for apparently little translational selection has been reported in most of these organisms. N. equitans, which is an archaeal parasite with the smallest genome known so far, encodes only a limited number of tRNAs (38 identified tRNAs) and single copies of 5S, 16S and 23S rRNA. Although N. equitans is a hyperthermophilic archaeon, the evidence for relatively weak translational selection on synonymous codon usage is consistent with the earlier observations on several bacteria adapted to a strictly host-associated lifestyle. Furthermore, the synonymous codon usage pattern of N. equitans forms a subset of the patterns observed typically in hyperthermophilic microbial species and is quite distinct from the patterns of the mesophilic organisms (Fig. 5a). Synonymous codon usage in Nanoarchaeal genes, therefore, reveals a dual adaptation to obligatory parasitism and hyperthermophilicity.
As demonstrated by the COA on amino acid usage, probable membrane associated proteins exhibit two different clusters (Fig. 4a). The amino acid usage profile and the predicted secondary structures of the members of these two clusters are quite distinct from one another (Table 5). Most of the variations in cell-surface proteins may be potentially important for N. equitans to interact with the Ignicoccus host and probably evolved during the course of parasitic and/or thermal adaptation. It is also important to note that in spite of its obligatory parasitic lifestyle, there is a tendency to follow the cost-minimization hypothesis at lower level as proteins encoded by highly expressed genes are preferentially constructed with some smaller and energetically less expensive amino acids. Existence of cost minimization effect in host-associated organisms might be due to a genome-level adaptation to utilize less expensive and small residues from the host in the highly expressed genes [12, 14, 50]. This might have an evolutionary advantage to minimize host energy exhaustion for maintaining continued association and the chance of elimination by the host.
Hyperthermophilic organisms, in general, have comparatively smaller genomes than mesophilic organisms. It might be advantageous for hyperthermophiles to maintain multiple copies of chromosomes per cell, owing to a probable need for a reserve supply of intact chromosomes to compensate for the greater chance of DNA double strand breaks at high temperature [51, 52]. Furthermore, the faster replication of a small genome is likely to be more favorable in environments with temperatures near or above 100°C. Many microbial organisms living in close association with other organisms in an obligate symbiotic or parasitic relationship have also experienced a reduction in genome size with respect to their free-living ancestors. The N. equitans genome lacks the genes for central metabolism, primary biosynthesis and bioenergetic apparatus, which are expected to have been present in the common archaeal ancestor. In contrast to mesophilic organisms, it possesses the simplest functional protein folding system: the genome contains only single copies of homologues of prefoldin α- and β-subunits, Hsp60 and sHsp. Unlike other obligate symbiotic/parasitic organisms, N. equitans has a well-organized DNA repair mechanism with a full set of archaeal DNA repair and recombination enzymes. Furthermore, despite the small genome size, it devotes a large amount of coding capacity to surface-associated proteins, suggesting that the interaction with its host may play a major part in the parasitic adaptation of the organism. Hence, it can be inferred that the unusual genome reduction and genome composition in N. equitans are the consequences of both hyperthermophilic and parasitic adaptation, and that during the coevolutionary process with its Ignicoccus host, N. equitans may have experienced a dramatic decrease in genome size, retaining only the genes essential for its thermo-parasitic lifestyle.
Comprehensive analysis on the N. equitans genome along with its comparison to other mesophiles, hyperthermophiles and host-associated organisms allowed us to understand how the dual adaptation of N. equitans to high temperature and to an obligate parasitism can influence the nucleotide usage at synonymous and nonsynonymous codon positions, modulating thereby its genome/proteome composition. Thermal adaptation involves overrepresentation of purine bases in protein coding sequences, higher GC-content of the structural RNAs, enhanced usage of positively charged residues and aromatic residues, decrease in polar uncharged residues in the encoded protein and so on, while the parasitic adaptation is reflected in the extreme genome reduction, presence of weak translational selection for synonymous codon usage, large heterogeneity in membrane associated proteins etc. Our findings not only offer an insight into the mechanisms of genomic adaptation of N. equitans to high temperature and parasitism, but also evaluate the generality of such mechanisms in the microbial world.
All predicted protein coding sequences and the sequences of structural RNAs (tRNAs and rRNAs) of Nanoarchaeum equitans Kin4-M were extracted from NCBI GenBank (Version 145.0). To understand temperature related traits, we compiled sequences of predicted protein coding genes and structural RNAs from seven hyperthermophilic and seven mesophilic microbial organisms from GenBank. For comparison, these completely sequenced microbial organisms were selected because their genomic GC-content closely approximates that of N. equitans (i.e. all organisms under study are relatively AT-rich), so as to minimize the GC-compositional effect on codon as well as amino acid usage. To compare with other host-associated organisms, sequences from seven obligatory parasitic/symbiotic microorganisms were also retrieved. In order to reduce sampling errors, annotated genes with fewer than 100 codons were excluded from the analysis. The presumed duplicates, genes for transposase and integrase, and genes with internal stop codons and/or untranslatable codons were also excluded.
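The length and codon-validity exclusions described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' pipeline: coding sequences are taken to be in-frame DNA strings, the standard stop codons (TAA, TAG, TGA) are assumed, the terminal stop codon is not counted towards the 100-codon threshold, and any codon containing a base other than A, C, G or T is treated as untranslatable. The name `keep_orf` is hypothetical, and the removal of duplicates and of transposase/integrase genes is not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed standard stop codons
VALID_BASES = set("ACGT")


def keep_orf(cds, min_codons=100):
    """Return True if a coding sequence passes the exclusion criteria."""
    cds = cds.upper()
    # A length that is not a multiple of three cannot be read in frame.
    if len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Drop a terminal stop codon before counting (an assumption).
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    if len(codons) < min_codons:
        return False
    for codon in codons:
        if codon in STOP_CODONS:          # internal stop codon
            return False
        if not set(codon) <= VALID_BASES:  # untranslatable codon
            return False
    return True
```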
To determine the extent of base compositional bias, nucleotide frequencies at all three codon positions were calculated for protein coding sequences. The purine content at nonsynonymous codon positions (R1+2) and at synonymous third codon positions (R3S) was calculated for each coding sequence of the seven mesophilic organisms, the seven hyperthermophilic organisms and N. equitans. Purine-pyrimidine skew was computed using sliding windows of 0.1 kb on the genomic sequence of N. equitans. The GC-content and the purine content of the structural RNA sequences were calculated for N. equitans and for the seven hyperthermophilic and seven mesophilic organisms under study.
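A minimal sketch of these compositional indices follows. Several details are assumptions rather than the study's definitions: R3S is computed over third positions of all codons except ATG (Met), TGG (Trp) and the stop codons, the skew is defined as (R - Y)/(R + Y), and windows advance by a full window length unless `step` is changed. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
# Codons with no synonymous alternative in the standard code, plus stops;
# these are skipped when computing the third-position index (assumption).
NO_SYNONYM = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def purine_content(cds):
    """Return (R1+2, R3S) as fractions for one in-frame coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    pos12 = [base for codon in codons for base in codon[:2]]
    pos3 = [codon[2] for codon in codons if codon not in NO_SYNONYM]
    r12 = sum(b in PURINES for b in pos12) / len(pos12) if pos12 else 0.0
    r3s = sum(b in PURINES for b in pos3) / len(pos3) if pos3 else 0.0
    return r12, r3s


def purine_skew_windows(seq, window=100, step=100):
    """Purine-pyrimidine skew (R - Y)/(R + Y) per window of `window` bases."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        r = sum(b in PURINES for b in chunk)
        y = sum(b in PYRIMIDINES for b in chunk)
        skews.append((r - y) / (r + y) if r + y else 0.0)
    return skews
```

With `window=100` the skew is evaluated on 0.1 kb segments; a smaller `step` gives overlapping windows.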
Correspondence Analysis (COA) was performed using the program CODONW 1.4.2 to identify the major factors influencing the variation in relative synonymous codon usage (RSCU) and amino acid frequencies. COA was also carried out on absolute frequencies of synonymous codons in order to avoid introducing other biases. These analyses generate a series of orthogonal axes to identify trends that explain the variation within a dataset, with each subsequent axis explaining a decreasing amount of the variation.
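For orientation, the RSCU values that serve as input to the COA can be computed as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below assumes the standard genetic code and is not a substitute for CODONW; the function name `rscu` is hypothetical, and amino acids absent from the input are simply omitted from the result.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} pooled over a list of in-frame DNA sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    # Group sense codons into synonymous families.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        # Skip single-codon amino acids (Met, Trp) and unused families.
        if len(codons) < 2 or total == 0:
            continue
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values
```

An RSCU of 1 indicates no bias for that codon; values above 1 indicate a codon used more often than expected under equal usage.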
Orthologous sequences between N. equitans and the seven mesophilic organisms under study were identified using the BlastP program. Orthologs were defined as those with at least 60% similarity and less than 20% difference in length. The amino acid sequences of 105 orthologous genes were aligned using the pairwise alignment program (ClustalW), and the amino acid replacements were obtained in the form of a matrix, using a program developed in-house in Visual Basic. For a given pair of amino acids, the "forward" direction was taken as the more common of the two replacements in the conversion of mesophilic proteins to N. equitans proteins. To assess the significance of the directional bias, if any, replacement values were compared by 2 × 2 contingency tables having 1 degree of freedom. For each pair of replacements, the first and second rows of the contingency table represented the number of replacements from one particular residue (say, i) to another (say, j) of the pair and the total count of the remaining replacements (say, k) from the residue i (where k ≠ j) respectively.
The prediction of protein secondary structure was performed using the GOR IV algorithm, and the disordered regions within proteins were predicted using GlobPlot. SMART and TMHMM2.0, available at the ExPASy Proteomics Server, were used to detect the proteins likely to be secreted or localized to the cell surface.
Indices such as the total number of occurrences of each codon, RSCU, codon adaptation index (CAI), amino acid frequencies, average hydrophobicity (Gravy score) [34, 47], aromaticity, aliphatic index and mean molecular weight (MMW) of protein coding sequences were calculated to identify the factors influencing codon and amino acid usage. The CAI was calculated for N. equitans genes with respect to the RSCU values of the genes for ribosomal proteins (≥ 100 aa). The isoelectric point (pI) of each predicted protein was calculated using the ExPASy proteomics server.
Modeled structures were generated for the elongation factor Tu (EF-Tu) of N. equitans and the cell division cycle (CDC) family protein from both N. equitans and M. maripaludis (NEQ475 in N. equitans and MMP0176 in M. maripaludis) by using the First Approach Mode at the Swiss-Model protein structure homology modeling server. The surface charge distributions were mapped onto the predicted surface using the program MOLMOL. Total surface charge was calculated using the Biomolecule module on an Insight II workstation. Comparisons were made between N. equitans EF-Tu and E. coli EF-Tu (BlastP value 1e-45) and between the CDC proteins from N. equitans (NEQ475) and M. maripaludis (MMP0176) (BlastP value 0.0).
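The directional-bias test described above can be sketched as a Pearson chi-square test on a 2 × 2 table. The layout used here is one reading of the description and should be treated as an assumption: the first row holds the replacement counts i→j and j→i, and the second row holds the remaining replacements from i and from j. No continuity correction is applied, and the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom) for the table [[a, b], [c, d]].

    Assumed layout: a = count(i -> j), b = count(j -> i),
    c = remaining replacements from i, d = remaining replacements from j.
    Returns (chi2, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

The statistic is unchanged if rows and columns are swapped, so the alternative reading of the table orientation gives the same result.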
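The CAI calculation against a ribosomal-protein reference set can be sketched as follows. This is an illustration, not the CODONW implementation: it assumes the standard genetic code, defines the relative adaptiveness of a codon as its count in the reference set divided by the count of the most used synonymous codon, and replaces zero counts with an assumed pseudocount of 0.5 so that the logarithm is defined. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codons_of(cds):
    """List the recognizable codons of an in-frame DNA sequence."""
    cds = cds.upper()
    found = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return [c for c in found if c in CODON_TABLE]


def relative_adaptiveness(reference_cds, pseudocount=0.5):
    """Return {codon: w} from a reference set (e.g. ribosomal protein genes)."""
    counts = Counter(c for cds in reference_cds for c in codons_of(cds))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    weights = {}
    for codons in families.values():
        if len(codons) < 2:  # Met and Trp carry no synonymous choice
            continue
        fam = {c: (counts[c] if counts[c] > 0 else pseudocount) for c in codons}
        top = max(fam.values())
        for c, n in fam.items():
            weights[c] = n / top
    return weights


def cai(cds, weights):
    """Geometric mean of the relative adaptiveness over the codons of a gene."""
    logs = [math.log(weights[c]) for c in codons_of(cds) if c in weights]
    if not logs:
        return float("nan")
    return math.exp(sum(logs) / len(logs))
```

In use, `relative_adaptiveness` would be called once on the reference genes and the resulting weights passed to `cai` for every gene of interest.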
CAI: Codon adaptation index
OGT: Optimal growth temperature
ORF: Open Reading Frame
GC3S: GC-content at synonymous codon position
R1+2: Purine content at first and second codon positions
R3S: Purine content at synonymous codon position
MMW: Mean molecular weight
RSCU: Relative synonymous codon usage
We are grateful to Dr. B. Achari, Emeritus Scientist, Indian Institute of Chemical Biology, Kolkata, India, for critical reading of the manuscript. This work was supported by the Council of Scientific and Industrial Research (Project No. CMM 0017) and Department of Biotechnology, Government of India (Grant Number BT/BI/04/055-2001).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.