Distribution of nitrogen fixation and nitrogenase-like sequences amongst microbial genomes

Background The metabolic capacity for nitrogen fixation is known to be present in several prokaryotic species scattered across taxonomic groups. Experimental detection of nitrogen fixation in microbes requires species-specific conditions, making it difficult to obtain a comprehensive census of this trait. The recent and rapid increase in the availability of microbial genome sequences affords novel opportunities to re-examine the occurrence and distribution of nitrogen fixation genes. The current practice for computational prediction of nitrogen fixation is to use the presence of the nifH and/or nifD genes.
Results Based on a careful comparison of the repertoire of nitrogen fixation genes in known diazotroph species we propose a new criterion for computational prediction of nitrogen fixation: the presence of a minimum set of six genes coding for structural and biosynthetic components, namely NifHDK and NifENB. Using this criterion, we conducted a comprehensive search in fully sequenced genomes and identified 149 diazotrophic species, including 82 known diazotrophs and 67 species not known to fix nitrogen. The taxonomic distribution of nitrogen fixation in Archaea was limited to the Euryarchaeota phylum; within the Bacteria domain we predict that nitrogen fixation occurs in 13 different phyla. Of these, seven phyla had not hitherto been known to contain species capable of nitrogen fixation. Our analyses also identified protein sequences that are similar to nitrogenase in organisms that do not meet the minimum-gene-set criteria. The existence of nitrogenase-like proteins lacking conserved co-factor ligands in both diazotrophs and non-diazotrophs suggests their potential for performing other, as yet unidentified, metabolic functions.
Conclusions Our predictions expand the known phylogenetic diversity of nitrogen fixation, and suggest that this trait may be much more common in nature than is currently thought. The diverse phylogenetic distribution of nitrogenase-like proteins indicates potential new roles for anciently duplicated and divergent members of this group of enzymes.


Background
Biological nitrogen fixation is the major route for the conversion of atmospheric nitrogen gas (N2) to ammonia [1]. However, this process is thought to be limited to a small subset of prokaryotes named diazotrophs, which have been identified in diverse taxonomic groups [2]. This biochemical pathway is only manifested when species-specific metabolic and environmental conditions are met, thus making it difficult to develop a standard screen for detection of this biological reaction [3,4]. The complications in experimentally detecting nitrogen fixation may be a reason for the relatively low number and relatively sparse distribution of known diazotrophic species.
All known diazotrophs contain at least one of the three closely related sub-types of nitrogenase: Nif, Vnf, and Anf. Despite differences in their metal content, these nitrogenase sub-types are structurally, mechanistically, and phylogenetically related. Their catalytic components include two distinct proteins: dinitrogenase (comprising the D and K component proteins) and dinitrogenase reductase (the H protein) [1,2]. The only known exception to this rule is the superoxide-dependent nitrogenase from Streptomyces thermoautotrophicus, whose protein sequence is unknown [5].
The best studied sub-type is the molybdenum-dependent (Mo-dependent) nitrogenase, the structural components of which are encoded by nifH, nifD, and nifK [1]. The two other sub-types of nitrogenase, known as alternative nitrogenases, are enzyme homologs with the exception of an additional subunit (G) in the dinitrogenase component and the absence of the heteroatom Mo. The vanadium-dependent nitrogenases are encoded by vnfH, vnfD, vnfG, and vnfK. The members of the third sub-type, the iron-only nitrogenases, are devoid of Mo and V, and their components are products of anfH, anfD, anfG, and anfK. High levels of protein sequence identity among analogous subunits across the nitrogenase sub-types allow investigation of the biodiversity in nitrogen fixation using NifH (similar to VnfH and AnfH) and/or NifD (similar to VnfD and AnfD) as markers. Most phylogenetic studies of nitrogen fixing organisms have used only NifH and/or NifD sequences as queries to assess diversity [4,6-8].
The high level of complexity of nitrogenase metalloclusters results in a laborious pathway for the assembly and insertion of the active site metal-cofactor, FeMoco, into dinitrogenase. Apart from the catalytic components, additional gene products are required to produce a fully functional enzyme [9]. Although the number of proteins involved in the activation of nitrogenase seems to be species-specific and varies according to the physiology of the organism and environmental niche [10,11], so far over a dozen genes have been identified as being involved in this process. Despite variations in the precise inventory of proteins required for nitrogen fixation, it is well acknowledged that the separate expression of the catalytic components is not enough to sustain nitrogen fixation, thus indicating that the FeMoco biosynthetic enzymes play a crucial role in dinitrogenase activation [12].
In the last few years, substantial advances have been made in the functional assignment of individual gene products involved in the biosynthesis of FeMoco in Azotobacter vinelandii [9,12,13]. The current biosynthetic scheme involves a consortium of proteins that assembles the individual components, iron and sulfur, into Fe-S cluster modules for subsequent transformation into precursors of higher nuclearity, and addition of the heteroatom (Mo) and organic component (homocitrate). The synthesis of FeMoco is completed in a so-called scaffold protein, NifEN, and shuttled to the final target by cluster carrier proteins. Interestingly, the scaffold NifEN has amino acid sequence similarity to NifDK [14].
The recent growth of genomic databases, which now include nearly 2,000 completed microbial genomes, motivated us to re-evaluate the diversity of species capable of nitrogen fixation. Identification of co-occurrence of nitrogen fixing genes in species known to fix nitrogen enabled us to identify novel potential diazotrophs based on their genetic makeup. Our findings expand the expected occurrence of nitrogen fixation and the biodiversity of diazotrophs. In addition, we have identified a large number of phylogenetically diverse nitrogenase-like proteins that may represent ancestral forms of the enzyme and may have evolved to perform other metabolic functions.

Species containing NifD and NifH-like sequences
The rapid expansion of microbial genome sequencing in the last few years affords novel opportunities to re-examine the distribution of nitrogen fixation genes. In this work, we searched the fully sequenced microbial genomes available in GenBank [15] for coding sequences similar to NifD and NifH. The initial search included 1002 distinct archaeal and bacterial species with fully sequenced genomes, 174 of which contained sequences similar to NifH as well as sequences similar to NifD. Literature searches on these species indicated that nitrogen fixation has not been experimentally demonstrated in more than half of these (92 out of 174), thus suggesting that the phylogenetic distribution of diazotrophs is wider than currently known. Based on the literature survey (Additional file 1: Table S1), we classified species with hits into two categories: (1) known diazotrophs, with experimental demonstration, and (2) potential diazotrophs, with no reports of experimental demonstration. Interestingly, during this literature search we found three recent reports providing experimental demonstration of diazotrophy motivated by an initial genomic identification of putative nitrogen fixation genes [16-18].

Identification of a minimum gene set
The crucial involvement of the FeMoco biosynthesis enzymes prompted us to analyze the occurrence of nine additional nif genes in known diazotrophic species encoding NifK, NifE, NifN, NifB, VnfG, NifQ, NifV, NifS, and NifU. The involvement of eight of these proteins in FeMo-cofactor synthesis and nitrogenase maturation has been determined [3,9,12]. The co-occurrence of additional nif genes varied from species to species [19,20]. These differences in genetic requirements most probably reflect variations in meeting the physiological demands associated with nitrogen fixation and in species-specific metabolic and environmental life styles. Nevertheless, the identification of relevant hits (listed in the Additional file 2: Table S2) revealed that nearly all known diazotrophs contain a minimum of six conserved genes: nifH, nifD, nifK, nifE, nifN, and nifB ( Figure 1). The co-occurrence of these six nif genes, known to be essential for nitrogen fixation in characterized systems, has led us to propose a requirement for a minimum gene set for nitrogen fixation that can be used as an in silico search tool for the identification of additional diazotrophs. We did find a few exceptions to this minimum gene set rule, and they are discussed below.
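The minimum gene set criterion can be illustrated with a short sketch. It assumes that sequence-similarity searches have already been run and that each genome is summarised as a list of nif gene names; the function names and the two example genomes are hypothetical and are not data from this study.

```python
# The six genes of the proposed minimum set (from the text above).
MINIMUM_GENE_SET = frozenset({"nifH", "nifD", "nifK", "nifE", "nifN", "nifB"})


def missing_minimum_genes(annotated_genes):
    """Return the minimum-set genes that are absent from a genome's gene list."""
    return MINIMUM_GENE_SET - set(annotated_genes)


def meets_minimum_gene_set(annotated_genes):
    """True when all six minimum-set genes are present."""
    return not missing_minimum_genes(annotated_genes)


# Hypothetical genomes, for illustration only.
genome_a = ["nifH", "nifD", "nifK", "nifE", "nifN", "nifB", "nifV"]
genome_b = ["nifH", "nifD", "nifK", "nifB"]  # lacks nifE and nifN

print(meets_minimum_gene_set(genome_a))         # True
print(sorted(missing_minimum_genes(genome_b)))  # ['nifE', 'nifN']
```

Reporting the missing genes, rather than only a yes/no answer, makes it easier to spot cases such as the NifN-lacking strains treated as exceptions in this study.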
Our investigation showed that a clustered genomic arrangement of nif genes was a recurring feature in known diazotrophic genomes. In several species the minimum gene set was located in a single genomic region. In all cases, at least three out of the six genes contained in the minimum set were in contiguous gene regions. Most often, nifHDK were clustered, but in some other cases, nifDK was adjacent to nifEN. Nevertheless, the genomic synteny of nif genes across nitrogen-fixing species facilitated in silico assignments of putative sequences involved in nitrogen fixation.
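The clustering observation (at least three of the six minimum-set genes in a contiguous region) can be checked along the following lines. This is a minimal sketch under a simplifying assumption: "contiguous" is taken to mean that no other gene lies between the minimum-set genes in the annotated gene order, which may be stricter than the definition applied in the analysis. The function name and the example locus are hypothetical.

```python
MINIMUM_GENE_SET = frozenset({"nifH", "nifD", "nifK", "nifE", "nifN", "nifB"})


def largest_contiguous_cluster(gene_order):
    """Largest number of distinct minimum-set genes found in one
    uninterrupted run along an ordered list of gene names."""
    best = 0
    current = set()
    for gene in gene_order:
        if gene in MINIMUM_GENE_SET:
            current.add(gene)
            best = max(best, len(current))
        else:
            current = set()  # any other gene breaks the run
    return best


# Hypothetical gene order along one chromosome, for illustration only.
locus = ["geneX", "nifH", "nifD", "nifK", "geneY", "nifE", "nifN", "geneZ", "nifB"]
print(largest_contiguous_cluster(locus))  # 3
```

In this example nifHDK form the largest run, so the locus would satisfy the "at least three contiguous genes" observation even though nifEN and nifB are located elsewhere.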

Identification of new diazotrophs
We identified potential diazotrophic species by computational searches using the minimum gene set (Additional file 2: Table S3). We identified 92 species containing coding sequences similar to NifD and NifH, 67 of which met the minimum gene set criteria (i.e. their genome contained at least nifH, nifD, nifK, nifE, nifN, and nifB). Based on gene content, we propose that these 67 species have the capacity for nitrogen fixation.

Biodiversity of nitrogen fixing species
The taxonomic distribution of diazotrophs identified through computational assignment suggests that nitrogen fixation has greater biodiversity than previously known. Prior to this work, known bacterial diazotrophs were found in six taxonomic phyla: Actinobacteria, Chlorobi, Chloroflexi, Cyanobacteria, Firmicutes and Proteobacteria (Figure 2, gray bars). Our study resulted in the identification of potential diazotrophs within the already identified phyla and added seven new phyla (Figure 2, black bars). Thus, despite the availability of few representatives in these other seven phyla (Figure 2), applying the minimum gene set criteria has expanded the biodiversity of this metabolic trait by approximately two-fold. No potential diazotrophs were identified in Acidobacteria (5), Deinococcus-Thermus (13), Dictyoglomi (2), Elusimicrobia (1), Fibrobacteres (1), Gemmatimonadetes (1), Planctomycetes (5), Synergistetes (2), Tenericutes (29), Thermotogae (12), Thermodesulfobacteria (3), and Thermomicrobia (1) (in parentheses, the number of species in each group with fully sequenced genomes). The lack of diazotrophs within these phyla could be attributed to the under-representation of sequenced genomes in these taxonomic groups. Unlike bacterial species, nitrogen fixation in Archaea is contained only within the phylum Euryarchaeota, where we identified seven species as potential diazotrophs.

Figure 1 Genes involved in nitrogen fixation. Top: A. vinelandii nif gene regions. Gray-shaded trapezoids are essential genes in Mo-dependent nitrogen fixation that were used as queries for the in silico identification of nitrogen fixing species described in this study. Bottom: The proposed minimum set of genes required for nitrogen fixation. All species with sequenced genomes that are known diazotrophs and all the species proposed to be diazotrophs based on genetic content contain the minimum gene set.

Figure 2 Taxonomic diversity of nitrogen fixing species. Species with fully sequenced genomes (999 Bacteria and 93 Archaea genomes) were analyzed for the minimum set of nitrogen fixation ortholog genes. Taxonomic distribution of diazotrophic species based on experimental evidence (gray bars) and in silico prediction of nitrogen fixation (black bars) is displayed by phylum. The ratio of the number of proposed species versus the number of total distinct species with sequenced genomes within each phylum is indicated.

Sporadic occurrence of alternative nitrogenase
The presence of an additional subunit, AnfG or VnfG (Additional file 2: Table S2, Additional file 2: Table S3), and distinct sequence features of alternative nitrogenases allowed us to distinguish the Mo-dependent enzymes from the alternative systems [3,21]. The genomes of most diazotrophs encode only one copy of the Mo-dependent sub-type of nitrogenase (134 out of 149 species). Exceptions were species containing additional sub-types (Vnf and/or Anf), such as the well-studied A. vinelandii and Rhodopseudomonas palustris, as well as Dickeya dadantii, Chloroherpeton thalassium, Methanobacterium sp., Paludibacter propionicigenes, Rhodomicrobium vannielii, and Syntrophobotulus glycolicus. Unexpectedly, selected Alphaproteobacteria species, including Rhizobium etli and Sinorhizobium fredii, encoded two putative copies of Mo-dependent nitrogenase, where one copy of nifHDK is clustered with nifEN and the other copy only has genes similar to the catalytic components nifHDK. As previously proposed [10], alternative nitrogenases were only found in species containing genes coding for the Mo-dependent enzyme. This finding suggests that the hierarchy of expression of Mo-dependent over alternative nitrogenase, observed in A. vinelandii, may be universal to all species containing alternative nitrogenases [10].

Phylogenetically distinct NifDK enzymes are present in thermophilic strains lacking a defined FeMoco biosynthesis pathway
Our analysis of nif gene content revealed 28 strains that did not meet the minimal gene set criteria because they lacked either NifN or both NifE and NifN. Nevertheless, some of the hyperthermophilic representatives of this class, for example, the deep-sea vent archaeon Methanocaldococcus sp. FS406-22, have been demonstrated to fix nitrogen [22]. To further analyse the properties of the putative nitrogenases encoded by this class, we examined the environment of the FeMoco ligands in 15 NifD proteins, which we refer to collectively as group C. NifDK homologs belonging to this group possess the conserved Cys residues required for liganding a P cluster, and the NifD component contains the FeMoco ligands αCys275 and αHis442. The NifD subunits also contain the equivalents of αGln191 and αHis195 that are important for nitrogen reduction, and in addition, the homocitrate "anchor ligand" αLys426. Previous analysis identified two distinct subfamilies of NifD proteins (indicated as A and B in Figure 3) characterised by distinctive sequences surrounding their FeMoco ligands at αCys275 and αHis442 [23]. Group C represent a third subfamily, containing Gln at position 276, Asp at position 440 and lacking a residue corresponding to the aromatic amino acid found at position 444 in the A and B subfamilies ( Figure 3). Sequences in the C group are distinct from the alternative nitrogenase VnfD and AnfD subunits, which contain a conserved Ala at position 276, and a His residue replacing an acidic amino acid at position 445 (indicated as Group V in Figure 3).
The division of NifDK into three primary lineages, distinct from AnfD/VnfD/AnfK/VnfK is supported by phylogenetic analysis ( [24] and Additional file 3: Figure S1). The existence of two lineages within conventional NifDK proteins has been shown to correlate with the domain structure of NifB in Bacterial and Archaeal proteins [25]. The third lineage (denoted as C in Additional file 3: Figure S1), entirely comprised of representatives of the Archaea and Firmicutes, appears to correlate with the absence of NifN and the sequence environment of the cofactor ligands in NifD. Notably the NifDK homologs in this lineage are all derived from thermophiles with the exception of Methanococcus aeolicus Nankai-3, which possesses both NifE and NifN. Two other NifDK sequences listed in the C group (Additional file 2: Table S3) are derived from the diazotrophic methanogens, Methanobacterium thermoautotrophicum Delta H, and Methanococcus maripaludis S2, which also encode nifE and nifN. The latter two NifDK proteins belong to a distinct group (labelled M in Additional file 3: Figure S1) that is considered to have emerged before all other nitrogenase proteins [24]. Thermophilic Roseiflexus species that lack both NifE and NifN also belong to a separate phylogenetic group (labelled R in Additional file 3: Figure S1). In conclusion, there is evidence for nitrogen fixation in species lacking nifN, but this appears to be associated with a thermophilic lifestyle and the presence of a phylogenetically distinct form of nitrogenase. Although this represents a clear exception to the minimal gene set, it appears to be a special case connected with the need to fix nitrogen in extreme environments.

Nitrogenase-like sequences
During our search for nitrogenases we encountered a large number of proteins that appeared to be distantly related to the alpha and beta subunits of nitrogenase, but nevertheless belong to the Pfam nitrogenase component 1 type oxidoreductase family (PF00148). This Pfam family currently contains 2561 sequences, although a large proportion of these show similarity to the B and N subunits of the light-independent chlorophyllide reductase (DPOR), which is structurally related to the MoFe protein of nitrogenase. This enzyme does not contain a heterometal cluster analogous to FeMoco within its active site, and the co-ordination of the [4Fe4S] "NB" cluster within DPOR is different to that of the [8Fe7S] P cluster in nitrogenase [26]. After removal of DPOR-related sequences from our analysis by running a BLAST search against ChlB, BchB, ChlN and BchN, we observed that NifDK paralogs are represented in both diazotrophic and non-diazotrophic strains. Phylogenetic analysis of the BLAST-filtered subset revealed distinct groupings that are clearly divergent from conventional nitrogenase (Figure 4). These outgroups are also distinct from the DPOR enzymes, which form a separate clade (not shown in Figure 4). The existence of an outgroup of nitrogenase homologs (termed Group IV) has been noted previously [27], but the current availability of genome sequences has enabled more extensive analysis. It is highly unlikely that any of these nitrogenase-like proteins are competent to reduce dinitrogen as they lack ligands required to co-ordinate FeMoco.
Representatives of these non-conventional enzymes cluster in distinct clades relative to the conventional NifDKEN, Vnf/AnfDK and the C-group DK proteins, which are coloured dark blue in Figure 4. The genes encoding these non-conventional proteins are adjacent in genomes and have the potential to encode the alpha and beta subunits of nitrogenase-like enzymes. The lineages coloured either green or yellow in Figure 4 comprise groups of NifE or NifN related proteins that each contain the three conserved Cys residues involved in liganding the P cluster. The NifE-related subunits of partners coloured in green possess the FeMoco-ligand Cys275, but lack the highly conserved co-factor ligand, His 442. Those coloured in yellow lack both FeMoco ligands. It is possible that these proteins ligand an [4Fe-4S] cluster in a similar location to the P cluster in nitrogenase that delivers electrons to the active site. By analogy to NifEN, these enzymes may be able to reduce substrates with a limited number of electrons such as acetylene and azide [28]. These orthologs are found in diverse organisms, including the Proteobacteria, Archaea, Firmicutes and Fibrobacteres. Some organisms have an unusually large number of nitrogenase-like proteins of this class. For example, Syntrophobotulus glycolicus DSM 8271 contains nine protein pairs related to the alpha and beta subunits of nitrogenase. In two cases, these are organised as four linked genes (Sgly_0993, Sgly_0994, Sgly_0995 Sgly_0996 and Sgly_2775, Sgly_2776, Sgly_2777 and Sgly_2778) potentially located in operons, suggesting that some of these gene pairs may provide scaffolding functions for co-factor assembly into the structural subunits, analogous to the nifDKEN gene clusters encoding conventional nitrogenase.
More diverse representatives of the nitrogenase-like sequences are found in the Archaea and Firmicutes. These proteins lack FeMoco ligands and contain a variable number of conserved cysteine residues that may ligand a [Fe-S] cluster. For example, Clostridium botulinum strains and Alkaliphilus oremlandii encode NifEN-like sequences (coloured light blue in Figure 4) that are located downstream of genes encoding NifH and a potential ATPase component of the ABC transporter family. Their NifE-related components (CLM_0808 and Clos_0313) contain the three conserved P cluster ligands, but conserved Cys residues are not present in the NifN-like components (CLM_0809 and Clos_0314). In contrast, Methanocorpusculum labreanum Z and Desulfitobacterium hafniense DCB-2 encode proteins with two conserved Cys residues (corresponding to αC88/αC62 and αC154/αC124) in the NifD/E-related components (Mlab_1040 and Dhaf_1539) and only a single conserved Cys residue (corresponding to βC95/βC44) in the NifK/N related subunits (Mlab_1039 and Dhaf_1540). Representative species from the Human Microbiome project, including Coprococcus catus GD/7 and Dorea longicatena DSM 13814, also appear in these clades (coloured red in Figure 4) and possess nitrogenase-like sequences with a similar arrangement of conserved cysteines. These organisms encode two closely linked copies of NifHEN-like sequences in their genomes. It is possible that a residue other than cysteine serves to coordinate an [Fe-S] cluster in representatives of these clades, as observed in the case of DPOR, which utilises an aspartate residue as a cluster ligand [26].

Figure 4 Maximum-likelihood phylogenetic tree of conventional nitrogenases and nitrogenase-like sequences. The tree is represented by a core set of 73 sequences, selected from a larger tree of 472 sequences. Shimodaira-Hasegawa local support values were >0.6 except for those nodes marked with a red star. The clade coloring reflects sequences that are co-located in genomes and likely to correspond to the alpha and beta subunits of nitrogenase, with the exception of those shown in light gray, which are single subunit enzymes (NflD). Dark blue clades are conventional nitrogenases, labeled as NifD/E and NifK/N respectively. Clades colored in light-green are NifD/E and NifK/N-like sequences in which the FeMoco ligand Cys 275 in the alpha component is either present (dark green nodes) or absent (yellow nodes). In all other cases known FeMoco ligands are absent. The number of conserved Cys residues in each subunit that correspond to P cluster ligands in conventional nitrogenases is indicated for each clade.
A variation in the arrangement of the subunits in these nitrogenase-like sequences is observed in some representatives of the Archaea, Firmicutes and Deltaproteobacteria, whereby nifH and nifE-like genes are fused to form a single open reading frame that is followed by a nifN-like gene (data not shown). In contrast, several representatives of the Archaea possess only a single gene encoding a homolog of the alpha and beta chains of nitrogenase (e.g. Metvu_0736, MpaI_0679 and Mbur_1037) (coloured grey in Figure 4). These form part of the outgroup identified by Raymond et al. [27] and are designated as NflD. These single subunit enzymes contain conserved Cys residues (corresponding to αC88/αC62 and αC154/αC124 in NifD/E) and are frequently annotated as putative methanogenesis marker 13 metalloproteins, which are thought to function in methanogenesis.

Discussion
Biological nitrogen fixation is thought to be one of the most ancient enzyme-catalyzed reactions [27]. The elaborate architecture of its catalyst, which supports a complex reaction mechanism for dinitrogen reduction, has long been the subject of interest, not only from the viewpoint of evolutionary perspective and system complexity, but also as a fundamental biological process that can be exploited to develop new strategies for agricultural soil fertilization. The unpredictable occurrence of this metabolic trait across taxonomic groups, combined with the challenge of experimental detection of nitrogen fixation, makes it difficult to obtain a comprehensive census of prokaryotes with the capacity for diazotrophy.
The universal presence of gene sequences coding for the nitrogenase catalytic components in diazotrophs (nifH and nifD) is commonly used as a search tool in many phylogenetic studies. However, when using a single-gene survey in the database of microbial sequenced genomes, we detected orphan false-positive hits in several non-diazotrophic genomes. For example, the Methanobrevibacter ruminantium M1 and Methanocaldococcus fervens AG86 genomes include only a sequence similar to NifH, while the Methanosphaera stadtmanae DSM 3091 genome contains only a NifD-like sequence. In this case orphan nifD-like sequences may be evolutionary relics of divergent enzymes in which the NifD/E component does not contain conserved FeMoco ligands (see below). Thus genome analysis of environmental samples based purely on BLAST hits to NifH or NifD may lead to false indications of diazotrophy. To eliminate hits from orphan sequences our initial approach was to search in silico for the co-occurrence of NifH and NifD and then subsequently filter these hits for the occurrence of other nitrogen fixation protein sequences.
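The two-step filtering described above (co-occurrence of NifH and NifD first, then the remaining nitrogen fixation genes) can be sketched as a simple triage. The sketch assumes each genome has already been reduced to a set of gene names with significant hits, which glosses over the similarity thresholds used in practice. The two named genomes carry the single hits reported in the text; the two "hypothetical" genomes, the category labels and the function name are illustrative assumptions.

```python
MINIMUM_GENE_SET = frozenset({"nifH", "nifD", "nifK", "nifE", "nifN", "nifB"})


def triage(hits):
    """Classify one genome from the set of nif-like hits found in it."""
    hits = set(hits)
    has_h = "nifH" in hits
    has_d = "nifD" in hits
    if has_h and has_d:
        # Step 2: require the full minimum gene set.
        if MINIMUM_GENE_SET <= hits:
            return "predicted diazotroph"
        return "NifH+NifD only"
    if has_h or has_d:
        # Step 1 fails: a single marker without its partner.
        return "orphan hit"
    return "no hit"


genomes = {
    "Methanobrevibacter ruminantium M1": {"nifH"},
    "Methanosphaera stadtmanae DSM 3091": {"nifD"},
    "hypothetical genome 1": {"nifH", "nifD", "nifK", "nifB"},
    "hypothetical genome 2": {"nifH", "nifD", "nifK", "nifE", "nifN", "nifB"},
}
for name, hits in genomes.items():
    print(name, "->", triage(hits))
```

Keeping "orphan hit" and "NifH+NifD only" as separate outcomes mirrors the distinction drawn in the text between single-marker false positives and genomes that pass the co-occurrence step but fail the minimum gene set.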
Many previous studies have focussed on NifH and NifD sequences as markers for the phylogenetic distribution of diazotrophs. However, BLAST searches at relatively low threshold identified nitrogenase-like sequences lacking FeMoco ligands (Figure 4). False positives can therefore be obtained if only NifH and NifD are used in the search criteria. Extending the gene set to NifHDK or even to NifHDKB can also give rise to false positives, because sequences similar to the α and β subunits of nitrogenase can be associated with NifH-like and NifB-like genes (Additional file 4: Figure S2). The strict requirement of a separate set of proteins involved in the assembly and synthesis of the active site cofactor, FeMoco, provides strong indication that the presence of nifH and nifD coding sequences alone does not provide enough evidence for diazotrophy. Therefore, our rationale was first to determine the inventory of nif genes that were always present in known diazotrophic species. Literature searches combined with BLAST analyses led to the proposal that nitrogen fixation requires at least six gene products (Figure 1). Using this criterion, we found 67 species that we hypothesize have the metabolic capacity for nitrogen fixation. Our computational assignments provide a good indication that these species are potential diazotrophs and give direction to experimentalists to validate these predictions.
Our in silico assignments predict that nearly 15% of prokaryotic species with sequenced genomes are either known or potential diazotrophs, a fraction much larger than commonly accepted. The biased distribution of sequenced genomes in relation to taxonomic groups probably undermines a robust evaluation of the taxonomic diversity of nitrogen fixation in nature. For example, the phylum Proteobacteria has 409 genomes from distinct species, while Thermomicrobia is represented by only one. Efforts towards detailed functional assignments of biochemical pathways were also compatible with our findings. The SEED database [29] lists the occurrence of 20 nif genes in 45 unique species, and in all cases the minimum gene set is present. Almost all of these species are included in this study, the only exception being Magnetospirillum gryphiswaldense, which was not in the NCBI database of completed sequenced genomes at the time this study was completed. It is probable that nitrogen fixation also occurs in many other diverse species in which phyla are underrepresented in current databases. Therefore, applying the minimum gene set to newly sequenced genomes as they become available can lead to the identification of many other diazotrophs and further expand the diversity of diazotrophs in terms of taxonomic distribution of this metabolic trait.
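The "nearly 15%" figure can be cross-checked from the counts reported in the Results. The short calculation below uses only numbers stated in the text and takes the 1002 distinct species of the initial search as the denominator, which is an assumption about how the percentage was derived; the variable names are illustrative.

```python
# Counts reported in the text.
total_species = 1002           # distinct species in the initial search
with_nifH_and_nifD = 174       # species with both NifH-like and NifD-like hits
not_experimentally_shown = 92  # of those, no experimental demonstration
predicted_new = 67             # potential diazotrophs meeting the minimum gene set

known = with_nifH_and_nifD - not_experimentally_shown  # known diazotrophs
all_diazotrophs = known + predicted_new                 # known plus predicted
fraction = all_diazotrophs / total_species

print(known, all_diazotrophs, f"{fraction:.1%}")  # 82 149 14.9%
```

The totals agree with the 82 known and 149 overall diazotrophic species stated in the abstract.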
Our study revealed a set of species for which our criteria for in silico prediction of nitrogen fixation were not satisfied, as they lack NifEN but nevertheless retain the nitrogenase structural genes together with nifB and nifV. Paradoxically, recent phylogenetic analysis suggests that NifDK homologs present in strains lacking NifN, such as Caldicellulosiruptor saccharolyticus, Candidatus Desulforudis audaxviator and Methanocaldococcus sp. FS406-22, emerged after the ancestral Mo enzymes found in hydrogenotrophic methanogens such as M. maripaludis, which have a complete FeMoco assembly pathway represented by early branching lineages of NifE and NifN [24,25]. Nevertheless, the uncharacterised nitrogenases belonging to the C group appear to have evolved prior to the emergence of most NifDK homologs in both Archaea and Bacteria. Our studies indicate that although the catalytic components contain structural motifs competent to coordinate FeMoco, these proteins have a distinct environment surrounding their co-factor ligands, which may confer unique maturation or catalytic properties. The presence of diazotrophic species within this group suggests that these nitrogenases may have distinct characteristics that permit a more parsimonious mechanism for FeMoco assembly. Without exception, organisms in the C-group that lack either NifN or NifEN are thermophiles inhabiting diverse environmental niches. Biochemical studies that mimic the absence of NifEN demonstrate that a NifDK enzyme containing NifB-co rather than FeMoco, exhibits hydrogen evolution and retains some ability to reduce acetylene, but not dinitrogen. Addition of molybdenum and homocitrate to the NifB-co containing enzyme did not influence substrate reduction [30]. Potentially, however, thermal adaptation might permit the assembly of FeMoco on a modified scaffold or perhaps on the NifDK subunits themselves. 
Further characterisation of nitrogen fixation and the properties of nitrogenase in these thermophilic organisms will be required to establish if FeMoco can indeed be assembled via an alternative route.
Our studies have highlighted a number of nitrogenase-like homologs belonging to the oxidoreductase/nitrogenase component 1 family, which may have different metabolic functions compared to the well-characterised canonical representatives, nitrogenase and protochlorophyllide reductase. Structural studies reveal that the fold of these two enzymes is remarkably similar, with equivalent positioning of the [Fe-S] clusters enabling a similar mechanism of ATP-driven electron transfer from the reductase protein to the catalytic component. Diversity of substrate reduction is provided by the presence of a cleft in the catalytic component that can either accommodate a large cofactor (FeMoco) or a large substrate (protochlorophyllide). Although none of the alpha subunit related sequences we have analysed contain the FeMoco ligand His442, it is not possible to distinguish whether the function of these sequences is likely to relate to catalysis (i.e. NifDK-like) or to biosynthesis (i.e. NifEN-like). Biochemical and structural studies of NifEN reveal its functional diversity, since it can catalyse cluster conversion, molybdenum incorporation into the cofactor in association with NifH, and potentially the incorporation of homocitrate into FeMoco [9]. Although the primary role of NifEN is to provide the machinery for FeMoco biosynthesis, it has also been shown to catalyse reduction of some nitrogenase substrates, albeit with relatively low efficiency [13].
Nitrogenase-like sequences could potentially perform analogous roles in association with a NifH-like component. The genomic organisation of these proteins may provide some clues to their possible metabolic functions (Additional file 4: Figure S2). We note that sequences possessing the equivalent of Cys275 in the alpha subunit are commonly associated with O-acetyl homoserine sulfhydrolase or cysteine synthase, suggesting a potential involvement in sulphur metabolism (e.g. Rhodospirillum rubrum ATCC 11170, Clostridium beijerinckii NCIMB 8052, Geobacter sp. FRC-32, Additional file 4: Figure S2). In other cases, nitrogenase-like sequences are co-located with ABC transporter systems (e.g. Clostridium cellulovorans 743B, Methanocorpusculum labreanum Z, Clostridium botulinum A2 Kyoto-F). This might provide a mechanism for coupling metal transport to the assembly of a metal cofactor. In Coprococcus catus GD/7 and other representatives of the Firmicutes, NifHEN-like proteins are associated with hydrogenase maturation proteins and may play a role in the assembly of the active site metallocluster. The NifD proteins present in methanogenic Archaea have been proposed to function in coenzyme F430 biosynthesis, and NflD has been shown to co-purify with a NifH-like protein, NflH [31]. In some cases we observe that NflD homologs are adjacent to NflH and a gene involved in a late step in cobalamin biosynthesis, which encodes cobyrinic acid a,c-diamide synthase (Additional file 4: Figure S2). This may imply that these proteins function in cobalamin reduction.
The NflD single subunit enzymes appear to be the early ancestors of both the bacteriochlorophyll biosynthesis proteins (BchN and BchB) and the nitrogenases (Nif/Vnf/AnfDK) [24,27,31]. Recent evolutionary studies suggest that nitrogen fixation originated after the emergence of bacteriochlorophyll biosynthesis [25] and consequently spread to diverse microbial lineages via lateral gene transfer [24,27]. Potentially, the additional NifDK-like sequences that we have identified may be representative of ancestors that arose after the duplication event that led to the emergence of the alpha and beta subunits of nitrogenase and evolved to perform various metabolic functions. It is important to note that thus far we have only identified nitrogenase-like sequences in obligate or facultative anaerobes, consistent with the view that nitrogenase evolved in anaerobic methanogens and Firmicutes [25]. As noted above, these early forms may not have functioned as catalysts, but might have had roles in metallocluster biosynthesis. Although current information on the role of these nitrogenase-like sequences is sparse, future biochemical and structural studies on this hitherto unrecognised group of proteins are likely to provide a rich source of information concerning the evolution and catalytic diversity of these nitrogenase homologs.

Conclusions
This work led to the identification of 67 potential diazotrophic species distributed across twelve taxonomic phyla, indicating that this metabolic trait is more widespread than formerly predicted. The identification of a minimum gene set required for nitrogen fixation provides a more robust method for the in silico prediction of this biochemical pathway. The occurrence of nif-orphan sequences or incomplete gene sets in several species calls into question the single-gene approaches used in phylogenetic studies of nitrogen fixation. Furthermore, our analysis highlights the presence of nitrogenase-like sequences with the potential to catalyse as-yet unidentified functions.

Survey of nitrogen fixing genes in prokaryotic genomes
Nitrogen fixing genes present in species with completely sequenced genomes were identified through the protein database of microbial genomes at the National Center for Biotechnology Information, as available up to July 17th 2011. For species with more than one sequenced genome, a single representative was manually selected, resulting in 999 unique Bacterial species and 93 unique Archaeal species. BLAST [32] searches used the following A. vinelandii nitrogen fixation protein sequences as queries: NifH (Avin_01380), NifD (Avin_01390), NifK (Avin_01400), NifE (Avin_01450), NifN (Avin_01460), NifU (Avin_01620), NifS (Avin_01630), NifV (Avin_01640), NifB (Avin_51010), NifQ (Avin_51040), AnfG (Avin_48980), and VnfG (Avin_02600). Initially, hits were selected based on a relatively weak threshold (≥ 20% amino acid identity over the query length). The initial list was then refined using the minimum gene set criterion, hits to anf/vnfG, and the presence of synteny, yielding the protein sequences listed in Additional file 2: Table S2, Table S3 and Table S4.
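The two quantitative rules in this survey, the initial identity threshold and the minimum gene set criterion, can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names and the data layout (a set of Nif protein names per genome) are assumptions made for illustration, and the refinement steps based on anf/vnfG hits and synteny are not modelled.

```python
# Illustrative sketch only; names and data layout are hypothetical.

# Minimum gene set proposed in this work: NifHDK plus NifENB.
MINIMUM_GENE_SET = frozenset({"NifH", "NifD", "NifK", "NifE", "NifN", "NifB"})

# Initial, deliberately weak threshold: >= 20% identity over the query length.
IDENTITY_THRESHOLD = 20.0


def identity_over_query(identical_residues: int, query_length: int) -> float:
    """Percent identity, normalised by the full query length."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return 100.0 * identical_residues / query_length


def passes_initial_threshold(identical_residues: int, query_length: int) -> bool:
    """True if a hit meets the initial identity threshold."""
    return identity_over_query(identical_residues, query_length) >= IDENTITY_THRESHOLD


def meets_minimum_gene_set(genes_found) -> bool:
    """True if every gene of the minimum set has a hit in the genome."""
    return MINIMUM_GENE_SET <= set(genes_found)
```

Under this sketch, a genome with hits to NifH, NifD, NifK, NifB and NifV but none to NifE or NifN would not be predicted as diazotrophic, which corresponds to the incomplete gene sets discussed above.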

Selection and phylogenetic analysis of nitrogenase-like sequences
An initial list of 75 NifD/E and NifK/N-like sequences belonging to the PFAM family PF00148 was selected manually from the IMG database [33] (http://img.jgi.doe.gov) and then used as queries in a BLAST [32] search against the NCBI NR protein database with an e-value cut-off of 10^−20. This returned 1117 unique gene IDs, which were then filtered against known NifD/E and NifK/N sequences (Additional file 2: Table S3) to remove hits to conventional nitrogenase. The remaining 900 unique gene IDs were further filtered with a BLAST search against ChlB (accession GenBank:AAT28195.1), BchB (SwissProt:Q3APL0.1), ChlN (GenBank:AAP99591.1) and BchN (SwissProt:Q3APK9.1) to remove homologs of protochlorophyllide reductase. Fused protein sequences (NifHD/E) were also filtered out and were not subject to further phylogenetic analysis. A further filtering step used a preliminary tree built with FastTree 2.1 [34] to identify very similar sequences; only one member of each set of similar sequences was kept. The final compilation contained 472 unique gene IDs.
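The successive exclusion steps described above can be summarised in a short Python sketch. It is an illustration under stated assumptions, not the pipeline used here: the `Hit` record, the `filter_hits` function and the treatment of the cut-off as inclusive are hypothetical, the sets of gene IDs to exclude are assumed to have been obtained beforehand, and the tree-based removal of near-identical sequences is not modelled.

```python
# Illustrative sketch only; the record type and function are hypothetical.
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-20  # assumed inclusive: hits with e-value <= cut-off are kept


@dataclass(frozen=True)
class Hit:
    gene_id: str
    e_value: float


def filter_hits(hits, known_nitrogenase_ids, reductase_homolog_ids, fused_ids):
    """Return unique gene IDs that pass the cut-off and all exclusion lists.

    Order of first appearance is preserved so results are reproducible.
    """
    excluded = set(known_nitrogenase_ids) | set(reductase_homolog_ids) | set(fused_ids)
    seen = set()
    kept = []
    for hit in hits:
        if hit.e_value > E_VALUE_CUTOFF:
            continue  # too weak a match
        if hit.gene_id in excluded or hit.gene_id in seen:
            continue  # conventional nitrogenase, reductase homolog, fusion, or duplicate
        seen.add(hit.gene_id)
        kept.append(hit.gene_id)
    return kept


example_hits = [
    Hit("g1", 1e-30),
    Hit("g2", 1e-5),   # fails the e-value cut-off
    Hit("g3", 1e-25),  # known NifD/E or NifK/N sequence
    Hit("g1", 1e-40),  # duplicate gene ID
    Hit("g4", 1e-50),  # protochlorophyllide reductase homolog
    Hit("g5", 1e-22),
]
remaining = filter_hits(example_hits, {"g3"}, {"g4"}, set())
```

Keeping the exclusion lists as explicit arguments makes each filtering stage auditable, since the number of gene IDs removed at each step can be reported separately, as is done in the text.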
Manual inspection of the 472-sequence tree yielded a "core" list of 73 representative sequences. These 73 sequences were then aligned with ClustalW version 2.1 [35] with the Gonnet 250 protein matrix and default pairwise alignment options. A phylogenetic tree was built with FastTree 2.1 [34] using the WAG + gamma20 likelihood model; the result is shown in Figure 4.