Table 1 Gene families used in simulations

From: Beyond classification: gene-family phylogenies from shotgun metagenomic reads enable accurate community analysis

| Gene name | Median length | Rate of amino acid evolution | Number of sequences | Source of sequences | Source of profile |
|---|---|---|---|---|---|
| 16S rRNA | 1535 bp | NA (highly conserved) | 427 | RDP | RDP (INFERNAL) |
| rpoB | 1296 aa | 73.51 | 460 | AMPHORA + GenBank | AMPHORA (HMMER) |
| rpsB | 226 aa | 51.96 | 411 | AMPHORA + GenBank | AMPHORA (HMMER) |
| dnaG | 395 aa | 112.53 | 456 | AMPHORA + GenBank | AMPHORA (HMMER) |
| lolC | 411 aa | 184.04 | 442 | UniProt + GenBank | PhyloFacts (HMMER) |

  1. Each family of gene sequences was limited to its unique representatives among AMPHORA taxa (see Methods). Rate of amino acid evolution was determined by summing all branch lengths in a phylogenetic tree inferred via RAxML from the protein sequences; smaller values indicate fewer substitutions and greater conservation. The 16S rRNA gene requires a nucleotide model of evolution, so its rate is not comparable to the protein values and is listed as NA; the gene is well known to be highly conserved overall, with interspersed variable regions. 16S rRNA sequences were obtained from the Ribosomal Database Project (RDP) [20]. A larger set of 1,071 16S rRNA sequences was used only for the Fast UniFrac analysis (see Additional file 1: Table S1). Amino acid sequences for the rpoB, rpsB, and dnaG families were obtained via AMPHORA [14], while the corresponding DNA sequences were downloaded from NCBI GenBank [21]. For lolC, family members were determined by PhyloFacts [22] (family accession bpg052966 as of February 16, 2011); amino acid sequences were downloaded from UniProt [23], and the corresponding DNA sequences were downloaded from EMBL-EBI [24]. Additional file 1: Table S1 provides download dates and sequence accession numbers.
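
The rate measure described in the footnote, the sum of all branch lengths in a tree, can be illustrated with a minimal sketch. It assumes the tree is available as a Newick string with branch lengths written as `:<number>` and with no colons inside quoted taxon labels. The function name `total_tree_length` and the four-taxon toy tree are hypothetical and are not taken from the paper or from RAxML.

```python
import re

# Matches ":<number>" branch-length annotations in a Newick string.
# Assumption: taxon labels do not contain colons.
_BRANCH_LENGTH = re.compile(r":\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")


def total_tree_length(newick: str) -> float:
    """Sum every branch length found in a Newick-formatted tree string."""
    return sum(float(value) for value in _BRANCH_LENGTH.findall(newick))


# Hypothetical four-taxon tree for illustration only (not data from the paper).
toy_tree = "((A:0.10,B:0.20):0.05,(C:0.30,D:0.40):0.15);"

# Six branches are summed; the total is approximately 1.2.
print(total_tree_length(toy_tree))
```

For real trees, a dedicated Newick parser is likely more robust than a regular expression, since it handles quoted labels and comments; the sketch only shows what "summing all branch lengths" amounts to.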