Studies of the human microbiome are producing large datasets of partial 16S rDNA sequences from prokaryotic colonizers of various human body sites. These metagenomic surveys of microbial communities bypass the need for isolating and cultivating individual species, permit association of bacterial populations with specific environments or conditions of health, and facilitate the discovery of new bacterial and archaeal taxa. Recent advances in sequencing technologies have enabled deeper sequencing of microbial communities at a lower cost, which has presented new computational challenges associated with analyzing 16S rDNA datasets of the human microbiome (see Petrosino et al. 2009 for review).
As part of the Vaginal Human Microbiome Project at Virginia Commonwealth University (VCU), we are studying the association of the vaginal microbiome with various physiological and infectious conditions, and we are assessing how host genetic and environmental factors contribute to the composition of the vaginal microbiome [2, 3]. Microbiome profiles have been generated from mid-vaginal samples from over 1,000 participants, and the dataset currently contains ~30 million reads targeting the V1-V3 hypervariable region of the 16S rRNA genes of vaginal bacteria. We sought a classification method that would permit rapid and high-resolution identification of bacterial taxa to the species level or better, facilitate comparisons with previously published literature, enable incremental data analysis, and support classification of newly identified vaginal taxa.
Two general strategies are commonly used to classify 16S rDNA reads: (1) taxonomic assignment approaches that utilize comprehensive databases (e.g., the Ribosomal Database Project (RDP), ARB-Silva, and Greengenes); and (2) taxonomy-independent approaches that classify reads into Operational Taxonomic Units (OTUs). Similarity searches against large public databases, such as the GenBank nucleotide database (http://www.ncbi.nlm.nih.gov/nuccore), are problematic for taxonomic identification, in part because these databases are incomplete and contain many unidentified and poorly annotated sequences. The RDP Classifier, a naïve Bayesian classifier that makes assignments based on the composition of subsequences, is commonly used for taxonomic assignments because of its balance of accuracy, ease of use, and speed. However, the most common applications of the RDP Classifier and its training set achieve resolution to the genus level at best. Moreover, because many microorganisms that constitute the human microbiome have yet to be identified or sequenced, the 16S rDNA reference databases are incomplete, which often precludes resolution even to the genus level. Others have recently used alignment-based methods (e.g., MEGAN [8, 9]) to classify 16S rDNA reads, and the results of these methods are also highly dependent on reference database quality and completeness.
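To make the idea of composition-based assignment concrete, the following is a minimal sketch of a naïve Bayesian classifier that scores a read by the overlapping subsequences (k-mers) it contains. It is an illustration only and not the RDP Classifier implementation: the function names, the word length k = 4, the add-half smoothing, and the two toy reference sequences (which are not real 16S rDNA) are all assumptions made for this example.

```python
import math
from collections import defaultdict


def kmers(seq, k=4):
    """Return the set of overlapping k-mers (words) in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(references, k=4):
    """Count, per taxon, how many reference sequences contain each k-mer.

    references: list of (taxon, sequence) pairs.
    Returns (counts, totals), where counts[taxon][word] is the number of
    that taxon's sequences containing the word and totals[taxon] is the
    number of sequences for the taxon.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for taxon, seq in references:
        totals[taxon] += 1
        for word in kmers(seq, k):
            counts[taxon][word] += 1
    return counts, totals


def classify(read, counts, totals, k=4):
    """Return the taxon with the highest summed log-probability of the read's words."""
    best_taxon, best_score = None, -math.inf
    for taxon, n in totals.items():
        score = 0.0
        for word in kmers(read, k):
            # Add-half smoothing (an assumption) keeps unseen words from
            # producing log(0).
            p = (counts[taxon].get(word, 0) + 0.5) / (n + 1)
            score += math.log(p)
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon


# Toy, made-up reference sequences (hypothetical, not real 16S rDNA).
TOY_REFERENCES = [
    ("taxon_A", "ACGTACGTTTGACCA"),
    ("taxon_B", "GGCCTTAAGGCCTTAAG"),
]

counts, totals = train(TOY_REFERENCES)
assigned = classify("ACGTTTGAC", counts, totals)
print(assigned)
```

In this toy case every word of the query read occurs in the taxon_A reference and none occurs in taxon_B, so the read is assigned to taxon_A. The sketch also shows why resolution depends on the training set: the classifier can only return labels that appear in the references it was trained on.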
Despite the challenges associated with these classification strategies, partial 16S rDNA sequences of informative hypervariable regions can distinguish species assigned to the same genus. Lactobacillus species, which often predominate in the vagina, provide a striking example: the V1-V3 regions of the 16S rRNA gene sequences of the most common vaginal lactobacilli (i.e., L. crispatus, L. iners, L. gasseri, and L. jensenii) clearly distinguish these species. Because Lactobacillus species differ in their abilities to exclude the growth of organisms associated with bacterial vaginosis (BV) and other vaginal imbalances [11–13], species-level resolution of lactobacilli is pivotal in a study of vaginal microflora. While multiple-gene and whole-genome analyses are usually required for sub-genus classifications [14, 15], many if not most species-level distinctions can similarly be made using partial 16S rRNA gene sequences when high-quality curated reference databases are available.
De novo clustering methods that group sequences into OTUs have been widely adopted to address shortcomings of phylotype-based approaches [16–18]. These methods are valuable for characterizing datasets without prior knowledge, particularly in samples of largely uncharacterized complex bacterial communities. However, most OTU algorithms are computationally intensive when applied to large datasets. Moreover, the number of OTUs generated by these strategies is often inflated due to sequencing errors inherent in next-generation sequencing technologies. Finally, a biological interpretation of the OTUs and how they impact human health requires placing them in a taxonomic context that is linked to a known bacterial taxon or strain.
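The two drawbacks noted above, computational cost and error-driven inflation of OTU counts, can be seen in a minimal sketch of greedy, taxonomy-independent clustering. This is not any specific published OTU algorithm: the function names are hypothetical, identity is simplified to the fraction of matching positions between equal-length reads (real tools align sequences first), and the 20-base toy reads and the 0.90 threshold are assumptions chosen only to keep the example small.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this simplified sketch compares equal-length reads only")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otus(reads, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold.

    A read that matches no existing centroid becomes the centroid of a new
    OTU. Each read may be compared against every centroid, which is why the
    cost grows quickly with dataset size and diversity.
    """
    centroids = []
    assignments = []
    for read in reads:
        for idx, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                assignments.append(idx)
                break
        else:
            centroids.append(read)
            assignments.append(len(centroids) - 1)
    return centroids, assignments


# Toy reads (hypothetical): one template plus two error-containing copies.
template = "ACGTACGTACGTACGTACGT"
one_error = "ACGTACGTACGTACGTACGA"      # 19/20 positions match the template
three_errors = "TCGTACGTACCTACGTACGA"   # 17/20 positions match the template

centroids, assignments = greedy_otus(
    [template, one_error, three_errors], threshold=0.90
)
print(len(centroids), assignments)
```

Here all three reads derive from the same template, yet the read carrying three errors falls below the threshold and seeds a second OTU. The resulting OTUs are also unlabeled, so they still have to be tied to a known taxon before they can be interpreted biologically.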
Given the challenges associated with taxonomic assignment of 16S rDNA sequences using standard approaches, we developed a comprehensive, non-redundant 16S rDNA reference database of bacterial taxa commonly found in the vagina for use in classification of metagenomic 16S rDNA sequence data derived from bacteria in vaginal samples. Others have recently developed similar reference databases for the oral microbiome: CORE and the Human Oral Microbiome Database [21, 22]. These studies demonstrate the feasibility of employing a body-site-specific 16S rDNA reference database for taxonomic classification of metagenomic 16S sequence reads. Such a resource is currently not available for the vaginal microbiome. Here, we describe: (1) the Vaginal 16S rDNA Reference Database, a comprehensive and non-redundant database of 16S rDNA reference sequences for vaginal taxa; and (2) STIRRUPS, a general method for species-level classification that involves three steps: database curation, clustering reference database sequences into species-level taxa, and taxonomic classification using the STIRRUPS Classifier. The method is validated on reads from replicates of a mock sample containing DNA from six vaginally relevant bacteria and applied to ~30 million V1-V3 16S rDNA reads from samples obtained from mid-vaginal swabs.