Analysis of 16S rRNA environmental sequences using MEGAN
© Mitra et al; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
Skip to main content
© Mitra et al; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
Metagenomics is a rapidly growing field of research aimed at studying assemblages of uncultured organisms using various sequencing technologies, with the hope of understanding the true diversity of microbes, their functions, cooperation and evolution. There are two main approaches to metagenomics: amplicon sequencing, which involves PCR-targeted sequencing of a specific locus, often 16S rRNA, and random shotgun sequencing. Several tools or packages have been developed for analyzing communities using 16S rRNA sequences. Similarly, a number of tools exist for analyzing randomly sequenced DNA reads.
We describe an extension of the metagenome analysis tool MEGAN, which allows one to analyze 16S sequences. For the analysis all 16S sequences are blasted against the SILVA database. The result output is imported into MEGAN, using a synonym file that maps the SILVA accession numbers onto the NCBI taxonomy.
Environmental samples are often studied using both targeted 16S rRNA sequencing and random shotgun sequencing. Hence tools are needed that allow one to analyze both types of data together, and one such tool is MEGAN. The ideas presented in this paper are implemented in MEGAN 4, which is available from: http://www-ab.informatik.uni-tuebingen.de/software/megan.
Metagenomics is the study of the genomic content of a assemblage of organisms, obtained from a common habitat or an environmental sample of microbes. With the progress in the throughput and cost-efficiency of sequencing technology, there is a rapid increase in the number and scope of metagenomic projects. Two possible ways to analyze the taxonomic content of an environmental sample are either to perform a random shotgun sequencing of the DNA of the sample, or to use a targeted approach in which only one particular gene is amplified and sequenced. The latter is sometimes called amplicon sequencing.
As rRNA gene sequences are present in all living cells, these sequences (16S or 18S rRNA) are widely used for phylogenetic studies and also as the target of amplicon sequencing [1, 2]. There are a number of tools for the analysis and comparison of 16S or 18S data, such as DOTUR , MOTHUR , SINA aligner at the SILVA website , RDP  and EstimateS . More recent tools include MLtreemap , UniFrac  and pplacer  and QIIME .
MEGAN (“MEtaGenome ANalyzer”)  is widely used to perform the taxonomic and functional analysis of large metagenomic datasets. Previous versions of MEGAN could only be applied to random shotgun sequences. One of the new features released with the version 4 of MEGAN [13, 14] is the ability to analyze 16S sequences. The aim of this paper is to describe this new approach in more detail.
We will illustrate how to apply MEGAN4 to rRNA sequences using an example dataset of ≈ 4000 published 16S sequences from  (obtained from a set of mice children and referred here as ‘mice-data’). The ideas presented in this paper are quite simple. The main merit of this work lies in the integrated implementation of the methods in the form of a user-friendly program, which can be used by biologists to analyze their 16S datasets in the context of other types of datasets.
The aim of this work is to support the analysis of the result of a BLAST comparison of 16S rRNA against the SILVA database. SILVA result files do not contain the information on the species and/or the strain from which a reference sequence was obtained. Hence, we created a mapping file that maps SILVA accession numbers to corresponding NCBI taxon IDs. This mapping file is used by the “accession lookup” feature of MEGAN4 to identify the related species. An advantage of this approach is that no modification to the original SILVA database are required and it is possible to include additional information on the species/strain name when creating the mapping file. Moreover the mapping file is very small and can be updated with ease.
Our simple mapping algorithm starts by computing a hash map between all NCBi taxon names, synonyms, equivalents names and missing spellings, on the one hand, and all corresponding taxon IDs, on the other. Then each entry in the SILVA file is compared against the NCBI map. If a match to a taxon name is found, then the Silva entry is mapped to that taxon, unless the taxon name contains one of the keywords ‘uncultured’, ‘unidentified’ or ‘metagenome’, in which case the lowest taxon entry from the SILVA full taxa description is taken. If neither case is successful, then we change the capitalization of the Silva entry and retry the matching step.
Examples taken from the SILVA file for illustrating the mapping algorithm (taxonomic paths shortened when necessary).
Eukaryota; Metazoa;[…]Turbinidae; Homalopoma
Bacteria; Firmicutes;[…]Family XII Incertae Sedis; Fusibacter
Eukaryota; Metazoa; Nematoda; environmental samples
As a second example (Table 1), the keyword ‘uncultured’ appears in the species name. In such a situation the taxonomic path is used to assigning this read to a taxa. To be precise, the lowest taxonomic entry which is in this case Fusibacter is considered. This name is found in the NCBI hash map and the read assigned to the ID 76008. If there was no hit for Fusibacter, the next higher taxonomic entry would be used for searching (in this case: Family XII Incertae Sedis). If this search also failed to find a hit, then this procedure is repeated until the highest taxonomic entry for this read (here: Bacteria) is reached and a hit is to be expected.
The last case illustrates an example of combining two unwanted keywords (Table 1). In the species name the keyword ‘uncultured’ again appears. The lowest taxonomic entry is also rejected by the searchFullTaxa-method because the keyword ‘environmental samples’ is detected. So this read is finally assigned to NCBI ID for Nematoda.
In order to test the analysis method with the created mapping file first we used published 16S sequences from  (≈ 4000 reads obtained from mouse guts (all mice children data) referred to here as ‘mice-data’).
First the ‘mice-data’ is aligned against the SILVA ribosomal RNA sequence database  using a variant (BLASTN) of the program BLAST . Furthermore, we aligned the dataset against NCBI-NR database (of non-redundant protein sequences ) using BLASTX, expecting to see no hits as the NR database is not supposed to contain any 16S sequences, as it is a database of protein sequences. For both the cases for aligning the sequences using BLAST we used a very relaxed threshold in order to allow almost all the mappings. But while importing it in MEGAN we used a threshold of Min Score=120, Top Percent=10 and Min Support=5, which enables a conservative assignment.
This tells MEGAN4 to use the mapping file to assign the accession number of a hit in the BLAST output file to a taxa, before trying to make taxon names. Before opening “regular” BLAST output files these changes must be revoked.
To compare the performance of the MEGAN4 analysis based on a BLASTN comparison of the reads against the SILVA database, we applied a number of different analysis tools to the mice-data. In more detail, we ran the data through the RDP web server  (using ‘Confidence threshold’: 80%) and the SINA aligner at the SILVA website (using default settings) , and a Greengene-, RDP- and SILVA-based analysis offered by MG-RAST web service  (all analyses are as of July 12, 2011). For MG-RAST analyses the e-value cutoff for sequences matches to these annotation sources (Greengene/RDP/SILVA) was set to 1 × 104 with a Min. % Identity Cutoff as 90%. We didn’t specify minimum alignment length in order to allow all the assignments with previous threshold. We extracted our result at genus level in order to compare the analyses in depth. As MG-RAST does not produce a result in a hierarchical structure we certainly loose many hits that couldn’t attain the threshold at ‘genus level’. For comparison purpose we put those reads that are not available at ‘genus level’ analysis as ‘No hits’.
MEGAN4 is able to directly import the results obtain from the RDP website and also the results obtained from the SILVA website. For importing the the SILVA result users need to select Import from BLAST-menu item using the option Use Synonyms as mentioned above. For RDP analysis results users need to download the resulting text file from the “Classifier:: Assignment detail” page. For importing the analysis directly from SILVA website, users need to download the “log file” after running the website’s aligner. MEGAN 4 is then able to read both the files using the standard ‘Import from BLAST’ dialog. MG-RAST results can be saved and imported using an importer for CSV (comma separated value) files (using only two columns ‘genus’ and ‘abundance of reads’ without any header).
What we in fact observed is that a hit is found for most reads, usually to protein entry labeled “hypothetical protein”. Only a small number of reads (420) were not assigned to a taxon, and this was usually because the ‘min score’ threshold was not reached. While in some cases the taxonomic assignment based on NR was the same as the one obtained using an appropriate method, in most cases the assignment to was to a taxon that is probably incorrect. One of the best examples of the wrong assignment using NCBI-NR is the node Oryza (21 reads mapped to an unknown protein [Oryza sativa Japonica Group]), all of which should be assigned to Lachnospiraceae in the phylum Firmicutes. Because many of matches are highly significant, this all indicates that the NCBI-NR database probably contains a number of 16S rRNA sequences that are falsely assumed to be protein-coding genes.
An important practical implication of this study is that one should remove all rRNA sequences from a random shotgun dataset before performing an NCBI-NR based analysis, as they will lead to false positive assignments.
Metagenomics is a fast growing field and novel tools are required to analyze the ever growing datasets. Amplicon sequencing targeting the 16S rRNA gene is widely used for estimating the taxonomic structure of environmental bacterial assemblages. MEGAN 4, already widely used for analyzing random shotgun sequences, can now also be used for 16S rRNA, allowing the direct comparison of taxonomic profiles obtained from different types of data, and different methods.
SM and DHH designed the project and wrote the manuscript. MS created the mapping file and wrote necessary codes. DHH wrote necessary codes for implementing this new approach in MEGAN. SM performed all the BLAST and comparisons.
This article has been published as part of BMC Genomics Volume 12 Supplement 3, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S3.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.