Differences in sequencing technologies improve the retrieval of anammox bacterial genome from metagenomes

Background Sequencing technologies have different biases in both single-genome and metagenomic sequencing; these biases can significantly affect ORF recovery and the population distribution of a metagenome. In this paper we investigate how well different technologies represent information related to an organism of interest in a metagenome, and whether it is beneficial to combine information obtained using different technologies. We comparatively analyze three metagenomic datasets acquired from a sample containing the anammox bacterium Candidatus ’Brocadia fulgida’ (B. fulgida). These datasets were obtained using Roche 454 FLX and Sanger sequencing with two different libraries (shotgun and fosmid).
Results In each dataset, the abundance of the reads annotated to B. fulgida was much lower than the abundance expected from available cell count information. This was due to the overrepresentation of GC-richer organisms, as shown by the GC-content distribution of the reads. Nevertheless, by considering the union of B. fulgida reads over the three datasets, the number of B. fulgida ORFs recovered for at least 80% of their length was twice the number recovered by the best technology. Indeed, while the taxonomic distributions of reads in the three datasets were similar, the respective sets of B. fulgida ORFs recovered for a large part of their length were highly different, and the depth of coverage patterns of 454 and Sanger were dissimilar.
Conclusions Precautions should be taken to prevent the overrepresentation of GC-rich microbes in the datasets. This overrepresentation and the consistency of the taxonomic distributions of reads obtained with different sequencing technologies suggest that, in general, abundance biases might be mainly due to other steps of the sequencing protocols.
The results show that biases against organisms of interest could be compensated for by combining different sequencing technologies, owing to the differences in their genome-level sequencing biases, even if the species was present at similar abundances in the metagenomes.

shows that the 454 technology sequenced many more base pairs than the others. However, the numbers of base pairs belonging to annotated reads were similar: the annotated base pairs of 454 and Fosmid were almost identical in number, and slightly fewer than those of Shotgun (Table 2). Focusing on B. fulgida, we observed that the sets of reads annotated to its ORFs in the different datasets had comparable total numbers of base pairs. Therefore, the performance of these sequencing technologies in recovering B. fulgida ORFs could be compared fairly. The percentage of 454 reads that could be annotated was lower than for the other two technologies: it was 41.71%, whereas for Shotgun and Fosmid it was 84.62% and 81.07%, respectively (Tables 1 and 2). This difference was probably due to the short length of 454 reads, which increases the probability that an alignment is obtained by chance and is hence not meaningful. Alignments of 454 reads to the reference sequences were more likely to have an E-value higher than the chosen threshold, and the higher the E-value, the more likely it is that the alignment has been obtained by chance.
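The annotated-read percentages above follow from applying an E-value cutoff to the best alignment of each read. A minimal sketch of that bookkeeping, assuming each read has already been reduced to the E-value of its best hit (or `None` when no hit was reported); the function name, the toy values and the cutoff `1e-5` are illustrative assumptions, not the values used in the study:

```python
from typing import Dict, Optional


def annotated_percentage(best_evalue: Dict[str, Optional[float]],
                         max_evalue: float) -> float:
    """Percentage of reads whose best alignment passes the E-value cutoff."""
    if not best_evalue:
        raise ValueError("no reads given")
    annotated = sum(
        1 for e in best_evalue.values() if e is not None and e <= max_evalue
    )
    return 100.0 * annotated / len(best_evalue)


# Toy data (hypothetical): read id -> E-value of its best hit, None = no hit.
toy_reads = {"r1": 1e-30, "r2": 2e-3, "r3": None, "r4": 1e-8}
toy_percentage = annotated_percentage(toy_reads, max_evalue=1e-5)
```

In this toy set, `r2` has a hit but is rejected because its E-value exceeds the cutoff, which is the situation the text describes as more frequent for short 454 reads.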
3 The recovering trend changes for higher mapping percentage threshold values
In this section we compare the sets of B. fulgida ORFs recovered by the different technologies, focusing on the ORFs mapped for at least a given minimum mapping percentage threshold and on the functional content of highly mapped ORFs. When the minimum mapping threshold was set to zero, 454 gave the best performance, as observed in Section 2: it recovered more B. fulgida ORFs than the other technologies (Table 3). Moreover, 454 also recovered many of the ORFs recovered by the other technologies: about 90% of the ORFs recovered by Shotgun and by Fosmid, considered separately (Manuscript Figure 3A). Shotgun and Fosmid recovered many common ORFs: the intersection of their sets of recovered ORFs contained 877 ORFs, corresponding to 67.98% of the Shotgun set and 72.48% of the Fosmid set. Of these common ORFs, 94.07% were also recovered by 454.
As the minimum mapping threshold was increased, we recognized two trends: the numbers of ORFs recovered by the technologies changed in favour of the two Sanger-based libraries (Shotgun and Fosmid), and fewer ORFs were recovered by more than one technology. The number of recovered ORFs decreased faster for 454 than for the other two technologies as the mapping threshold increased (Manuscript Figure 5, Table 4): as long as the mapping threshold was lower than 50%, 454 recovered more ORFs than the other technologies, whereas from 50% onward it recovered fewer. This change in the relation between the technologies was due to the high number of ORFs that Shotgun and Fosmid recovered almost entirely (Figure 4).
To better illustrate these behaviours, we compared the sets of ORFs recovered by the different technologies at mapping thresholds of 50% and 80%. At a mapping threshold of 50%, the three sets of recovered ORFs were in a symmetric relation: each contained about 1,000 ORFs (see Manuscript Figure 5, Table 4), and the intersection of every pair of sets contained about 570 ORFs (Manuscript Figure 3B, Table 5). At a mapping threshold of 80%, the set of ORFs recovered by 454 was much smaller than those of the other two technologies: its size was about one third of theirs (Manuscript Figure 3C, Table 4). Of the 201 ORFs in the 454 set, 125 were also recovered by at least one of the other two technologies. The sets of ORFs obtained by Shotgun and Fosmid had similar sizes, and their intersection corresponded to about 38% of each of them. Each of these sets shared about 41% of the ORFs recovered by 454.
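The threshold-based comparison in this section can be sketched as follows. This is a minimal illustration, assuming that each technology's alignments have been reduced to half-open intervals in ORF coordinates; the function names and the toy ORFs are hypothetical and do not reproduce the pipeline used in the study.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in ORF coordinates


def mapped_fraction(intervals: List[Interval], orf_length: int) -> float:
    """Fraction of the ORF covered by at least one alignment."""
    covered = 0
    frontier = 0  # rightmost position already counted
    for start, end in sorted(intervals):
        start = max(start, frontier)
        end = min(end, orf_length)
        if end > start:
            covered += end - start
            frontier = end
    return covered / orf_length


def recovered_orfs(alignments: Dict[str, List[Interval]],
                   lengths: Dict[str, int],
                   min_fraction: float) -> Set[str]:
    """ORFs with at least one alignment and mapped for >= min_fraction."""
    return {
        orf for orf, intervals in alignments.items()
        if intervals and mapped_fraction(intervals, lengths[orf]) >= min_fraction
    }


def shared_percentage(a: Set[str], b: Set[str]) -> float:
    """Percentage of the ORFs in `a` that are also in `b`."""
    return 100.0 * len(a & b) / len(a)


# Toy data (hypothetical): many short hits for 454, fewer long hits for Shotgun.
lengths = {"orfA": 100, "orfB": 200, "orfC": 150}
hits_454 = {"orfA": [(0, 30), (20, 45)], "orfB": [(0, 60)], "orfC": [(10, 40)]}
hits_shotgun = {"orfA": [(0, 90)], "orfB": [(50, 200)]}

at_zero_454 = recovered_orfs(hits_454, lengths, 0.0)
at_half_454 = recovered_orfs(hits_454, lengths, 0.5)
at_80_shotgun = recovered_orfs(hits_shotgun, lengths, 0.8)
```

With a threshold of zero the toy 454 data recover all three ORFs, but none of them survives a threshold of 50%, which mirrors the qualitative trend described above.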
4 Shotgun and Fosmid achieve better ORF recovering quality than 454
The similarity between the coverage patterns of a given ORF obtained with two technologies was measured through the Pearson correlation. A positive correlation indicates that the coverage depths obtained with the two technologies increase or decrease together; a negative correlation indicates that as the depth obtained with one technology increases, the depth obtained with the other decreases, and vice versa. The Sanger-based technologies showed similar depth of coverage patterns for 50.29% of the ORFs recovered by both (Figure 2): for 22.17% of the B. fulgida ORFs recovered by both technologies the correlation was between 0.7 and 1, and for 28.12% it was between 0.3 and 0.7. Nevertheless, for 21.59% of the ORFs the correlation was significantly negative (from -1 to -0.3).
In contrast, the coverage patterns of the Sanger-based technologies and of 454 were not related. For the pairs Fosmid/454 and Shotgun/454, the correlations between the coverage patterns indicated neither a significant difference nor a significant similarity for the majority of ORFs: the correlations were between -0.3 and 0.3 for 51.54% and 53.96% of the ORFs, respectively. In both cases, the percentage of ORFs with a very significant correlation (i.e., outside the range from -0.7 to +0.7) was less than 10%.
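The correlation analysis of this section can be illustrated with a short sketch. It assumes that, for one ORF, the depth of coverage of each technology is available as a per-position vector of equal length. The class boundaries follow the ranges quoted in the text, but the handling of values lying exactly on a boundary, the class labels and the toy depth vectors are assumptions made for the example.

```python
import math
from typing import Optional, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation of two depth vectors; None if either is constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a flat coverage pattern
    return sxy / math.sqrt(sxx * syy)


def correlation_class(r: float) -> str:
    """Bin a correlation into the ranges used in the text (labels are ours)."""
    if r >= 0.7:
        return "strongly positive"
    if r >= 0.3:
        return "moderately positive"
    if r > -0.3:
        return "unrelated"
    return "negative"


# Toy per-position depths for a single hypothetical ORF.
shotgun_depth = [1, 2, 3, 4, 3, 2]
fosmid_depth = [2, 4, 6, 8, 6, 4]   # rises and falls with Shotgun
depth_454 = [4, 3, 2, 1, 2, 3]      # mirrors Shotgun

r_sanger = pearson(shotgun_depth, fosmid_depth)
r_mixed = pearson(shotgun_depth, depth_454)
```

Applying `correlation_class` to every ORF recovered by a pair of technologies and counting the labels would give percentages of the kind reported above.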

Combining all technologies improves ORF recovering
We compared the sets of ORFs recovered by all possible combinations of technologies. The combination of all three technologies recovered more ORFs than any other combination or any single technology (Manuscript Figure 5). The combination of all the technologies recovered 1879 ORFs, which was 46.26%, 55.93%, and 10.40% more than Shotgun, Fosmid and 454, respectively; the differences with respect to the combinations Shotgun-454, Fosmid-454 and Shotgun-Fosmid were 2.68%, 4.22%, and 16.56%, respectively.
Thanks to the diversity of sequencing biases, combining all the technologies significantly increased the number of ORFs recovered for at least 95% of their length (Manuscript Figure 4). As the mapping threshold increased, so did the relative gaps between the number of ORFs recovered by the combination of all the technologies and the numbers recovered by any other combination or by any single technology. This trend was particularly strong in comparison with 454 and the combinations involving it. At a threshold of 80%, the relative differences between the combination of all technologies and Shotgun, Fosmid and 454 alone increased to 108.90%, 124.09%, and 543.28%, respectively. The difference with respect to the combination Shotgun-454 increased to 38.59%; similarly, the difference with respect to the combination Fosmid-454 increased to 42.24%. In contrast, the gap between the combination of all three technologies and the combination Shotgun-Fosmid increased only slightly: from 16.56% at threshold 0% to 20.39% at threshold 80%.
Among the combinations of two technologies, Shotgun-454 was the best for low mapping thresholds, while Shotgun-Fosmid was the best for high mapping thresholds. The Shotgun-454 combination recovered 1830 ORFs, slightly fewer than the combination of all the technologies. Although Shotgun-Fosmid was the combination that recovered the lowest number of ORFs (1612), it outperformed Shotgun-454 at mapping thresholds of 70% or more.
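The effect of combining technologies can be sketched in two steps: pooling the aligned regions of an ORF over several technologies, and expressing the gain of one ORF count over another as a percentage. The sketch below assumes half-open alignment intervals lying within the ORF; the helper names and the toy intervals are hypothetical. The counts passed to `relative_gain` (1879, 1830 and 1612) are taken from the text.

```python
from typing import Dict, Iterable, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in ORF coordinates


def covered_positions(intervals: Iterable[Interval]) -> Set[int]:
    """Set of ORF positions covered by at least one alignment."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def pooled_fraction(per_technology: Dict[str, List[Interval]],
                    orf_length: int) -> float:
    """Mapped fraction of one ORF after pooling all listed technologies."""
    pooled: Set[int] = set()
    for intervals in per_technology.values():
        pooled |= covered_positions(intervals)
    return len(pooled) / orf_length


def relative_gain(combined: int, other: int) -> float:
    """How many more ORFs `combined` recovers than `other`, in percent."""
    return 100.0 * (combined - other) / other


# Toy ORF of length 100: no single technology reaches 80%, the union does.
toy_orf = {
    "454": [(0, 55)],
    "Shotgun": [(40, 70)],
    "Fosmid": [(65, 100)],
}
single = {tech: pooled_fraction({tech: ivs}, 100) for tech, ivs in toy_orf.items()}
combined = pooled_fraction(toy_orf, 100)

# Counts reported in the text at threshold 0%.
gain_vs_shotgun_454 = round(relative_gain(1879, 1830), 2)      # 2.68
gain_vs_shotgun_fosmid = round(relative_gain(1879, 1612), 2)   # 16.56
```

The toy ORF shows why differing coverage patterns matter: each technology covers a different part of the ORF, so only the pooled alignments pass a high mapping threshold.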

No specific genome location bias of the technologies
We checked whether the sequencing technologies had a location bias, i.e., whether some areas of the genome were covered more than others. Figure 4 shows that the recovered ORFs were almost uniformly distributed over the genome, indicating no strong location bias except for a few spikes. These coverage spikes were mostly consistent among the different sequencing technologies, and corresponded to the following ORFs: kustd1658, kustd1783, kustd2042, kuste3701, kuste4036, kuste4355, kuste4640, kuste4642. These ORFs have other, almost identical copies in the Kuenenia genome; moreover, kuste3701 differs from kuste4642 by just two amino acids. Therefore, it is likely that BLASTX wrongly assigned to these ORFs many reads that were actually sequenced from their copies. Most of these ORFs corresponded to genes related to DNA replication, recombination, and repair. Some areas of the genome were less covered than others, and these biases were consistent among the different sequencing technologies.
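One simple way to flag coverage spikes of the kind described above is to compare the mean depth of each ORF with the median over all ORFs, and then keep the ORFs flagged by every technology. The text does not state how the spikes were identified, so the fold criterion, the function names and the toy depths below are assumptions made purely for illustration.

```python
from statistics import median
from typing import Dict, List, Set


def coverage_spikes(mean_depth: Dict[str, float], fold: float = 5.0) -> List[str]:
    """ORFs whose mean depth exceeds `fold` times the median depth."""
    typical = median(mean_depth.values())
    return sorted(orf for orf, depth in mean_depth.items() if depth > fold * typical)


def consistent_spikes(per_technology: Dict[str, Dict[str, float]],
                      fold: float = 5.0) -> Set[str]:
    """Spikes flagged by every technology."""
    flagged = [set(coverage_spikes(d, fold)) for d in per_technology.values()]
    return set.intersection(*flagged) if flagged else set()


# Toy mean depths per ORF (hypothetical ORF names and values).
toy_depths = {
    "454": {"orf1": 2.0, "orf2": 3.0, "orf3": 2.5, "orf4": 40.0, "orf5": 1.5},
    "Shotgun": {"orf1": 1.0, "orf2": 1.2, "orf3": 9.0, "orf4": 12.0, "orf5": 0.8},
}
spikes_454 = coverage_spikes(toy_depths["454"])
shared_spikes = consistent_spikes(toy_depths)
```

An ORF flagged by all technologies, like `orf4` here, is a candidate for the read mis-assignment between near-identical copies discussed in the text, rather than for a technology-specific bias.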