Archived Comments for: Performance comparison of second- and third-generation sequencers using a bacterial genome with two chromosomes

  1. What Matters: Balance Between Coverage and Sequencing Costs

    Sebastian Jünemann, University Bielefeld

    18 September 2014

    Nice and very interesting paper! We agree on everything, especially on the performance of the assemblers and the observed sub-sampling effects. Hence, we want to highlight that we just published a de novo assembler comparison that corroborates your conclusions [PLoS ONE 9(9): e107014, 2014]. Although we didn't have the opportunity to include the PacBio system in our study, we used a wider range of assemblers, three different bacterial references, and more data sets.


    We agree with you that the consensus accuracy, not the read-level accuracy, is of major importance when evaluating performance (see also our second-generation sequencer comparison publication [Nat Biotechnol 31:294–296, 2013]). We also consider QUAST a very useful evaluation tool. However, we wonder what the specific reason was for using the N50 metric to compare the de novo assemblies. In particular with regard to the two contigs (>1 Mb) assembled with the PacBio system, the NA50 or the NGA50 metric (both correcting for mis-assemblies), as offered by QUAST, would probably reflect the assembly quality more precisely. This also applies to the other platforms, of course. An almost perfect assembly contiguity must be put into the perspective of assembly quality to provide a meaningful assessment. In addition, while your results demonstrate that the other assemblies have a very low rate of mis-assemblies (even zero mis-assemblies for the GS Jr and Ion PGM), the corresponding description (Table 2) suggests that only the “large” mis-assemblies as reported by QUAST were counted (i.e., local mis-assemblies, where the affected genomic region is less than 1 kb in size, were not reported). We consider this 1 kb threshold somewhat arbitrary; mis-assemblies within smaller genomic regions are still mis-assemblies. Therefore, we would appreciate it if you could comment on the local mis-assembly rates.
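    For readers less familiar with the metrics under discussion: plain N50 rewards contiguity alone, whereas NA50/NGA50 first break contigs at mis-assembly breakpoints detected by QUAST (NGA50 additionally measures against the reference genome length rather than the assembly length), which is why an assembly with long but mis-joined contigs can have a flattering N50. A minimal sketch of the plain N50 computation, assuming contig lengths are already at hand (the function name is illustrative, not from QUAST):

    ```python
    def n50(contig_lengths):
        """N50: the length of the shortest contig in the smallest set of
        longest contigs that together cover at least half of the total
        assembly length."""
        lengths = sorted(contig_lengths, reverse=True)
        half_total = sum(lengths) / 2
        running = 0
        for length in lengths:
            running += length
            if running >= half_total:
                return length


    # Example: total = 200 bp, half = 100 bp; the two longest contigs
    # (80 + 70 = 150 bp) are needed to pass half, so N50 = 70.
    print(n50([80, 70, 50]))
    ```

    NA50 would apply the same cumulative computation, but only after splitting each contig at every mis-assembly site, so it can only be equal to or smaller than N50.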


    Next to k-mer optimization, our recent PLoS ONE paper also investigated the effect of different genomic coverages on the assembly outcome by performing in-depth random sub-sampling. For Illumina and Ion Torrent data we observed the very same effect, i.e., that an “excessive number of reads does not help and can even harm genome assembly”. We believe this finding is of major practical importance, as lower experimental coverage means a lower price per sample! Looking at the results in your Figure S2, there seems to be quite high variation across the 100 random data sets for each coverage cutoff (many points lie outside the quartiles), which is very interesting. Have you tested whether this variance was mainly induced by the sub-sampling procedure, or was it caused by, or cumulative with, the assembler? Have you applied the same method to the MiSeq assemblies, and if so, what was the variance there?
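    The kind of coverage-titration experiment described above can be sketched as repeated random draws of reads (without replacement) until a target coverage is reached, with a different random seed per replicate; this is a hedged illustration of the general procedure, not the specific pipeline used in either paper, and all names are illustrative:

    ```python
    import random


    def subsample_reads(reads, genome_size, target_coverage, seed=None):
        """Randomly pick reads (without replacement) until the summed
        read length reaches target_coverage * genome_size bases.
        'reads' is a list of sequence strings."""
        rng = random.Random(seed)  # separate seed per replicate
        target_bases = target_coverage * genome_size
        shuffled = reads[:]
        rng.shuffle(shuffled)
        picked, total = [], 0
        for read in shuffled:
            if total >= target_bases:
                break
            picked.append(read)
            total += len(read)
        return picked


    # Example: 50 reads of 100 bp, 1 kb genome, 20x target coverage
    # -> 2,000 bases, i.e. 20 reads are drawn per replicate.
    reads = ["A" * 100] * 50
    replicate = subsample_reads(reads, genome_size=1_000,
                                target_coverage=20, seed=1)
    print(len(replicate))
    ```

    Running the downstream assembler on each such replicate, and comparing the spread of assembly metrics across seeds with the spread across assemblers, would be one way to separate sub-sampling-induced variance from assembler-induced variance.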

    Competing interests

    none declared
