Subset selection of high-depth next generation sequencing reads for de novo genome assembly using MapReduce framework
© Fang et al. 2015
Published: 9 December 2015
Recent progress in next-generation sequencing technology has afforded several improvements such as ultra-high throughput at low cost, very high read quality, and substantially increased sequencing depth. State-of-the-art high-throughput sequencers, such as the Illumina MiSeq system, can generate ~15 Gbp sequencing data per run, with >80% bases above Q30 and a sequencing depth of up to several 1000x for small genomes. Illumina HiSeq 2500 is capable of generating up to 1 Tbp per run, with >80% bases above Q30 and often >100x sequencing depth for large genomes. To speed up otherwise time-consuming genome assembly and/or to obtain a skeleton of the assembly quickly for scaffolding or progressive assembly, methods for noise removal and reduction of redundancy in the original data, with almost equal or better assembly results, are worth studying.
We developed two subset selection methods for single-end reads and a method for paired-end reads based on base quality scores and other read analytic tools using the MapReduce framework. We proposed two strategies to select reads: MinimalQ and ProductQ. MinimalQ selects reads with minimal base-quality above a threshold. ProductQ selects reads with probability of no incorrect base above a threshold. In the single-end experiments, we used Escherichia coli and Bacillus cereus datasets of MiSeq, Velvet assembler for genome assembly, and GAGE benchmark tools for result evaluation. In the paired-end experiments, we used the giant grouper (Epinephelus lanceolatus) dataset of HiSeq, ALLPATHS-LG genome assembler, and QUAST quality assessment tool for comparing genome assemblies of the original set and the subset. The results show that subset selection not only can speed up the genome assembly but also can produce substantially longer scaffolds. Availability: The software is freely available at https://github.com/moneycat/QReadSelector.
With the introduction of next-generation-sequencing technology, a vast amount of sequencing data can be generated in a short period of time. A major application in genome sequencing is de novo assembly, which aligns overlapping reads into super-sequences known as contigs and uses paired-end (PE) reads to further connect contigs into scaffolds . To produce longer contigs and scaffolds, sequencing data with sufficient sequencing depth and low error rate are required. However, DNA sequencing reads from Illumina sequencers have previously generated errors at the rate of 0.5-2.5% , forcing researchers to develop various error correction algorithms in order to be able to use as many sequencing reads as possible. Recently, state-of-the-art high-throughput sequencers, such as the Illumina MiSeq series, have been reported to generate sequencing reads of around 2500x sequencing depth in small genomes, with >80% of bases above Q30 [3, 4]. Another example is the Illumina HiSeq 2500 that is capable of generating up to 1 Tbp per run with >80% bases above Q30 . Sequencing depth of the HiSeq data is often 100x or more for large genomes. The availability of such high sequencing depth and high-quality reads leads us to wonder if it is possible to select useful reads and read pairs from the original sequencing data, in order to assemble genomes without affecting assembly results or with even better results.
Datasets and preprocessing
The sequencing datasets used in the experiments.
2 × 300 bp
2 × 300 bp
2 × 200 bp
Mean quality score
% Bases with quality score > 30
To test subset selection for more complex genomes, we included the grouper NGS data, generated by HiSeq 2500, as the third dataset (Table 1). The grouper dataset consist of two PE libraries with a read length configuration of 2 × 200 bp and insert lengths 400 bp and 500 bp. The two libraries are similar in size and have the total size of 125G bp after adaptor and quality trimming by Trim Galore scripts . The base quality score distribution and cumulative distribution of the grouper dataset are given in Additional file 3. In addition, there are five mate-pair libraries, with insert lengths ~2K, ~4K, ~6K, ~8K, ~10K bp, of the grouper. The size of each mate-pair library is ~4.4G bp. Note that the grouper dataset is sequenced by Prof. Lin's team (coauther of this paper) and is under preparation for publication.
The sequencing data were stored in FASTQ format, which provides information on the sequence identifier, read sequence and quality values for each base. The quality values are in ASCII format, and can be transformed into a probability p, which indicates the probability of the corresponding base call being incorrect. The quality value information contained in FASTQ files enables the selection of reads based on quality values. In order to process large-scale datasets more quickly, we developed preprocessing and analytic programs using Hadoop  and MapReduce  framework. The FASTQ format were converted into the key-value format before further processing, where the key field is the read identifier, and the read sequence and quality values are put as two fields of the value.
Subset selection for single-end reads
Here, we propose two strategies to select a subset of reads based on quality value of each base. We use Velvet  to assemble the subset of single-end reads with k-mer size 221 (the reason is given in Additional file 4).
Results of read subset selection for single-end reads
Our investigation was aimed at determining whether it is possible to select correct reads from raw data, in order to assemble contigs without affecting contig N50 result, achieving high sequence depth with high quality data. In order to determine this, we designed our experiments for single-end reads in three steps.
Step 1: Different subsets of reads were selected using MinimalQ and ProductQ strategies.
Step 2: Velvet assembler was used to obtain contigs
Step 3: The assembly results were evaluated using GAGE benchmark
The detailed results are shown in Additional files 5 and 6. The first column of these Tables represents the subsets of reads that were assembled. "Q>x" represents reads with minimal quality under x, that were filtered out. "PQ>yy" represents reads with minimal correctness probability under 0.yy, that were filtered out. The reference genomes obtained from NCBI library enabled the calculation of the sequencing depth of the selected reads, and the determination of the percentage of reads left over compared to the original sequence data. The information in the rest of the columns was obtained using GAGE  tools. In Additional files 5 and 6, the variances in columns coverage%, #misjoin, and #indel are relatively slight, but the contig N50 and #contig columns, no matter uncorrected or corrected by GAGE, changes significantly.
Subset selection for paired-end reads
PE subset selection selects not only reads but also the paired relations, and it directly affects both results of contigs and scaffolds. A feasible and reasonable way to PE subset selection is to treat a pair of reads as a whole and use the MinimalQ or ProductQ to select the pair is removed or not. Since the constraint of MinimalQ is stricter to obtain an accurate PE subset, we chose MinimalQ as the method for the experiments of PE subset selection. That is, a pair of reads will be selected if the minimal quality value of the two reads is larger than a given threshold. We used the grouper dataset (Table 1) in the experiment of PE selection. We first selected a PE subset of the grouper dataset by MinimalQ, and then used ALLPATHS-LG  to assemble the subset into contigs and scaffolds. Since ALLPATHS-LG requires both PE libraries and mate-pair libraries as the input, the five mate-pair libraries mentioned in the Datasets section were also used. To compare genome assemblies of the original set and the subset without the reference genome, we used QUAST  for quality assessment.
Results of PE subset selection
Comparing the assembly results of PE subset selection for the grouper dataset.
Dataset size (G bp)
# read pairs
Mean length of reads
%GC content of reads
Assembly statistics 1
Total contig length
N50 contig size (K bp)
Total scaffold length
Largest scaffold length
N50 scaffold size (K bp) ( L50 number)2
3,354 (97 scaffolds)
5,443 (61 scaffolds)
N75 scaffold size (K bp) (L75 number)2
1,429 (218 scaffolds)
2,493 (131 scaffolds)
%GC of scaffolds
# 'N's per 100K bp
# scaffolds for 1G bp3
Discussion and conclusions
We proposed the subset selection problem of high-depth reads for de novo genome assembly and developed two selection strategies, MinimalQ and ProductQ, to select subsets of reads and paired ends. The experiments of read subset selection on two bacteria datasets (Figures 5, 6 and Tables S2-S3) show that both the selection strategies can largely reduce the subset size with graceful decay of the corrected contig N50 and possibly with even better corrected contig N50 sizes. Meanwhile, the results of the experiments of PE subset selection on the grouper data (Table 2) are more promising. It shows that the PE subset reduced much of the runtime and generated substantially longer scaffolds with >10% less unknown bases compared to the original data.
One important issue is how to determine the thresholds of MinimalQ and ProductQ. This issue is affected by multiple factors, including sufficient coverage depths for genome assemblies, characteristics of genome assemblers (e.g., tolerance of variance in coverage depths), and characteristics of datasets and genomes (e.g., read biases and genomic structures that affect assemblies will make the subset selection harder). One feasible solution is to select a small subset initially and perform the assembly to get the contig/scaffold N50. Note that the minimal-quality thresholds of the subset selection methods can be obtained at the x-axis of the cumulative-percentage Figures 1b, 2b, 3b, 4b and 7b by choosing a percentage of subset size at the y-axis. Then we can relax the thresholds to select until a satisfying contig/scaffold N50 is obtained. To speed up the aforementioned solution, we suggest determining the initial subset sizes by sufficient coverage depths. Desai et al suggest that 50x data is enough to get good genome coverage for assemblies of small and moderate sized genomes . But if the goal is to get the longer contig N50, their results show that the higher depths are still useful. Note that in de novo genome assemblies, the genome size is unknown but can be estimated by computing the total number of k-mers in reads divided by the k-mer coverage depth, and then the estimated coverage depth is the total bases of reads divided by the estimated genome size .
Despite the aforementioned results already show the usefulness and potentials of the subset selection problem, there are not-yet-solved questions and limitations observed. First, it is difficult to determine the optimal thresholds to get the best subsets producing the best scaffolds/contigs without a certain amount of trial-and-errors. Besides, the functionality of the read selection strategies may be dependent on the datasets involved. For example, we can obtain better corrected N50 using 15% of the original data for the E. coli dataset; but for the B. cereus dataset, we can only obtain a satisfactory corrected N50 using around 50% of the data. In future work, we will investigate the reasons for the results of the PE subset selection experiments to try to understand how the dataset characteristics and ALLPATHS-LG characteristics affect the results and then improve the subset selection methods. In addition, we plan to integrate the proposed subset selection methods into the CloudDOE software  to improve usability.
In order to handle large-scale data faster, we developed several tools in Java for preprocessing and analyzing data using MapReduce framework. We developed two pipelines of subset selection for single-end reads, i.e., the MinimalQ pipeline and the ProductQ pipeline. The MinimalQ pipeline contains five main steps, including 1) preprocessing (mentioned in Datasets and Preprocessing), 2) computing MinimalQ values (the MinimalQ program), 3) computing the MinimalQ statistics (the Qstatistics program), 4) analyzing the statistics and determining the thresholds (mentioned in Results and Discussion), and 5) obtaining the selected subset (the MinimalQFilter program). The ProductQ pipeline shares the aforementioned steps 1 and 4, and replaces the steps 2, 3, and 5 with the programs MinimalProductQ, MinimalProductQsta, and PQFilter respectively. The result generated by MinimalQ or MinimalProductQ program is a list of records, and each record contains a read followed by its corresponding minimal quality value (MQV) or correctness score, respectively. The Qstatistics program can produce two types of quality statistics, depending on the input type. It generates the distribution of base quality scores for raw data as the input. For the MinimalQ file as input, the Qstatistics program generates the distribution of MQV. The MinimalProductQsta program takes MinimalProductQ program's output as input and generates the distribution of correctness score. The MinimalQFilter program allows users to set a threshold for selecting the reads with MQV above the threshold; similarly, the PQFilter program is for selecting the reads with correctness score above a given threshold. Both the outputs of MinimalQFilter and PQFilter programs are in FASTA format.
For selecting paired-end (PE) reads, we developed two MapReduce programs PEMQExtractor, PEMinimalQFilter and a single machine program PEMQsta. The PEMQExtractor program takes shuffled FASTQ data as input to extract a MQV pair for each pair of reads. The PEMQsta program reads the MQV pairs, computes the minimal value of each paired MQVs as the PE-MQV, and generates the distribution of PE-MQV. The PEMinimalQFilter program allows users to set a threshold for selecting the PE reads with PE-MQVs above the threshold.
The authors wish to thank anonymous reviewers for their useful suggestions and valuable comments. This research is partially supported by Ministry of Science and Technology of Taiwan under grant NSC 102-2221-E-0001-013-MY3, and System Management and Content Retrieval Technologies for Supporting Cloud-based Digital Archive Systems and Services of Academia Sinica.
Publication charges for this article have been funded by Academia Sinica.
This article has been published as part of BMC Genomics Volume 16 Supplement 12, 2015: Joint 26th Genome Informatics Workshop and 14th International Conference on Bioinformatics: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S12.
- Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95 (6): 315-327.PubMedPubMed CentralView ArticleGoogle Scholar
- Kelley DR, Schatz MC, Salzberg SL: Quake: quality-aware detection and correction of sequencing errors. Genome Biology. 2010, 11 (11): R116-PubMedPubMed CentralView ArticleGoogle Scholar
- MiSeq Performance Specifications [Internet]. [cited 5 Jul 2015]. [http://www.illumina.com/systems/miseq/performance_specifications.html]
- MiSeq Scientific Data [Internet]. [cited 5 Jul 2015]. [http://www.illumina.com/systems/miseq/scientific_data.html]
- Specifications for HiSeq 2500 [Internet]. [cited 5 Jul 2015]. [http://www.illumina.com/systems/hiseq_2500_1500/performance_specifications.html]
- Babraham Bioinformatics - Trim Galore! [Internet]. [cited 5 Jul 2015]. [http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/]
- Welcome to Apache™ Hadoop®! [Internet]. [cited 5 Jul 2015]. [https://hadoop.apache.org/]
- Dean J, Ghemawat S: MapReduce: Simplified data processing on large clusters. Communications of the ACM. 2008, 51: 107-113.View ArticleGoogle Scholar
- Zerbino D, Birney E: Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Research. 2008, 18 (5): 821-829.PubMedPubMed CentralView ArticleGoogle Scholar
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, et al: GAGE: A Critical Evaluation of Genome Assemblies and Assembly Algorithms. Genome Res. 2012, 22 (3): 557-567.PubMedPubMed CentralView ArticleGoogle Scholar
- Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, et al: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A. 2011, 108 (4): 1513-1518.PubMedPubMed CentralView ArticleGoogle Scholar
- Gurevich A, Saveliev V, Vyahhi N, Tesler G: QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013, 29 (8): 1072-1075.PubMedPubMed CentralView ArticleGoogle Scholar
- Desai A, Marwah VS, Yadav A, Jha V, Dhaygude K, Bangar U, et al: Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data. PLoS One. 2013, 8 (4):Google Scholar
- Chung W-C, Chen C-C, Ho J-M, Lin C-Y, Hsu W-L, Wang Y-C, Lee DT, Lai F, Huang C-W, Chang Y-J: CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce. PLoS One. 2014, 9 (6): e98146-PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.