Running environment and input data formats
SUGAR is implemented in Java as an extended version of the quality control software FastQC [5], and runs on any operating system that has the Java Runtime Environment. Users operate SUGAR through a GUI that supports interactive analysis. Because the GUI follows the “FastQC style”, new users need little effort to learn it. SUGAR accepts the following types of sequence data as input: Fastq [6], Sequence Alignment/Map (SAM) [7], and Binary Alignment/Map (BAM) [7]. A reference sequence file is not required when BAM/SAM files are analyzed.
Heatmap generation
From the input file, SUGAR loads the X-Y coordinates, tile number, base quality values (QV) [8], and mapping quality (MapQ) of the sequence reads. It then generates high-resolution heatmaps that show the overall distribution of sequencing quality on the Illumina flowcell (Figure 1). A lane of the flowcell is divided into tiles, which correspond to the areas imaged in each scan during the nucleotide sequencing process [9]. For instance, a lane of the MiSeq version 2 and HiSeq2500 Rapid Run flowcells comprises 28 and 64 tiles, respectively. In the default setting of SUGAR, each tile is further divided into 100 subtiles (10×10 resolution). These subtiles are the units of data quality assessment, and the resulting score of each subtile is shown as a colored dot in the heatmap. Consequently, the heatmap reflects the spatial organization of the sequencing clusters and their qualities on the flowcell. The resolution of the heatmap (the number of subtiles/dots) can be changed, although a higher resolution requires more memory. SUGAR also has a downsampling option for quick, rough evaluation of data quality.
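The mapping from a read to its subtile can be illustrated with a minimal Python sketch. It is not SUGAR's Java implementation. It assumes read names in the Illumina Casava 1.8+ layout (instrument:run:flowcell:lane:tile:x:y), and the function names and the tile dimensions passed to `subtile_index` are hypothetical; in practice the tile extent would have to be taken from the data or the instrument.

```python
from typing import NamedTuple, Tuple


class ReadPosition(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_read_name(name: str) -> ReadPosition:
    """Extract lane, tile, and X-Y coordinates from a Casava 1.8+ style read name."""
    # Drop the leading '@' (Fastq) and any description after the first space.
    core = name.lstrip("@").split()[0]
    fields = core.split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name format: {name!r}")
    # Fields 4-7 are lane, tile, x, y in this layout.
    lane, tile, x, y = (int(v) for v in fields[3:7])
    return ReadPosition(lane, tile, x, y)


def subtile_index(x: int, y: int, tile_width: int, tile_height: int,
                  resolution: int = 10) -> Tuple[int, int]:
    """Return the (row, column) of the subtile containing (x, y).

    tile_width and tile_height are assumed to be known (hypothetical inputs).
    """
    # Clamp so that coordinates on the far edge fall into the last subtile.
    col = min(int(x * resolution / tile_width), resolution - 1)
    row = min(int(y * resolution / tile_height), resolution - 1)
    return row, col
```

With the default resolution of 10, each tile yields the 100 subtiles described above; raising `resolution` increases the number of subtiles that must be held in memory.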
Quality assessments
The overall quality of the sequence reads within each subtile is evaluated using four measures: (1) the proportion of low-quality reads in the subtile (Figure 1A), (2) the number of reads sequenced in the subtile (read density) (Figure 1B), (3) the average QV of the reads in the subtile (Figure 1C), and (4) the proportion of reads with low MapQ scores in the subtile (Figure 1D). The threshold QV that defines low-quality data (<30 in the default setting) can be changed. These heatmaps enable users to find possible technical errors in the sequencing process. In the result tabs and detailed popup windows of the “proportion of low-quality reads” (Figure 1A) and “average QV of the reads” (Figure 1C) modules, the weighting between top and bottom tiles in the heatmap representation can be changed manually. This allows a virtually three-dimensional evaluation of the distribution of low-quality spots in the flowcell, to infer whether a low-quality region is caused by a three-dimensional phenomenon (e.g., air bubbles or debris in the sequencing fluids) or a two-dimensional phenomenon (e.g., cracks on the flowcell or imaging errors).
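The four measures can be sketched per subtile as follows. This is an illustrative sketch under stated assumptions, not SUGAR's code: a read is treated as low quality when its mean QV falls below the threshold, the average QV is taken as the mean of per-read means, and `mapq_threshold` is a hypothetical parameter, since the text does not define these details.

```python
from collections import defaultdict
from statistics import mean


def summarize_subtiles(reads, qv_threshold=30, mapq_threshold=20):
    """Compute the four per-subtile measures.

    reads: iterable of (subtile_key, base_qualities, mapq), where mapq is
    None for unaligned input such as Fastq.
    """
    grouped = defaultdict(list)
    for key, quals, mapq in reads:
        # Store the mean QV of each read together with its MapQ.
        grouped[key].append((mean(quals), mapq))

    summary = {}
    for key, items in grouped.items():
        read_means = [m for m, _ in items]
        mapqs = [q for _, q in items if q is not None]
        summary[key] = {
            # (1) proportion of reads whose mean QV is below the threshold (assumed definition)
            "low_quality_fraction": sum(m < qv_threshold for m in read_means) / len(items),
            # (2) read density
            "read_density": len(items),
            # (3) average QV, here the mean of per-read means (assumed definition)
            "average_qv": mean(read_means),
            # (4) proportion of low-MapQ reads; None when no MapQ is available
            "low_mapq_fraction": (
                sum(q < mapq_threshold for q in mapqs) / len(mapqs) if mapqs else None
            ),
        }
    return summary
```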
Parameter setting and results evaluation
For the “proportion of low-quality reads” (Figure 1A), our empirical tests indicate that a threshold Phred score of 20 gives clear heatmaps. If the overall quality of the run is remarkably high (which can be checked with the Illumina BaseSpace console or the FastQC software), the threshold can be raised (e.g., to 30). If the overall quality is low, the threshold can be lowered (e.g., to 10). These parameter changes may show air bubbles or debris on a flowcell more clearly. The heatmaps of “average QV of the reads” (Figure 1C) complement the evaluation by “proportion of low-quality reads” (Figure 1A), because the latter does not take into account the quality scores of high-quality reads or their variation.
The read density heatmaps (Figure 1B) show the density distribution of the reads on a flowcell. In general, read-dense regions generate more reads of lower quality, while read-sparse regions generate fewer reads of higher quality. By comparing the read-quality heatmaps with the read-density heatmaps, users can examine whether low-quality regions are related to the read density and the DNA concentration loaded on the flowcell, which provides feedback for improving the DNA experiments. The mapping quality heatmaps (Figure 1D) enable users to examine whether detected variants came from low-quality regions of a flowcell. This type of analysis is not provided by other quality-control software; however, it would be particularly useful for careful examination of mutations found in high-coverage data, for example to improve analyses of somatic mutations, cancer cells, or mitochondrial heteroplasmy.
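To make the threshold values concrete, the standard Phred definition relates a quality score Q to a base-call error probability P by Q = -10 log10(P). The short Python sketch below (the function name is illustrative) converts the thresholds mentioned above: Phred 10, 20, and 30 correspond to error probabilities of about 10%, 1%, and 0.1%.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the base-call error probability."""
    # Standard Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10).
    return 10 ** (-q / 10)
```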
Predictions of data cleaning results
SUGAR also generates curve charts that predict the amount of data remaining after low-quality tiles/subtiles are removed (Figure 1E). In these charts, subtiles are ordered along the horizontal axis according to one of four quality indicators, which the user chooses when generating the graph: (1) read density, (2) average QV, (3) proportion of low-quality reads, or (4) average MapQ. The values of the selected quality indicator are plotted as a red curve. A green curve shows the predicted amount of data that remains after discarding the subtiles at a given threshold of the quality indicator shown by the red curve.
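The relationship between the two curves can be sketched as follows. This is a hedged illustration rather than SUGAR's implementation: it assumes the remaining amount is measured as a fraction of reads, and the function name and the `higher_is_better` flag are hypothetical.

```python
def remaining_data_curve(subtiles, higher_is_better=True):
    """Pair each indicator value with the fraction of reads that would remain.

    subtiles: list of (indicator_value, read_count) tuples.
    Returns a list of (indicator_value, fraction_remaining), worst subtile first.
    """
    total = sum(count for _, count in subtiles)
    # Order subtiles from worst to best according to the chosen indicator.
    ordered = sorted(subtiles, key=lambda s: s[0], reverse=not higher_is_better)

    curve = []
    remaining = total
    for value, count in ordered:
        # Fraction kept if every subtile ordered before this one is discarded.
        curve.append((value, remaining / total if total else 0.0))
        remaining -= count
    return curve
```

In this sketch the first element of each pair plays the role of the red curve and the second the green curve. For an indicator where lower values are better, such as the proportion of low-quality reads, `higher_is_better=False` reverses the ordering.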
Removing low-quality tiles/subtiles and data outputs
SUGAR cleans data in either a manual or an automated mode. In the manual mode, users select low-quality tiles/subtiles in the GUI to discard the reads within those regions, or select low-quality nucleotide positions in the tiles/subtiles to change the unreliable nucleotide calls in the original data to N bases. In the automated mode, SUGAR removes reads or changes nucleotides to N bases within low-quality tiles/subtiles automatically. The threshold QV and the amount of data to retain can be specified from the curve charts with guidance from the GUI (Figure 1E).
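The two cleaning operations, discarding reads and masking nucleotide positions, can be illustrated with a minimal Python sketch. The function name, argument names, and data layout are hypothetical and are not taken from SUGAR; nucleotide positions are 0-based in this sketch.

```python
def clean_reads(reads, discard_subtiles=frozenset(), mask_positions=None):
    """Discard or mask reads according to their subtile.

    reads: iterable of (name, sequence, subtile_key).
    discard_subtiles: subtile keys whose reads are removed entirely.
    mask_positions: dict mapping a subtile key to the set of 0-based
        nucleotide positions to replace with 'N'.
    """
    mask_positions = mask_positions or {}
    cleaned = []
    for name, seq, key in reads:
        if key in discard_subtiles:
            # Reads in discarded subtiles are dropped from the output.
            continue
        positions = mask_positions.get(key, ())
        if positions:
            # Replace only the selected unreliable base calls with N.
            seq = "".join("N" if i in positions else base
                          for i, base in enumerate(seq))
        cleaned.append((name, seq))
    return cleaned
```

Masking keeps the read length and the number of reads unchanged, whereas discarding reduces the amount of data.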