New quality control measures
All sequencing platform vendors provide quality scores for individual bases in the sequence reads. These quality scores are utilized by TileQC[1] and PIQA[2], as well as vendor provided solutions for plotting quality score trend in sequencing cycles and quality score distribution in each tile, which is the unit of image capture for the Illumina platform. TileQC also plots genomic alignment results based on number of mismatches on individual tiles and such graphs are very helpful for understanding the impact of quality issues on sequence alignment.
Besides incorporating key QC measures from existing solutions, we add the following novel QC measures in NGSQC for detecting quality issues that other solutions cannot identify:
1. Base/color code distribution across each tile/panel. In the sequencing data that we have processed, we have noticed that the percentage of a specific base (e.g., percentage of A bases in all called bases in Illumina sequencing or percentage of color code 0 in Applied Biosystem sequencing) in a fixed size area often exhibits non-random patterns across the sequencing surface (e.g., Figures 1C). Although the base composition (or the dibase composition in the case of SOLiD sequencing) is unique for each sample, we would expect that the distribution of sequence fragments on the sequencing surfaces be random. As a result, the distribution of base/color code should be fairly consistent across the sequencing surface if the surface area used for deriving the base percentage includes hundreds or even only dozens of sequences. Any deviation from the expected random distribution of sequences, e.g., the uneven concentration of A base rich sequences in a sub-region of the sequencing surface, would suggest problems in sequencing assays.
2. Base/color code percentage plot in full sequencing cycle for all sequences in each tile/panel. Similarly, we noticed cyclic patterns for the percentage of specific base/color code call along the full sequencing process, in particular in the later cycles of the sequencing assay for some samples (Figure 2). Since the widely-used quality score plots cannot predict base/color code bias, the base/color code percentage trend plot can reveal new sources of sequencing problems not detectable by the quality score plots.
3. Summary of all tile/panels based on x-y coordinates in each tile/panel: In our analysis of deep sequencing data, we also noticed that summarizing a specific quality measure (e.g., quality score or base percentage) for the same xy location for all tiles/panels in a sample can sometimes reveal patterns not obvious when these measures are examined at individual tile/panel levels. Conceivably, the summary across all tiles/panels will make weak but repetitive or consistent biases that the sequencing system introduced on each tile/panel more obvious. A common problem is likely to be the inappropriate imaging capture system setup, as donut-like patterns or stripes can sometimes be seen on the data we examined.
4. Full sample view: Since the individual tile/panel based visualization cannot capture large scale quality pattern across the full sample assay, NGSQC provides full sample view of several quality control measures derived from individual tiles. There are two different views for all tiles/panels in a sample: 1) heat maps for all tile/panels based on their original spatial relationships on the sequencing chip/slide. 2) cycle-based plot for QC measures derived from all tiles/panels in individual rows or columns in a sample. These graphs allow users to quickly identify the pattern of problematic tiles/panels, and thus can use tile/panel location as a filter in sequencing analysis.
5. Quality measures for paired-end/mate pair libraries: Although sequencing library preparation is not part of the deep sequencing assay per se, paired end/mate pair sequencing library preparation procedures have major influences on the final sequencing results. We included two quality measures, the distribution of distance between sequencing reads from the two ends and percentage of chimeras, in the NGSQC package for paired-end/mate pair libraries.
Identifying potential quality issues in sequence reads related to specific biological conclusions
An important reason that quality control analysis is not widely used by biomedical researchers is that although existing solutions can reveal a number of quality issues, it is hard to investigate the biological consequences of these quality issues. A key feature provided by NGSQC is the ability to link sequence reads related to a specific biological conclusion (e.g., a list of differentially expressed genes, novel alternative splicing variants, etc.) to potential quality issues. NGSQC will plot the user-provided sequence lists back to the original sequencing surface as sequence distribution heat maps, which can help identifying potential quality issues related to specific biological results.
The most straightforward method is to examine the sequence read distribution in the sample/tile maps. Since we expect sequences related to a specific biological conclusion be distributed relatively evenly across the sequencing assay surface, any significant deviation from random distribution, either at the full sample overview or at the all tile/panel summary view, will suggest the involvement of sequencing quality issues. The expected random distribution of sequencing reads is a surprisingly powerful property that has not previously been utilized in quality control solutions. We believe that the deviation of random distribution can detect many types of quality issues even if we do not understand the underlying mechanisms.
In addition, our solution provides side-by-side comparison of the user-defined sequence reads distribution with different types of quality control heat maps generated by NGSQC for the full sequencing assay. The side-by-side comparison allows users to quickly identify situations where most of the sequence reads related to a specific biological conclusion are from problematic regions revealed by the heat map of a specific quality measure through visual pattern match. This approach is particularly useful for identifying quality issues in a small number of sequence reads related to rare genome structure changes or transcriptome variants, as it is hard to judge the randomness of distribution for a small number of sequence reads.
Multiple deep sequencing platform support
Currently, neither TileQC nor PIQA supports sequencing data from the Applied Biosystems SOLiD platform, which is gaining popularity for SNP identification and genomic structure analysis. It is highly likely new deep sequencing platforms will become available in the coming years. NGSQC is designed to support all sequencing technologies that perform sequencing on a two dimensional surface, as long as they provide methods for linking individual sequence reads to specific locations on the sequencing surface.
In NGSQC, we use the sequencing format files to capture the two-dimensional arrangement of different sequencing assay types for a sample. For each new sequencing format from an existing or new sequencing platform, we will add a new sequencing format file that defines the arrangement of tiles/panels in each sample as well as the number of x and y values in each tile/panel.
For each analysis, NGSQC will ask a user to select a sequencing format file out of a list of all supported platforms. NGSQC will find corresponding sequencing format file under the conf folder, which is created during NGSQC pipeline setup, for generating the heat maps as well as tile/panel row/column based statistics. For example, if the user selects 'Illumina_120', the pipeline draws the image with file 'conf/Illumina_120.tile', which is an ASCII file for the one lane per sample sequencing on the current generation of Ilumina Genome Analyzer.
Since the sequencing format files are ASCII files, users of the NGSQC package can follow the example of existing sequencing format files for new sequencing assay formats or new platforms. However, if a new sequencing platform is very different, we will be happy to work with users and the corresponding vendor to generate new sequencing format files.
NGSQC software architecture
NGSQC performs various analysis and reporting tasks in a pipeline. Its workflow is controlled by a GNU Make file, which calls a number of programs for sequence alignment (BOWTIE [9]), quality graph generation (gnuplot [10]), as well as custom LINUX scripts and Python programs for processing various text files.
Users of NGSQC can take advantage easily of multi-processor/core computers by specifying a number of jobs that can run simultaneously. NGSQC can be run on LINUX clusters too. Both Torque and SUN/Oracle Grid Engine are supported.
NGSQC usage
To set up the NGSQC pipeline, a user needs to download a ngsqc_version.tar.gz file at http://brainarray.mbni.med.umich.edu/brainarray/ngsqc/. After unzipping it, he/she needs to go to the ngsqc folder and 1) copy genome (or transcriptome) fasta file to INPUT/fasta. 2) if the sequence reads are from the Illumina platform, copy the fastq file to INPUT/solexa. If the sequence reads are from the Applied Biosystems platform, copy both the csfasta file and qual file to INPUT/solid.
If a user wants to analyze a different type of target sequence or perform quality analysis for another species, he/she needs to perform the above procedures at another location and copy the required files.
For paired-end analysis, a user needs to create a file '<PAIRID>.pair' under either sub-folder INPUT/solid or INPUT/solexa. This file has two lines: the first line 'pair: <END1> <END2>' defines the two sequence read file names separated by a white space. These two sequence files contain sequence reads from two ends of the same sequencing library. Naturally, sequence read files END1 and END2 must exist under INPUT/solexa or INPUT/solid. The second line is 'range: 500 5000', which defines expected minimum and maximum distance between the sequence reads from two ends. In the above example, NGSQC will consider any paired-end reads with distance within 500 bp or over 5000 bp as abnormal. The NGSQC will report the number of such paired-end reads as quality measures for paired-end/mate pair library preparation.
To run the NGSQC, a user only needs to type 'make'. Users can also tweak some run parameters in the 'Environment Section' of the makefile. For example, if NGSQC is run on a big memory computer, a user might want to specify a bigger memory for program 'sort' to get better performance.
To view the result, a user can directly access file 'index.html' under the folder OUTPUT. He/she can also run 'make httpserver', which will start a web server using 8080 as the default port. The default port can be specified in makefile too.
If a user wants to perform QC analysis for several lists of sequence reads related to specific biological targets or conclusions, he/she can create sub-folders under INPUT/solexa or INPUT/solid, with one sub-folder for each sequence list. The subfolder name should be the same as the name of the corresponding sequence read file but without a suffix. Each line of such a custom sequence read file should be the sequence read ID of the corresponding sequence. After copying the custom sequence read files to their destination folders, the user only needs to run ‘make’ again to generate read distribution maps for a side-by-side comparison of the full sample QC heat maps described in earlier sections.
To utilize multi-core/CPU computer, a user can run 'make -j N', where N is a number, to run N analyses simultaneously. To run the pipeline on cluster, a user needs to modify the 'CLUSTER' parameter to either 'SGE' or 'TORQUE' in Makefile, and run 'make'. All analyses will be split into multiple smaller units and then be submitted to the cluster. The NGSQC pipeline will ensure smaller analysis units with dependency relationships executed in proper order.
Detailed usage is described in the 'README' file. The NGSQC package comes with some sample files and a user can just run 'make' for a test after unzipping. The NGSQC project website at http://brainarray.mbni.med.umich.edu/brainarray/ngsqc/ will also provide updated usage information, sample data files and sample output.
NGSQC output
NGSQC generates an html file that organizes the quality measures in a three-layered hierarchical output. At the first level, quality control outputs are listed by lane/sample names used in the input data file. Once a sample/lane name is clicked, it will open a window displaying the quality control measures associated with this sample. Clicking on individual quality measures will show the corresponding quality analysis results in heat maps, line graphs or bar charts. The “<help>” signs on the right side of each QC measure are linked to examples on the NGSQC project webpage, if more detailed explanations are needed. The following is a brief summary of the output.
1. Full Sample View: provide sample level overview of several QC measures, including the distribution base/color code, target hit count under different mismatch criteria and target hit levels (unique or multiple), sequencing read density, and quality score based on the corresponding average values for each tile/panel used by a sample. The results are presented in the same spatial layout as it is the deep sequencing assay to facilitate a quick identification of trends/patterns of quality issues in the whole sample assay.
2. All Tiles/Panels Summary View: The above quality measures from all tiles/panels are used to create heatmaps based on individual x-y locations on the two dimensional tile/panel surfaces. This set of QC graphs is designed for detecting QC problems that are repeated for every tile/panel.
3. Individual tile/panel QC: Individual tile QC maps can be used for identifying QC issues in individual tiles. To facilitate quick identification of problematic tiles/panels, we try to rank the unevenness of two measures, the read count and the genomic hit on each tile/panel, across x-y coordinates. Currently, we use a simple fixed grid for detecting unevenness.
4. Cycle-based QC plot: the average of quality measures from all tiles/panels as well as rows and columns of tiles/panels plotted against the base/color position in the sequence reads. We also include the mismatch rate derived from unique genome hits for each base position along the full-length sequence reads. These graphs will not only help to identify cycle-specific sequencing issues but also spatial-related issues based on tile/panel rows and columns.
5. Target hit plot: These graphs present sequence alignment results across the target genome or transcriptome sequences. They are useful for identifying uneven distribution of sequences when compared to reference targets or help to identify sequence structural differences between the sample and the reference genome/transcriptome.
6. QC for user-defined sequence lists. If a user analyzes the lists of sequences related to specific biological conclusions, the resulting QC data will be listed under the above categories in the output. The QC graphs for user-defined sequences will be displayed side-by-side with the QC graph for all sequences in the sample thus users can quickly find out whether their interested sequences are from problematic regions of sequencing assay.
7. Library QC for paired-end/mate-pair sequencing: The paired-end/mate-pair library overview graph presents the percentage of good pairs (correct orientation on the same chromosome), unpaired reads, chimeric pairs from different chromosomes, chimeric pairs with the wrong orientation from the same chromosome, and chimeric pairs less than or greater than the user-defined library fragment range in a bar chart. The pair distance distribution plot can be used to judge whether the matched paired-end/mate-pair reads exhibit the correct distance distribution. The distance between each good pair is calculated by the starting position of the first end, minus the starting position of the second end, thus the distance values can be negative. We also plot pairs hitting different strands of the target separately. As a result, the pair distance distribution plot can also be used to detect strand bias.