SAMQA: error classification and validation of high-throughput sequenced read data
© Robinson et al; licensee BioMed Central Ltd. 2011
Received: 11 May 2011
Accepted: 18 August 2011
Published: 18 August 2011
Advances in high-throughput sequencing technologies and the growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data meet a minimum standard for use in downstream analysis. In this paper we present the SAMQA tool, which rapidly and robustly identifies errors in population-scale sequence data.
SAMQA has been applied to samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Because the majority of tasks exhibit a linearithmic speedup on a high-performance computing (HPC) framework, poor-quality data were identified prior to secondary analysis in significantly less time on the HPC framework than the same data processed using alternative parallelization strategies on a single server.
The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type.
The growth in high-throughput sequencing means that there is a need for standardized, robust, and scalable tools to perform quality assurance tests on the resulting data. These QA tools need to be extensible and must do more than check for instrument-specific problems. They need to check both adherence to known standards and for problems that may have arisen during sample preparation.
This paper introduces such a tool, built for analysis of data from The Cancer Genome Atlas (TCGA). This is the largest cancer genomics study to date, characterizing thousands of patients across 20 different cancers. The potential to discover new mechanisms and therapeutics from such a large-scale project is hugely important to the cancer community. However, the scale of the genomic data being generated by TCGA alone has already outpaced gains in computational power (associated with Moore's Law), making available analysis tools unusable on standard hardware. New sequencing technologies and the increasing interest in population-scale genomic analysis will only exacerbate the computational problems. Within low-throughput genomic data, many errors can be characterized simply using the Sequence Alignment/Map (SAM) specification, and tools are available for reading and manipulating sequence files. However, as the size and scale of genomic data increase, these tools often struggle to perform at the necessary speeds (or at all). SAMQA provides quality checking using a high-performance computing framework to quickly and automatically capture and report errors across population-scale data. SAMQA has been designed to use the latest advances in high-performance computing to ensure it is able to scale from gigabytes to petabytes of genome sequences regardless of sequencing platform, depth of coverage, or data type.
An important step in the analysis of these large-scale genomic data is verification of a minimum standard of quality. With large-scale sequencing projects there are a number of data quality issues that must be addressed, as biases are introduced at multiple levels. In population-level investigations, data quality can significantly influence secondary analysis, especially when looking at cancer data, where genomic variation is high and largely uncharacterized. Within TCGA alone, issues affecting the quality of sequence data range from sample collection to selection of sequence alignment tools:
• Biological sample collection. Despite the use of standard clinical methods and procedures, the samples may lack consistent purity  resulting in highly variable data within the population.
• Laboratory methods. Coverage discrepancies can result from the use of specific genome amplification methods  (e.g. GenomiPhi, Repli-G, PCR), potentially differentially representing areas of the genome.
• Instrumentation bias. Sequencing instruments are known to introduce specific anomalies, from Illumina's G-C bias to 454 Life Sciences' errors in regions that vary too much from the reference sequence.
• Bioinformatics. Multiple tools and tool versions are available to align reads to a reference genome. Within TCGA versions of BWA , Bowtie , GATK , Maq , and BFAST  are all used for sequence alignment. This results in data with a variety of computational errors as reads are matched and scored.
The next section discusses the types of tests that SAMQA uses to identify technical errors in the structure of the sequenced read data, as well as biological errors, which are assessed by extracting features from the data that correspond to likely anomalies. The Results section describes how the high-performance computing system has been designed to scale to the required level and gives an example QA analysis undertaken across approximately 400 genome files.
When attempting to classify anomalous data sets, we must begin with the question, "What constitutes an anomaly?" This is a difficult question to answer when we expect our data sets to be highly divergent: the samples can be gathered from different cohorts and populations; they typically contain a high number of polymorphisms and structural variations that differ from the reference sequence(s); and they often contain batch effects due to sample collection and instrumentation differences.
Our approach has been to classify anomalies in sequence data sets relative to individual reads within a sample. The detection of these anomalies can be completely automated through the use of technical specifications, while also allowing for the inclusion of data-specific and biologically specific tests that require expert input.
SAMQA Technical Tests
Consistency of chromosome identifiers, presence of read groups and sequence dictionary, valid reference sequence.
Correct regular expression for flag value, query name, mapping position, mate mapping position, inferred insert size, etc.
Incorrect flags set (e.g. reads not mapped in a proper pair but with the proper-pair flag set), adjacent indels present in the CIGAR value, invalid or missing NM tag, and functionally dependent fields that do not align or are invalid.
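A technical test of this kind can be sketched directly from the SAM specification: an unmapped read (FLAG bit 0x4) must carry no CIGAR operations ("*") and a mapping quality of 0. The sketch below is illustrative only; the class and method names are our own, not SAMQA's API.

```java
// Hypothetical sketch of one SAM technical test. Field positions follow the
// tab-delimited SAM record layout: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, ...
public class SamTechnicalCheck {

    /** Returns an error message for a tab-delimited SAM record, or null if it passes. */
    public static String checkUnmappedRead(String samLine) {
        String[] f = samLine.split("\t");
        int flag = Integer.parseInt(f[1]);
        String mapq = f[4];
        String cigar = f[5];
        boolean unmapped = (flag & 0x4) != 0;  // FLAG bit 0x4: segment unmapped
        if (unmapped && !cigar.equals("*")) {
            return "CIGAR should have zero elements for unmapped read";
        }
        if (unmapped && !mapq.equals("0")) {
            return "MAPQ should be 0 for unmapped read";
        }
        return null; // record passes this test
    }
}
```

Checks like this are cheap per record, which is what makes them amenable to the parallel, per-read processing described later.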
SAMQA Biological Tests
Low Phred-adjusted mapping quality score
Shortened read lengths for a given sequencing technology
Low aggregate number of reads for a given sequencing technology
Low number of reads for a given set of kilobase regions
Low coverage for a given read group, chromosome, or kilobase region
High numbers of localized structural variation
Anomalous sequence data
Instances of "random" chromosomes from human assembly 
Population estimates of structural variation
Very high projected structural variation across different platform units
Read group correlation
Low mapping quality correlation for megabase regions, across read groups
Low coverage correlation of megabase regions, across read groups
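Several of the biological tests above reduce to binning reads into fixed-size genomic regions and flagging bins whose counts fall below a threshold. The following is a minimal sketch of that idea, assuming a one-kilobase bin size and an arbitrary cutoff; the class and method names are illustrative, not SAMQA's own.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of a per-kilobase read-count test: bin reads by their
// 1-based mapping positions and flag bins with anomalously low counts.
public class KilobaseBinTest {

    static final int BIN_SIZE = 1000; // one kilobase per bin

    /** Counts reads per kilobase bin from their 1-based mapping positions. */
    public static Map<Long, Integer> binReads(long[] positions) {
        Map<Long, Integer> bins = new TreeMap<Long, Integer>();
        for (long pos : positions) {
            long bin = (pos - 1) / BIN_SIZE;
            bins.merge(bin, 1, Integer::sum);
        }
        return bins;
    }

    /** A bin is flagged when its read count is below the cutoff. */
    public static boolean isLowCount(Map<Long, Integer> bins, long bin, int cutoff) {
        return bins.getOrDefault(bin, 0) < cutoff;
    }
}
```

The same binning scheme, applied per read group or per chromosome, underlies the coverage and read-count tests listed above.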
At its core, SAMQA is a series of tools that determine whether a sample has syntactic errors, using a series of heuristic measures to identify likely biologically improbable anomalies. The system is designed to be extensible so that further tests can be easily added. In implementation, SAMQA comprises two sets of tools that process Binary Alignment/Map (BAM) files using a defined standard (SAM) and expert analysis.
High Performance Computing
Even for low-coverage sequence files, traditional approaches that process these data on a local machine scale poorly relative to the volume of input data produced by sequencing methods. The SAMQA system has been designed to scale to meet the needs of investigations that may generate thousands of genomes. When high-performance computing (HPC) is required, the BAM processing layer is handled by the Hadoop-BAM project, with minor modification to allow BAM sequence indexing to occur entirely within the Hadoop Distributed File System (HDFS), a storage solution that spans disks in a cluster to provide one large, distributed file system [15, 16].
Table: Run Time for SAMQA on Hadoop. Aggregate time in minutes for one patient (23GB), ten patients (405GB), and one hundred patients (3,904GB), for each of the Mapping Ratio Test, Read Group Validation, Mapping Quality Test, File Structure Validation, Pearson Coverage Correlation, and Pearson Mapping Quality Correlation.
Table: Run Time of SAMQA on Hadoop vs. Standard Server. Hours on a single-core server vs. aggregate hours on a Hadoop cluster, for 1 sample (23GB), 10 samples (405GB), and 100 samples (3,904GB).
Each feature extraction tool outputs a series of key-value pairs in a human-readable, machine-parsable text format. Each key-value pair describes some feature of the data as defined by the tool being run. For example, the pair <BIN_example_0_COVERAGE, 9.4126293> has the compound key BIN_example_0_COVERAGE, which identifies the coverage feature for kilobase region 0 of our example sequence, and the value 9.4126293, which denotes the coverage over this region. While somewhat terse, all compound keys output by SAMQA are fully descriptive and uniquely identifiable. This simple representation is designed to ease integration with downstream analysis tools while remaining adaptable to the operating standards of those wishing to integrate the tool into their own workflows.
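The format is simple enough to emit and consume with a few lines of code. The sketch below, built from the example pair above, is an assumption about the serialization; it is not SAMQA's actual output code.

```java
// Minimal sketch of the "<KEY, VALUE>" feature-pair format described above.
public class FeaturePair {

    /** Formats a feature as "<KEY, VALUE>" in the human-readable style shown above. */
    public static String format(String key, double value) {
        return "<" + key + ", " + value + ">";
    }

    /** Extracts the numeric value from a formatted pair. */
    public static double parseValue(String pair) {
        int comma = pair.indexOf(", ");
        return Double.parseDouble(pair.substring(comma + 2, pair.length() - 1));
    }
}
```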
In Mapper-only jobs, the mappers perform parallel computations against isolated, atomic data fragments. In a full MapReduce operation, each Mapper performs input pre-processing, the Combiner (if specified) aggregates these intermediate results, and the Reducer performs the final calculations. This is especially vital for tests that span the entire input data, such as Pearson correlations or calculations of coverage.
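The Mapper/Reducer dataflow can be illustrated with a single-process sketch that computes a mean mapping quality: each "mapper" reduces its fragment to a partial (sum, count), and the "reducer" merges the partials into the final answer. This is a simulation for illustration only; class and method names are our own, and SAMQA's actual jobs run on Hadoop.

```java
import java.util.List;

// Single-process sketch of the MapReduce pattern: mappers emit partial
// aggregates over isolated fragments, and the reducer merges them.
public class MapReduceSketch {

    /** Partial result emitted by a mapper (and mergeable by a combiner). */
    static class Partial {
        final long sum, count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    /** Mapper: summarize one isolated fragment of mapping-quality scores. */
    static Partial map(int[] fragment) {
        long sum = 0;
        for (int q : fragment) sum += q;
        return new Partial(sum, fragment.length);
    }

    /** Reducer: merge all partials into the final mean. */
    static double reduce(List<Partial> partials) {
        long sum = 0, count = 0;
        for (Partial p : partials) { sum += p.sum; count += p.count; }
        return (double) sum / count;
    }
}
```

Because the partials are associative, a Combiner can merge them locally on each node before the shuffle, which is what keeps whole-input statistics tractable at scale.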
The SAMQA toolkit was developed to support work being undertaken at the Center for Systems Analysis of the Cancer Regulome, which is one of the TCGA Genome Data Analysis Centers.
COAD/READ SAMQA Results
236 COAD/READ exon capture sequence files:
• "CIGAR should have zero elements for unmapped read"
• Files were completely unpaired in sequencing
• Files contained unpaired reads
48 COAD/READ whole genome files:
• "CIGAR should have zero elements for unmapped read"
• "MAPQ should be 0 for unmapped read"
• "RG ID on SAMRecord not found in header"
The biological validation tests are used to further explore the data. This exploration allows for the identification of files that share properties likely to result in batch effects (e.g. lower average quality scores or coverage; see Figures 3 and 4), as well as individual outliers. The test results are output as a single file and can be read directly into an analysis program. The supplementary materials contain the output for the default tests run across the COAD/READ samples, as well as the Glioblastoma (GBM) and Ovarian (OV) cancer samples.
SAMQA is a QA analysis toolkit that runs a series of tests over sequenced read data and is optimized for large numbers of files. The tests both verify errors according to the SAM specification and assign scores reflecting the biological implausibility of structurally valid but dubious reads, which point to putatively erroneous samples. It provides a simple, extensible, and robust framework built on top of Google MapReduce and Apache Hadoop, and is capable of processing large volumes of data quickly in a highly parallel manner. We have used the tool for analysis over data sets of OV, GBM, and COAD/READ cancer data from TCGA. The software can be used, with minimal extension, to provide useful analysis of any form of sequenced read data in SAM-defined formats.
We believe that this tool is valuable to the medical research and bioinformatics communities specifically, as it provides a sanity check and second opinion that release data sets are valid. The tool can be used as part of an automated pipeline, and the HPC system means that the tool can be run repeatedly on increasingly large files as investigations evolve. SAMQA also provides an improvement in standards for release quality, which we feel is valuable in a community that relies heavily on custom and highly vendor-specific technologies for sequencing and data processing. SAMQA is released as a free and open-source tool to the community.
Availability and requirements
Project Name: SAMQA
Project home page: http://informatics.systemsbiology.net/project/samqa
Operating system: Platform independent.
Programming language: Java
Other requirements: Java 1.6 or higher, computational cluster or single server running Hadoop 0.20.2. Basic familiarity with Google MapReduce and Apache Hadoop. Further information on setting up and running a Hadoop cluster can be found in Hadoop's Quick Start guide, Cluster Setup guide, or in Tom White's book, Hadoop: The Definitive Guide.
License: Apache version 2.0
Acknowledgements and Funding
This work was supported by grant U24CA143835 and R01CA1374422 from the National Cancer Institute, P50GM076547 and R01GM087221 from the National Institute of General Medical Sciences, and NIH contract HHSN272200700038C from the National Institute of Allergy and Infectious Diseases. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
- The SAM Format Specification Working Group. The SAM Format Specification (v1.3-r882). 2010, [http://samtools.sourceforge.net/SAM1.pdf]
- Johnson PLF, Slatkin M: Accounting for bias from sequencing error in population genetic estimates. Molecular Biology and Evolution. 2008, 25 (1): 199
- Koboldt DC, Ding L, Mardis ER, Wilson RK: Challenges of sequencing human genomes. Briefings in Bioinformatics. 2010, 11 (5): 484. 10.1093/bib/bbq016
- Pinard R, De Winter A, Sarkis GJ, Gerstein MB, Tartaro KR, Plant RN, Egholm M, Rothberg JM, Leamon JH: Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics. 2006, 7 (1): 216. 10.1186/1471-2164-7-216
- Dunning MJ, Barbosa-Morais NL, Lynch AG, Tavaré S, Ritchie ME: Statistical issues in the analysis of Illumina data. BMC Bioinformatics. 2008, 9 (1): 85. 10.1186/1471-2105-9-85
- Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology. 2007, 8 (7): R143. 10.1186/gb-2007-8-7-r143
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754. 10.1093/bioinformatics/btp324
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10 (3): R25. 10.1186/gb-2009-10-3-r25
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M: The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research. 2010, 20 (9): 1297. 10.1101/gr.107524.110
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 2008, 18 (11): 1851. 10.1101/gr.078212.108
- Homer N, Merriman B, Nelson SF: BFAST: an alignment tool for large scale genome resequencing. PLoS One. 2009, 4 (11): e7767. 10.1371/journal.pone.0007767
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078. 10.1093/bioinformatics/btp352
- SAMTools: Picard. 2009, [http://picard.sourceforge.net/]
- Hadoop-BAM: SourceForge Project Page. 2010, [http://hadoop-bam.sourceforge.net]
- White T: Hadoop: The Definitive Guide. 2010, Yahoo Press
- The Apache Software Foundation: Hadoop. [http://hadoop.apache.org/]
- Dean J, Ghemawat S: MapReduce: Simplified data processing on large clusters. Communications of the ACM. 2008, 51 (1): 107-113. 10.1145/1327452.1327492
- Hadoop Quick Start Guide. 2010, [http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html]
- The Apache Software Foundation. Hadoop 0.20 Cluster Setup. 2009, [http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.