The effect of strand bias in Illumina short-read sequencing data
© Guo et al.; licensee BioMed Central Ltd. 2012
Received: 10 August 2012
Accepted: 9 November 2012
Published: 24 November 2012
Skip to main content
© Guo et al.; licensee BioMed Central Ltd. 2012
Received: 10 August 2012
Accepted: 9 November 2012
Published: 24 November 2012
When using Illumina high throughput short read data, sometimes the genotype inferred from the positive strand and negative strand are significantly different, with one homozygous and the other heterozygous. This phenomenon is known as strand bias. In this study, we used Illumina short-read sequencing data to evaluate the effect of strand bias on genotyping quality, and to explore the possible causes of strand bias.
We collected 22 breast cancer samples from 22 patients and sequenced their exome using the Illumina GAIIx machine. By comparing the consistency between the genotypes inferred from this sequencing data with the genotypes inferred from SNP chip data, we found that, when using sequencing data, SNPs with extreme strand bias did not have significantly lower consistency rates compared to SNPs with low or no strand bias. However, this result may be limited by the small subset of SNPs present in both the exome sequencing and the SNP chip data. We further compared the transition and transversion ratio and the number of novel non-synonymous SNPs between the SNPs with low or no strand bias and those with extreme strand bias, and found that SNPs with low or no strand bias have better overall quality. We also discovered that the strand bias occurs randomly at genomic positions across these samples, and observed no consistent pattern of strand bias location across samples. By comparing results from two different aligners, BWA and Bowtie, we found very consistent strand bias patterns. Thus strand bias is unlikely to be caused by alignment artifacts. We successfully replicated our results using two additional independent datasets with different capturing methods and Illumina sequencers.
Extreme strand bias indicates a potential high false-positive rate for SNPs.
Strand bias examples from real data
Forward Strand Genotype
Reverse Strand Genotype
Unbalanced strand mapping is a phenomenon when the number of reads mapped to forward and reverse strands are significantly different. In extreme cases, all reads are mapped to one strand, leaving the other strand completely uncovered. Unbalanced strand mapping is also considered to be a type of strand bias. Fundamentally, however, it is a different problem from strand bias in the calls. In a previous study , we have shown that unbalanced strand mapping is an artifact of the exome capturing mechanism and does not affect the quality of the genotyping. Thus, in the current study we focused only on strand bias related to the genotype call difference between the forward and reverse strands.
We randomly selected whole exome sequencing data for 22 breast cancer patients recruited to the Shanghai Breast Cancer Study (SBCS). The SBCS is a large, population-based case–control study of women in urban Shanghai, the details of which have been previously described [4, 5]. All patients had very early-onset (22–32 years old) breast cancer or early-onset (38–41 years old) plus a first-degree family history of breast cancer. Approval of the study was granted by the relevant institutional review boards in both China and the United States. Genomic DNA from buffy coat samples was extracted using QIAmp DNA kit (Qiagen, Valencia, CA) following the manufacturer’s protocol. All samples have been genotyped using the Affymetrix 6.0 array in a previous genome wide association study . All patients in this study signed written informed consent. The approvals of the study were given by the institutional review board of Shanghai Cancer Institute and Vanderbilt University.
Sequencing was performed at the Genome Service Lab at Hudson Alpha Institute. Data were 72-base paired-end reads generated from Illumina GA IIx machines. Each sample was run on a single lane of a flowcell. DNA enrichment was done using the Agilent SureSelect Human All Exon kit v1, which was designed to target 165,637 genomic regions (37.8 million bases; 71.6% inside exons; average length 228 bp).
We shifted the Illumina base quality scores (Phred+64) to the Sanger scale (Phred+33)  and performed an initial alignment to the National Center for Biotechnology Information human reference genome HG19 using the program Burrows-Wheeler Aligner (BWA) . We then marked duplicates with Picard and carried out regional realignment and quality score recalibration using the Genome Analysis Toolkit (GATK) . For variant calling, we only used reads with a mapping quality score (MAPQ) ≥20 (i.e., ≤1% probability of being wrong) and bases with base quality score (BQ) ≥20. We used GATK's Unified Genotyper to call SNPs simultaneously on all samples.
The same 22 samples were genotyped using the Affymetrix Genome-Wide Human SNP Array 6.0 which features 1.8 million genetic markers, including more than 906,600 SNPs. Approximately one-third of the SNPs on the Affymetrix Genome-Wide Human SNP Array 6.0 reside in the exome regions covered by the Agilent SureSelect Human All Exon kit v2.
Two additional independent datasets were used to validate our findings. The first additional data set contains six samples, randomly selected from the 1000 Genomes Project , that were sequenced on the Illumina GAII with capturing performed with an array based method. The second additional dataset contains six samples sequenced on the Illumina HiSeq 2000 sequencer, with capturing performed with the Illumina TruSeq capture kit. The variety of capture methods and sequencers used to collect those data provided more robustness for our study.
SB: SB is defined as . The calculation of SB has been used previously in a mitochondria heteroplasmy study 
GATK-SB: GATK-SB is the strand bias score calculated by GATK , and it is defined as Max
Fisher Score: The Fisher Score is derived from Fisher p-value and is calculated in the standard way using the 2 by 2 table. To ensure directional consistency with the SB and GATK-SB scores, the Fisher Score is defined as 1 minus the p-value.
Both SB and GATK-SB scores have ranges from 0 to infinity, while the Fisher score has a range from 0 to 1. For all 3 scores as we have defined them, lower values mean less strand bias and higher scores mean a more severe strand bias.
Genotype consistency between sequencing calls and genotyping chip calls has been used as a quality control for sequencing data . For example, GATK has a built-in tool that uses genotyping chip consistency as a SNP quality recalibration criterion. We performed a consistency analysis between the genotype inferred from the exome sequencing data and the genotype inferred from the SNP chip data. When inferring genotypes using sequencing data, all the genomic positions can be divided into two major categories: homozygous and heterozygous. Strand bias has no effect on the quality of homozygous genotype calls, because regardless of how severe the strand bias might be, it lacks sufficient influence to force the genotype caller to make a false heterozygous inference. Thus, in our analysis, we only considered heterozygous SNPs called by GATK's Unified Genotyper. The consistency is defined as the number of heterozygous SNPs with a consistent genotype between the exome sequencing data and the SNP chip data divided by all overlapped (with exome sequencing) heterozygous SNPs in the SNP chip data.
The SNPs present in both the Affymetrix 6.0 SNP Chip and the exome sequencing data are only a small percentage of the SNPs identified by exome sequencing. Thus, we computed other genotype quality control parameters such as the transition/transversion (Ti/Tv) ratio, and the number of novel non-synonymous SNPs. The Ti/Tv ratio is around 3.0 for SNPs inside exons and about 2.0 elsewhere ; it also differs between synonymous and non-synonymous SNPs . Because the target regions of exome capture kits often cover more than just exons, the Ti/Tv ratio for SNPs inside these target regions is expected to lie between 2.0 and 3.0 with the value depending on the fraction of exons inside target regions. We also compared the Ti/Tv ratios between SNPs with low or no strand bias and SNPs with extreme strand bias in novel SNPs and the SNPs reported in dbSNP. Furthermore, the number of novel non-synonymous SNPs can also be a very good indicator of the false positive rate. A study  have shown that only 200–300 novel nonsynonymous SNPs should be identified per person by exome sequencing; a higher number would likely indicate a higher false-positive rate.
To identify the cause of strand bias, we want to initially determine if the strand bias occurs systematically across subjects. Thus, we examined the strand bias score consistency between samples. For the 22 breast cancer samples, there are total of 231 possible pairs. For each pair, we selected positions in the top 20 percent of strand bias scores from one subject in the pair, and computed the Pearson correlation coefficient using the strand bias scores at the selected positions between the two samples in the pair. By only selecting the positions with high strand bias scores in one subject in the pair, we can effectively capture the scenario where two subjects have significantly different strand bias scores at the same positions. Box plots were used to show the distribution of the correlations across the 231 pairs.
We also hypothesized that post analysis procedures may also contribute to the cause of strand bias. After initial alignment, several popular enrichment steps are often used to reduce the genotyping false-positive rate. Such steps include local realignment, base quality score recalibration, base alignment quality recalibration (BAQ), and removal of duplicate reads. Based on the popularity of these steps, we examined the strand bias score using four different processing pipelines: (1) initial alignment (without performing any enrichment steps, denoted as initial alignment); (2) realignment, recalibration and removal of duplicates (denoted as realignment); (3) base alignment quality recalibration (denoted as BAQ); (4) local realignment plus BAQ (denoted as realignBAQ). We computed the Pearson's correlation coefficients, and plotted scatter plots of strand bias scores between each processing pipeline. Finally, we realigned our data using a second aligner Bowtie, and compared the strand bias scores computed from alignment bam files between Bowtie and BWA.
The 22 breast cancer patient samples sequenced with the Agilent SureSelect capture kit were taken from 2776 patients who were genotyped using the Affymetrix SNP 6.0 array in a genome-wide association study; detailed genotyping methods and stringent QC criteria were described in Zheng et al. . The original scan included three quality control samples in each 96-well plate, and the SNP calls showed a very high concordance rate (mean 99.9%; median 100%) for the quality control samples. In addition, 742 SNPs were genotyped using alternative genotyping platforms for a subset of subjects; these SNPs also had a high concordance rate with genotypes obtained from the SNP chip (mean 99.1%; median 99.8%). The SNP chip call rate for the 22 samples investigated here ranged from 97.83% to 97.84%.
Our sequencing data has high quality. Table S1 contains detailed summaries of the samples studied. For the 22 samples sequenced with the Agilent SureSelect capture kit, we obtained an average of 68.9 (range 44.6-78.2) million reads per subject, with 45x median depth for the SureSelect target regions. On average, 91.4% (88.4-93.8%) of the reads were aligned to the human reference genome, and 86.2% (82.1-89.3%) had an insert size ≤ 500. The six samples sequenced with the Illumina TrueSeq capture kit had an average of 93.8 (range 91.3-98.0) million reads and achieved 48x median depth for the TrueSeq target regions. On average, 95.1% (94.9-95.6%) of the reads had insert size ≤ 500. The six samples sequenced by the 1000 Genomes Project had an average of 67.9 (range 47.5-83.7) million reads and achieved 59x median depth for their target regions. On average, 89.7% (72.1-99.1%) of the reads had insert size ≤ 500.
SNP quality by MAF and subset 1
All Seq SNPs
Seq SNPs - Chip SNPs
dbSNP SNPs in Seq
dbSNP SNPs not on Chip
SB and GATK-SB difference
All analyses described were repeated using two additional independent datasets. Similar findings were observed without exception. The results for the additional datasets can be viewed in the supplementary material.
Due to its vast popularity and easily accessible data, we focused our study on exome sequencing data from the Illumina sequencing platform. We did not evaluate strand bias on data generated by other sequencing platforms, such as 454 Life Science and Applied Biosystems. We speculate that the same phenomenon exists for the Applied Biosystems sequencing platform, because its technology can generate a similar depth of data compared to Illumina’s platform. However, it would be hard to observe or study strand bias on the 454 Life Science’s sequencing platform, due to its limited depth.
An interesting finding from this study is that sequencing data and genotyping chip consistency might not represent the whole picture of sequencing data quality, as we previously thought. The overlapping subset of SNPs is small compared to all SNPs identified through exome sequencing. Also, all SNPs on the standard genotyping chips are also in dbSNP and the overlapping portion is not a random sample from all SNPs identified by exome sequencing. The exact SNP selection criteria for the Affymetrix 6.0 SNP chip are unknown to us. We speculate that criteria such as GC content, proximity to other SNPs and the ease of enzyme ligation should be considered; these factors also impact the sequencing quality. We observed a better quality for genotyping chip SNPs in sequencing, raising concerns about the effectiveness of using genotyping chip consistency rate as a quality control for SNPs identified by sequencing. We believe that the genotyping chip consistency can still be used as a quality control; however, the use of other quality control parameters, such as the Ti/Tv ratio and the number of novel non-synonymous SNPs in conjunction, is essential for obtaining an accurate description of the SNPs identified by sequencing.
We found that strand bias does not consistently occur at the same genomic sites across different samples. By comparing strand bias using different post-analysis pipelines, we found some evidence to support the hypothesis that post-analysis procedures can cause strand bias, especially for the processing pipeline that applies both local realignment and BAQ. Such processing pipelines can introduce more SNPs with higher strand bias, which in turn results in more false-positive SNPs. Use of local realignment and BAQ in the same processing pipeline should be avoided. The correlation of allele frequency scores between BWA and Bowtie were very high, indicating that strand bias is not likely due (although not completely ruled out) to the artifacts of alignment, but is more likely an artifact or due to errors from the library preparation or sequencing. A portion of the strand bias can also be caused by sampling variation during the sequencing.
For the three strand bias scores we have studied, SB, GATK-SB, and Fisher scores, we evaluated their effectiveness in capturing true false-positive SNPs. Based on our results, the Fisher and SB scores can capture true false-positive SNPs better than the GATK-SB score. By comparing exome sequencing data with SNP chip data, the SB and Fisher scores indicated slight drops in heterozygous consistency when strand bias scores were over the 80th percentile. However, the magnitude of the consistency rate drop is minor. From the other genotype quality control parameters, such as Ti/Tv ratio and number of novel non-synonymous SNPs, we observed an overall better quality for SNPs with low or no strand bias than for SNPs with extreme strand bias. Based on our findings, strand bias can negatively affect the genotyping quality of sequencing data. We recommend caution when applying strand bias as a filter. Only SNPs with extreme strand bias should be regarded as false-positive candidates. We considered SNPs with strand bias score in the top 10% as extreme. VarScan , a variant calling tool also uses the top 10% rather than a fixed numerical score as a strand bias filter. Indiscriminant use of strand bias as a filter will result in a large loss of true positive SNPs.
This work was partly supported by NIH grants R01HG004517 (JH, CL) and R01CA124558 and R01CA62477 (JL, JH, WZ, CL). Patient recruitment and sample collection was supported by R01CA64277. Sample preparation was conducted at the Survey and Biospecimen Core, which is supported in part by the Vanderbilt-Ingram Cancer Center (P30 CA68485). We also would like to thank Peggy Schuyler and Margot Bjoring for their editorial support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.