The discrepancy among single nucleotide variants detected by DNA and RNA high throughput sequencing data
BMC Genomics volume 18, Article number: 690 (2017)
High throughput sequencing technology enables the both the human genome and transcriptome to be screened at the single nucleotide resolution. Tools have been developed to infer single nucleotide variants (SNVs) from both DNA and RNA sequencing data. To evaluate how much difference can be expected between DNA and RNA sequencing data, and among tissue sources, we designed a study to examine the single nucleotide difference among five sources of high throughput sequencing data generated from the same individual, including exome sequencing from blood, tumor and adjacent normal tissue, and RNAseq from tumor and adjacent normal tissue.
Through careful quality control and analysis of the SNVs, we found little difference between DNA-DNA pairs (1%–2%). However, between DNA-RNA pairs, SNV differences ranged anywhere from 10% to 20%.
Only a small portion of these differences can be explained by RNA editing. Instead, the majority of the DNA-RNA differences should be attributed to technical errors from sequencing and post-processing of RNAseq data. Our analysis results suggest that SNV detection using RNAseq is subject to high false positive rates.
Single nucleotide variants (SNVs) are often measured in human specimens to correlate with other phenotypic variables. In general, there are two major classes of SNVs: germline mutations, which are inherited with one allele from each parent (also known as germline), and somatic mutations which are acquired at late stage of life. Germline mutations are usually used to assess the risk of developing certain diseases. Somatic mutations are often associated with tumorigenesis. Both germline and somatic mutations have been studied extensively in biomedical research. Single nucleotide polymorphisms (SNPs) describe germline mutations at population level.
The detection of SNVs can be achieved through a variety of methods, including real time polymerase chain reaction (RT-PCR), genotyping array, Sanger sequencing, and high throughput sequencing. All of these methods use genomic DNA as the input source. For example, genome wide association studies (GWAS) typically use DNA extracted from blood to infer SNPs due to the easy collection and storage of blood. Somatic mutations occur in tumor tissues and are usually identified by comparing the DNA sequences of tumor tissue to blood or adjacent normal tissue. One of the most basic assumptions of human DNA is that in the absence of somatic mutations, every cell in the body is essentially identical. A study in 2009 by Gottlieb et al. challenged this conventional paradigm by identifying three SNPs in tissues that were not present in blood . This finding received great attention, while simultaneously receiving criticism for the inaccuracy of the analyses .
High throughput sequencing technology enables investigators to screen for SNVs in the entire genome or exome at a reasonable cost. Over the same time period, the development of RNAseq technology has replaced microarrays as the primary tool for gene expression profiling [3,4,5,6]. Unlike microarrays, RNAseq is based on high throughput sequencing technology, and thus investigators can now also examine RNA genomic sequences at a single nucleotide resolution. RNAseq technology introduces an opportunity to compare the genomic sequences of DNA and RNA at an unprecedented large scale.
RNAseq data is often thought to be a less-than-ideal source for SNV detection due to higher false positive rates . The higher false positive rates can be attributed to several reasons, including higher complexity in alignment due to the RNA splicing , random errors introduced during reverse transcription, PCR  and RNA editing . Numerous attempts have been made to overcome these difficulties [8, 10, 11] with only moderate success.
The differences between DNA and RNA sequences have been previously documented. For example, Li et al.  reported that they observed widespread differences between the RNA and DNA sequences of the same human cells. Since its publication, three other independent follow-up studies [13,14,15] challenged the conclusion by Li et al., arguing that the differences found by Li et al. are attributed to alignment artifacts, RNA editing, etc. To address this controversy, a more in-depth analysis of sequence data is required to discern the true differences between RNA and DNA sequences.
To date, there is no clear consensus on how much genomic difference we should expect to see between blood and tissue, and between DNA and RNA of the same subject. Answering these questions can greatly contribute to the accuracy of SNV and somatic mutation identification from multiple sources.
To fully understand the single nucleotide differences between blood-tissue and DNA-RNA pairs, we conducted a thorough study that compared the nucleotide sequences between each sample-sequencing pair type, using a unique set of sequencing data from TCGA. From TCGA, we obtained sequencing data of 50 samples from 10 breast cancer patients. Each patient had five samples collected and sequenced: 1) DNA exome sequencing on blood (DNA-NB), 2) DNA exome sequencing on tumor primary tissue (DNA-TP), 3) DNA exome sequencing on adjacent normal tissue (DNA-NT), 4) RNAseq on tumor primary tissue (RNA-TP), 5) RNAseq on adjacent normal tissue (RNA-NT). Data for all samples were downloaded from the Cancer Genomics Hub in aligned BAM format . Since the DNAseq data and RNAseq data were processed by different facilities, to ensure data integrity, we converted all BAM files to raw FASTQ formats and performed alignment against the human genome reference (HG19) using BWA  for DNAseq data and TopHat 2  for RNAseq data. The alignment statistics can be found in Table 1. Next, we marked duplicates using Picard , then performed local realignment and local recalibration using the Genome Analysis Toolkit (GATK)  developed by the Broad Institute.
Genotypes were inferred by HaplotypeCaller from GATK. GATK best practice filters were used to filter out potential false positive SNPs. For the five sample-sequencing types—DNA-NB, DNA-NT, DNA-TP, RNA-NT, and RNA-TP—there are 10 possible pairs. For each pair of samples, we computed the heterozygous genotype consistency as follows: the number of consistent heterozygous genotypes between sample A and sample B divided by the total number of heterozygous genotypes in sample A or in sample B. We only considered genomic positions covered with at least 10 reads in both samples of the pair (denoted as “callable” sites). Also, we focused our study on haploid genomes (chromosome 1–22). Chromosome X, Y and mitochondrial DNA were not considered in this study.
The identification of an alternative allele at a certain genomic location is highly dependent on the depth of coverage. For a heterozygous position, the reads that support the alternative allele should ideally follow a binomial distribution,Binomial(D, 0.5). Thus, we expect to observe an alternative allele at 50% allele frequency. The probability of observing an alternative allele increases as the depth increases (Fig. 1a). As seen from this figure, setting the depth threshold at 10 allows a higher probability to observe an alternative allele. However, the distribution of the alternative-allele frequency among the reads produced by a sequencing dataset usually follows a normal distribution (Fig. 1b). Thus, there are some genotypes with extremely high or low allele frequencies that deviate from 50%. Also, during sequencing alignment, reference preferential biases can also skew the distribution of allele frequencies by 2 to 5% toward the reference allele . Reference preferential bias is a type of alignment artifact, since most aligners will penalize the alignment of a read based on the number of mismatches within that read. A true SNP is counted as a mismatch during alignment, thus the aligner preferentially prefers reads with no mismatches in the alignment and slightly undercounts the reads containing alternate alleles. Furthermore, four of our five sample types are extracted from tissues in or around the tumor with acquired somatic mutation, which may contain some tumor cells carrying somatic mutaitons. For these reasons, the percentage of mutated alleles at a genomic location does not strictly follow the Binomial(D, 0.5) distribution. It is possible that only a small percentage of reads support mutated alleles. To take this into consideration, we also computed a loose genotype consistency between a pair of samples. The loose genotype consistency is computed in the same way as described above, with the exception that consistent heterozygous genotypes between samples A and B are defined as the genotypes that are consistent if they have the same alternative allele supported by at least one read that passed the quality filter (base quality >20). Thus, the actual genotype call by the HaplotypeCaller is irrelevant in this calculation.
We studied the pattern of potential RNA editing by examining the flanking sequences of the different RNA-DNA sites and the frequency of the RNA-DNA nucleotide change types. Motif analysis was carried out using HOMER ; cluster analysis was performed using Heatmap3 . Additional annotation on the different RNA-DNA sites was done using previously reported editing sites, as described in the databases RADAR  and DARNED .
The heterozygous consistency analysis results showed high consistency rates (0.96–0.99) (Fig. 2, Table 2) between sequencing data pairs of DNA samples from all three tissue sources. After RNA data was introduced into the pairings, the heterozygous consistency rate dropped substantially (0.79–0.90). As expected, the loose heterozygous consistency is higher in comparison to the regular heterozygous consistency, achieving a range of 0.97–0.99 for DNA-DNA pairs, and a range of 0.82–0.91 for DNA-RNA pairs. Due to the occurrence of errors and noise during library preparation, sequencing, and/or alignment, there will always be nucleotide differences even between DNA sequencing data of technically-replicated samples. The minor differences observed between DNA-NB and DNA-NT can also be contributed to tumor contamination in the adjacent normal tissue. The large differences observed between the DNA-RNA pairs confirm previous findings that large amounts (10–21%) of callable nucleotides are observed differently in DNA as compared to RNA sequencing data. We demonstrate the DNA-RNA difference using one example by Integrative Genomics Viewer (Additional file 1).
We performed dinucleotide distribution analysis. First, we computed the global dinucleotide frequencies for the human genome (Fig. 3). The most preferred dinucleotide is TT (9.78%) and AA (9.75%), and the least preferred dinucleotide is CG (1.00%), followed by GC (4.29%). For all of the SNV differences that we observed between any sample-sequencing type of the same subject, we extracted the up- and down-stream dinucleotides of the site, then we normalized them to the human genome background dinucleotide frequencies. Clean patterns emerged when we used the normalized dinucleotide frequencies in cluster analysis. For the two nucleotides upstream and downstream of the discordant genotype sites, two major clusters were formed by sample-sequence types: a smaller cluster containing two pairs involving only DNA samples, and a larger cluster containing eight pairs, in which seven involve RNA samples (Fig. 4). The most preferred dinucleotides both up- and down-stream were GG (upstream: 12%, downstream: 14%), followed by CC (upstream: 10%, downstream: 11%). These findings suggest GC content plays a role in the mismatched genotypes. The least preferred dinucleotide for both upstream and downstream was AT (upstream: 3.1%, downstream: 2.9%). For the upstream two nucleotides, pairs with RNAseq data had higher frequencies for CC (t test p = 1.03E-5) and CG (t test p = 0.001), while pairs with only DNA sequencing data had higher frequencies in AA (t test p = 0.0008) and TT (t test p = 0.002). For the downstream two nucleotides, pairs with RNAseq data had higher frequencies for GG (t test p = 7.56 × 10−6) and CG (t test p = 1.41 × 10−5), while pairs with only DNA sequencing data had higher frequencies in AA (t test p = 2.25 × 10−8) and TA (t test p = 2.52 × 10−8). The detailed allele frequencies and analysis are presented in Tables 3 and 4. Next, we examined whether or not a pattern could be observed in the DNA-RNA difference. We took 10 nucleotides up and downstream of the DNA-RNA difference sites and identified no significant motif using Homer .
We also performed cluster analysis using the allele change frequencies between all possible pairs of the sample-sequencing types. With the four possible nucleotides, there are six possible changes (A-C, A-T, A-G, C-T, C-G, and G-T). Similar to the cluster analysis using the frequency of two upstream and downstream nucleotides, cluster analysis showed that pairs with only DNA sequencing data form one cluster, while pairs with RNAseq data form another cluster (Fig. 5). Transition changes were clearly more preferred than transversion changes (t test p = 6.04E-14). The average Ti/Tv ratio of the DNA-RNA difference sites was 1.98 (range: 1.71–2.86). The Ti/Tv ratio has been shown to be strongly related to genomic region and often serves as a quality control measurement [26,27,28]. Our Ti/Tv ratio result suggested that the DNA-RNA differences were not random. The detailed change frequencies and analysis are presented in Table 5.
Lastly, we categorized the differences between DNA and RNA. The overall DNA-RNA differences per DNA-RNA pair category can be view in Fig. 6a. There are thousands of differences per category, which agrees with previous finding from Li et al. . Out of all of these DNA-RNA differences, there were a total of 41,529 unique sites. Only a small portion of these sites, 877, have been documented in existing RNA editing databases; and 14,876 sites are recorded in dbSNP (Fig. 6b). Because we required both samples in the DNA-RNA comparisons to have a depth of 10 or higher, the majority of the differences are located in exonic regions (Fig. 6c, Table 6). Of these exonic differences, a majority of them (61.3%) are nonsynonymous (Fig. 6d, Table 7).
It has been proposed that single nucleotide variants, such as SNPs and somatic mutations, can be detected using RNAseq data [8, 29]. At the same time, strong evidence demonstrates large differences of detected nucleotides between DNA and RNA sequencing data [7, 12]. To further examine the genotype differences inferred from high throughput sequencing data between DNA and RNA, and among various sample sources, we designed a study to compare the genotypes obtained from five types of high throughput sequencing data that were generated from ten individuals. With thorough analysis, we observed large differences between the genotypes inferred from DNA and RNA sequencing data, which agrees with Li et al.’s findings . However, Li et al. asserts these observed differences are the true differences between DNA and RNA, not accounting for differences introduced by technical errors. The study was conducted using TCGA data. Since we do not have access to the original samples, we could not perform at qPCR validation of the DNA-RNA differences.
DNA-RNA differences can be attributed to two categorical factors: biological and technical. The biological factors can be summarized as RNA editing and polyadenylation, which are a part of the natural biological process. RNA editing is the process that results in RNA nucleotide sequences that differ from the DNA template. Polyadenylation is the addition of a Poly(A) tail to the 3′ end of the mRNA during the transcription of DNA to RNA. Technical factors include reverse transcription errors, sequencing errors, and alignment errors, which are technical difficulties that we have yet to overcome. Reverse transcription errors occur during the reverse transcription from RNA to cDNA—a mandatory step for RNAseq. Sequencing errors can result from the high throughput sequencing technology, as all types of high throughput sequencing technologies have known limits and advantages. For example, Illumina’s high throughput sequencing technology is known to be sensitive to GC content [30, 31], while 454 Life Sciences’ sequencing technology produces low quality reads with long Poly (A) and (T) tracts. Alignment errors often occur while finding the best genomic locations for a read. The current alignment algorithm is largely based on the Borrows Wheeler Transformation, an algorithm that is used in computer science to compress repeated strings that contain repeated characters. Even though alignment can happen at a global level, the human genome is too complicated and contains a vast number of homologous regions, and, alignment of RNA reads to a DNA reference sequence requires that splicing of the gene exons be taken into account. All of these factors can substantially convolute the RNA alignment process and introduce potential alignment errors. Since 2001, there have been 20 reference human genomes released. Substantial improvements have been made with each new release providing more precise descriptions of the transcriptome, which in turn increases the accuracy of alignment of RNA reads. It is possible that with further advancements to the reference human genome, we will observe fewer DNA-RNA differences.
In our analysis, we also observed differences between adjacent normal tissue and blood in DNA, which are both considered to be germline. Some of these differences can be explained by tumor contamination of the adjacent normal tissue, and/or technical errors. Our results support Gottlieb’s finding that there are potential SNP differences between normal tissue and blood .
In conclusion, based on our analysis results, there are large differences (10%) between genotypes inferred from DNA and RNA sequencing data of the same individual. At the present time, it is difficult to assess what portion of these differences are due to biological processes and what portion of the differences are the result of technical errors. When RNAseq data is used to infer SNPs or somatic mutations, the DNA-RNA difference will result in large amounts of false positives , thus making RNAseq data a less than ideal source for detecting SNVs.
Gottlieb B, Chalifour LE, Mitmaker B, Sheiner N, Obrand D, Abraham C, Meilleur M, Sugahara T, Bkaily G, Schweitzer M. BAK1 gene variation and abdominal aortic aneurysms. Hum Mutat. 2009;30(7):1043–7.
Kury S, Airaud F, Piloquet P, Bezieau S. BAK1 gene variation and abdominal aortic aneurysms--results may have been prematurely overrated. Hum Mutat. 2010;31(10):1174–6. author reply 1177-1178
Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63.
Shendure J. The beginning of the end for microarrays? Nat Methods. 2008;5(7):585–7.
Guo Y, Sheng Q, Li J, Ye F, Samuels DC, Shyr Y. Large scale comparison of gene expression levels by microarrays and RNAseq using TCGA data. PLoS One. 2013;8(8):e71462.
Guo Y, Li CI, Ye F, Shyr Y: Evaluation of read count based RNAseq analysis methods. BMC Genomics 2013, 14 Suppl 8:S2.
Sheng QH, Zhao SL, Li CI, Shyr Y, Guo Y. Practicability of detecting somatic point mutation from RNA high throughput sequencing data. Genomics. 2016;107(5):163–9.
Piskol R, Ramaswami G, Li JB. Reliable identification of genomic variants from RNA-seq data. Am J Hum Genet. 2013;93(4):641–51.
Han L, Vickers KC, Samuels DC, Guo Y. Alternative applications for distinct RNA sequencing strategies. Brief Bioinform. 2014;
Miller AC, Obholzer ND, Shah AN, Megason SG, Moens CB. RNA-seq-based mapping and candidate identification of mutations from forward genetic screens. Genome Res. 2013;23(4):679–86.
Zhang P, Samuels DC, Lehmann B, Stricker T, Pietenpol J, Shyr Y, Guo Y. Mitochondria sequence mapping strategies and practicability of mitochondria variant detection from exome and RNA sequencing data. Brief Bioinform. 2015;
Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG. Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011;333(6038):53–8.
Kleinman CL, Majewski J. Comment on "widespread RNA and DNA sequence differences in the human transcriptome". Science. 2012;335(6074):1302. author reply 1302
Lin W, Piskol R, Tan MH, Li JB. Comment on "widespread RNA and DNA sequence differences in the human transcriptome". Science. 2012;335(6074):1302. author reply 1302
Pickrell JK, Gilad Y, Pritchard JK. Comment on "widespread RNA and DNA sequence differences in the human transcriptome". Science. 2012;335(6074):1302. author reply 1302
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. Genome project data processing S: the sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
Li H, Durbin R. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics. 2009;25(14):1754–60.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
Daniels J, Holmans P, Williams N, Turic D, McGuffin P, Plomin R, Owen MJ. A simple method for analyzing microsatellite allele image patterns generated from DNA pools and its application to allelic association studies. Am J Hum Genet. 1998;62(5):1189–97.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M et al: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011, 43(5):491−+.
Guo Y, Samuels DC, Li J, Clark T, Li CI, Shyr Y. Evaluation of allele frequency estimation using pooled sequencing data simulation. ScientificWorldJournal. 2013;2013:895496.
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89.
Zhao S, Guo Y, Sheng Q, Shyr Y. Advanced heat map and clustering analysis using heatmap3. Biomed Res Int. 2014;2014:986048.
Ramaswami G, Li JB. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res. 2013;
Kiran A, Baranov PV. DARNED: a DAtabase of RNa EDiting in humans. Bioinformatics. 2010;26(14):1772–6.
Wang J, Raskin L, Samuels DC, Shyr Y, Guo Y. Genome measures used for quality control are dependent on gene function and ancestry. Bioinformatics. 2015;31(3):318–23.
Guo Y, Long J, He J, Li CI, Cai Q, Shu XO, Zheng W, Li C. Exome sequencing generates high quality data in non-target regions. BMC Genomics. 2012;13:194.
Guo Y, Ye F, Sheng Q, Clark T, Samuels DC. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform. 2014;15(6):879–89.
Quinn EM, Cormican P, Kenny EM, Hill M, Anney R, Gill M, Corvin AP, Morris DW. Development of strategies for SNP detection in RNA-seq data: application to lymphoblastoid cell lines and evaluation using 1000 genomes data. PLoS One. 2013;8(3):e58815.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53–9.
Guo Y, Zhao S, Lehmann BD, Sheng Q, Shaver TM, Stricker TP, Pietenpol JA, Shyr Y. Detection of internal exon deletion with exon del. Bmc Bioinformatics. 2014;15:332.
We would also like to thank Stephanie Page Hoskins for editorial support.
The study was supported by P30 CA68485. The publication fee was paid by funding from Center for Quantitative Sciences, Vanderbilt University.
Availability of data and materials
All data used in this study were downloaded from TCGA.
About this supplement
This article has been published as part of BMC Genomics Volume 18 Supplement 6, 2017: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2016: genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-18-supplement-6).
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Integrative Genomics Viewers screenshot of position chromosome 1:12,520,386. Alignment results for all five samples for patient A7-A0D9 are displayed. Top three are DNA samples, and bottom two are RNA samples. The reference is C. Both RNA samples detected alternative allele G, two DNA samples did not detect alternative allele G. (PNG 211 kb)
About this article
Cite this article
Guo, Y., Zhao, S., Sheng, Q. et al. The discrepancy among single nucleotide variants detected by DNA and RNA high throughput sequencing data. BMC Genomics 18 (Suppl 6), 690 (2017). https://doi.org/10.1186/s12864-017-4022-x
- DNA-RNA difference
- RNA editing
- Single nucleotide variant