Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data
© Duitama et al.; licensee BioMed Central Ltd. 2012
Published: 12 April 2012
Massively parallel transcriptome sequencing (RNA-Seq) is becoming the method of choice for studying functional effects of genetic variability and establishing causal relationships between genetic variants and disease. However, RNA-Seq poses new technical and computational challenges compared to genome sequencing. In particular, mapping transcriptome reads onto the genome is more challenging than mapping genomic reads due to splicing. Furthermore, detection and genotyping of single nucleotide variants (SNVs) requires statistical models that are robust to variability in read coverage due to unequal transcript expression levels.
In this paper we present a strategy to more reliably map transcriptome reads by taking advantage of the availability of both the genome reference sequence and transcript databases such as CCDS. We also present a novel Bayesian model for SNV discovery and genotyping based on quality scores.
Experimental results on RNA-Seq data generated from blood cell tissue of three Hapmap individuals show that our methods yield increased accuracy compared to several widely used methods. The open source code implementing our methods, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/NGSTools/.
Recent advances in sequencing technologies have enabled the completion of a growing number of individual genomes, including several cancer genomes (see  for a recent review). While whole-genome sequencing provides a near-complete catalog of variants and individual genotypes, sequencing of mRNA transcripts (RNA-Seq) is becoming the method of choice for studying functional implications of genetic variability [2–8]. In particular, RNA sequencing is an important source of information for studying the effect of genetic variation on transcription regulation and establishing causal relationships between mutations and disease. For cancer research, comparison of RNA-Seq data generated from normal tissue and tumor samples can provide the information needed to discover driver mutations or to find new therapy targets .
Analysis of RNA-Seq data poses several challenging computational problems . First, eukaryotic mRNA transcripts are typically the result of splicing, whereby non-coding regions called introns are removed from the pre-mRNA molecule. This makes the use of tools for mapping of DNA reads to the reference genome like Maq  or Bowtie  not suitable for finding the genomic location of reads spanning splicing sites. Several methods based on spliced alignment have been proposed to identify splicing sites and assemble full transcripts [7, 13–17], however these methods incur a high computational cost and require very high sequencing depth, typically with paired reads. Even when accurate read mapping is achieved, differences in transcription levels result in unequal sequencing depths of different transcripts, making it difficult to identify variants in regions transcribed at low levels. Although it is possible to overcome this difficulty by sequencing both genomic DNA and mRNA and identifying variants from the genomic DNA reads using standard methods, when the interest is in expressed variants it is significantly more cost effective to identify them directly from mRNA reads .
Our main contributions are an efficient strategy for accurate mapping of mRNA reads and a new method for single nucleotide variant (SNV) detection and genotyping. Note that we use the term SNV instead of the better known term SNP (Single Nucleotide Polymorphism) because SNPs are normally defined relative to a population and imply a minimum minor allele frequency whereas we are interested in finding and genotyping in an individual all sequence variants that do not match the reference genome sequence, regardless of their frequency in the population. To improve the success rate and accuracy of read mapping, we map mRNA reads against both the reference genome and a transcript library such as the consensus coding sequences (CCDS) database  and then combine mapping results using a simple rule set. Our method for SNV detection and genotyping is based on computing, for each locus, conditional probabilities for each of the ten possible genotypes given the reads, and then choosing the genotype with highest posterior probability using Bayes' rule. The underlying probabilistic model assumes independence among reads and fully exploits the information provided by base quality scores. Unlike other widely used Bayesian methods [11, 20], we consider all four possible alleles at each position, and do not apply a separate test of heterozygosity.
We validated our methods on publicly available Illumina RNA-Seq datasets generated from blood cell tissue of three individuals extensively genotyped as part of the international Hapmap project . The results indicate that the combined mapping strategy yields improved genotype calling accuracy compared to performing genome or CCDS mapping alone and that our SNV detection and genotyping method is more sensitive than existing methods at equal levels of specificity. We also assess the effect on sensitivity and specificity of commonly used data curation strategies such as read trimming, filtering of read copies to correct for variable transcription levels and PCR artifacts, and imposing minimum allele coverage thresholds.
Mapping strategy for mRNA reads
Mapping mRNA reads against the reference genome using standard mapping programs such as Bowtie  or Maq  does not require gene annotations but leaves reads spanning exon junctions unmapped. Spliced alignment methods such as  could theoretically overcome this difficulty but in practice they are computationally intensive and not well suited for very short reads. On the other hand, mapping against a reference transcript library like the Consensus Coding Sequences Database (CCDS)  recovers reads spanning known splicing junctions but fails to recover reads coming from unannotated genes.
Decision table for merging of read alignments
SNV detection and genotyping
Our new Bayesian method, named SNVQ, computes for each locus the posterior probability of each of the ten possible genotypes given the reads. For a locus i we let R i denote the set of mapped reads spanning this locus. In all Bayesian methods, the posterior probability of a genotype is calculated from its prior and conditional probabilities by using the Bayes rule, . The main difference between models lies in the way conditional probabilities are calculated . Both Maq and SOAPsnp use a different model to calculate probabilities of homozygous and heterozygous genotypes. Maq uses a binomial distribution on the alleles having the two highest counts while SOAPsnp uses a rank test to determine heterozygosity. SOAPsnp also assumes as prior information that the homozygous reference genotype is the most likely one and calculates conditional probabilities based on Illumina specific knowledge about the reads . We decided instead to use a uniform set of assumptions for calculating conditional probabilities of all genotypes. Assuming independence between reads, the conditional probability of genotype G i can be expressed as a product of read contributions, i.e., . For a mapped read r ∈ R i let r(i) be the base spanning locus i and ε r(i) be the probability of error sequencing the base r(i), which we estimated from the quality score q(i) calculated during primary analysis using the Phred formula ε r(i) = 10-q(i)/10. We discarded allele calls with quality scores zero and one. Let H i and be the two real alleles at locus i, or in other words, let . The observed base r(i) could be read from either H i or . If there is an error in this read, we assume that the error can produce any of the other three possible bases with the same probability. Thus, the probability of observing a base r(i) given than the real base is different is ε r(i) /3 while the probability of observing r(i) without error is 1 - ε r(i) .
Finally, a variant is called if the genotype with highest posterior probability is different than homozygous reference. In the next section we show a comparison of results among these methods by reanalyzing a publicly available dataset.
Software and performance issues
We implemented mapped read merging strategies and SNVQ in Java 1.6 and we packed both programs with a few additional utilities in a single jar file. The open source code, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/NGSTools/.
In order to enable integration with other analysis tools we use the SAM format  for both the input and the output of mapped read merging. We also sort alignments by chromosome and absolute position to enable efficient processing in subsequent modules and fast merging of results from different lanes if available. SAM files produced by the merging module can be used directly as input for the SAMtools package  to produce run statistics, pileup information, and for variants detection. We recommend to run the merging process lane by lane because it needs to load all unique alignments in memory in order to sort them at the end of the process. We used space efficient data structures that allow us to process more than ten million reads in a few minutes, using up to 16 Gb of memory. The code implementing SNVQ is able to receive as input either alignments in SAM format or pileup information in the format described in the SAMtools package. The pileup format is recommended because it enables faster processing and reduces the memory requirements. Our experiments indicate that SNVQ is able to process a whole transcriptome pileup file in about 20 minutes using a single processor and up to 4 Gb of memory.
Results and discussion
Sequencing and read mapping (using hard merge) statistics for the three RNA-Seq datasets used in our experiments
Genotype calling accuracy was assessed using as gold standard Hapmap SNP genotype calls for the three individuals. To measure accuracy of genotype calling, we defined as true positive a correctly called heterozygous or homozygous non reference SNP and as false positive an incorrectly called homozygous SNP. We did not consider as error a heterozygous SNP called homozygous or not called because this can be due to lack of read coverage for one or both alleles. We consider that one method is more accurate than another when it is able to detect more true positives for the same number of false positives, or conversely if it detects the same number of true positives with fewer false positives. When assessing SNV detection accuracy, we define as true positive a detected heterozygous or homozygous non reference SNP, no matter which is the actual genotype call, and as false positive a homozygous reference SNP marked as having a variant. Thus, calling as heterozygous a homozygous not-reference SNP is considered a true positive for SNV detection, because the variant was detected, but a false positive for genotype calling because an inexistent reference allele is being called.
Comparison of read mapping strategies
Mapping statistics for the NA12878 dataset (million reads)
Comparison of SNV calling and genotyping methods
Comparison of strategies for data curation
In practice SNV detection is the problem of separating allele calls that are different from the reference because of sequencing errors from calls that are different from the reference because they were sampled from a variant locus. With the current sequencing error rates, if sequencing errors were randomly distributed, it is not difficult to show that any of the discussed methods would have high accuracy. Unfortunately, each sequencing technology has different error patterns which can break the randomness assumption. In this section we describe three systematic error patterns characteristic to Illumina sequencing and assess how common ways to solve these issues performed in our testing data.
Another common source of false positive results are PCR amplification artifacts that produce a large number of copies of the same read . One usual way to deal with this issue is to allow just one read to start at each possible locus. This filter eliminates artificial high coverage at every locus, which can be beneficial not only for discarding reads generated by PCR artifacts but also for normalizing biases produced by variable transcription levels. The main drawback of this strategy is that it can throw away useful read data, thus affecting sensitivity. An intermediate approach consists on allowing a small number x of different reads per start locus as described in . Figure 5 shows that selecting just one read per start locus is indeed too restrictive for the NA12878 dataset but the three reads filter of  did not affect sensitivity and even improved it for stringent specificity requirements. Although the improvement is not as consistent as the one obtained by trimming aligned reads, we consider that this filter may be generally useful for correcting coverage biases without loosing sensitivity.
Effect of coverage on genotyping accuracy
Genotype calling from heterogeneous samples
Second generation sequencing of mRNA is becoming the method of choice to investigate the behavior of human cells and to reveal the functional effects of variation. In this paper we introduced a mapping strategy for mRNA reads which fully utilizes the information contained in both the reference genome sequence and reference databases of known transcripts such as CCDS. We also presented a Bayesian model for SNV detection and genotyping called SNVQ that seeks to improve genotype calls by fully exploiting the information contained in base quality scores. Finally, we performed a comparison among commonly used mapping, SNV detection and genotyping methods, and data curation strategies with the aim to select the most effective methods to identify expressed single nucleotide variants from RNA-Seq data, taking advantage of the availability of RNA-Seq data for Hapmap individuals with well characterized genotypes. Our experiments indicate that the double reference mapping and merging strategy yields improved SNV calling and genotyping accuracy compared with methods based on mapping to a single reference. The experiments further suggest that SNVQ achieves improved accuracy over existing methods, and retains its power to detect variants with a high specificity even from heterogeneous RNA-Seq samples.
In future work we seek to integrate our tools with methods for estimating isoform expression levels  and to extend our model by incorporating allele specific expression of isoforms . We also plan to integrate additional transcript annotation sources such as dbEST and UCSC, and to integrate our methods in a bioinformatics pipeline enabling personalized cancer immunotherapy based on tumor transcriptome sequencing.
This work has been supported in part by NSF awards IIS-0546457 and IIS-0916948 and NIFA award 2011-67016-30331.
This article has been published as part of BMC Genomics Volume 13 Supplement 2, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S2.
- Snyder M, Du J, Gerstein M: Personal genome sequencing: current approaches and challenges. Genes & Development. 2010, 24: 423-431. 10.1101/gad.1864110. [http://genesdev.cshlp.org/content/24/5/423.full]View ArticleGoogle Scholar
- Cloonan N, Forrest A, Kolle G, Gardiner B, Faulkner G, et al: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods. 2008, 5 (7): 613-619. 10.1038/nmeth.1223.View ArticlePubMedGoogle Scholar
- Marguerat S, Bähler J: RNA-seq: from technology to biology. Cellular and Molecular Life Sciences. 2009, 67 (4): 569-579.PubMed CentralView ArticlePubMedGoogle Scholar
- Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, McDonald H, Varhol R, Jones S, Marra M: Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques. 2008, 45: 81-94. 10.2144/000112900.View ArticlePubMedGoogle Scholar
- Sultan M, et al: A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome. Science. 2008, 321 (5891): 956-960. 10.1126/science.1160342.View ArticlePubMedGoogle Scholar
- Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, et al: mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods. 2009, 6 (5): 377-382. 10.1038/nmeth.1315.View ArticlePubMedGoogle Scholar
- Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010, 28 (5): 511-515. 10.1038/nbt.1621.PubMed CentralView ArticlePubMedGoogle Scholar
- Tuch B, Laborde R, Xu X, Gu J, Chung C, et al: Tumor Transcriptome Sequencing Reveals Allelic Expression Imbalances Associated with Copy Number Alterations. Plos One. 2010, 5 (2): e9317-10.1371/journal.pone.0009317.PubMed CentralView ArticlePubMedGoogle Scholar
- Maher C, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan A: Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009, 458 (5): 97-101.PubMed CentralView ArticlePubMedGoogle Scholar
- Costa V, Angelini C, DeFeis I, Ciccodicola A: Uncovering the complexity of transcriptomes with RNA-Seq. Journal of Biomedicine and Biotechnology. 2010, 2010: 853916-PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 2008, 18: 1851-1858. 10.1101/gr.078212.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- De Bona F, Ossowski S, Schneeberger K, Rätsch G: Optimal spliced alignments of short sequence reads. Bioinformatics. 2008, 24 (16): i174-i180. 10.1093/bioinformatics/btn300.View ArticlePubMedGoogle Scholar
- Guttman M, Garber M, Levin J, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol M, Gnirke A, Nusbaum C, Rinn J, Lander E, Regev A: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology. 2010, 28 (5): 503-510. 10.1038/nbt.1633.PubMed CentralView ArticlePubMedGoogle Scholar
- Marioni J, Mason C, Mane S, Stephens M, Gilad Y: RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research. 2008, 18: 1509-1517. 10.1101/gr.079558.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods. 2008, 5: 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120.PubMed CentralView ArticlePubMedGoogle Scholar
- Cirulli E, Singh A, Shianna K, Ge D, Smith J, Maia J, Heinzen EL, Goedert J, Goldstein D, et al: Screening the human exome: a comparison of whole genome and whole transcriptome sequencing. Genome Biology. 2010, 11 (5): R57-10.1186/gb-2010-11-5-r57.PubMed CentralView ArticlePubMedGoogle Scholar
- Pruitt K, Harrow J, Harte R, et al: The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Research. 2009, 19: 1316-1323. 10.1101/gr.080531.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, Wang J: SNP detection for massively parallel whole-genome resequencing. Genome Research. 2009, 19: 1124-1132. 10.1101/gr.088013.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Consortium TIH: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449 (18): 851-861.Google Scholar
- Dalca AV, Brudno M: Genome variation discovery with high-throughput sequencing data. Briefings in Bioinformatics. 2010, 11: 3-14. 10.1093/bib/bbp058.View ArticlePubMedGoogle Scholar
- Ewing B, Green P: Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Research. 1998, 8: 186-194.View ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, et al: The UCSC Genome Browser database: update 2010. Nucleic Acids Res. 2010, 38 (Database issue): D613-D619.PubMed CentralView ArticlePubMedGoogle Scholar
- Chepelev I, Wei G, Tang Q, Zhao K: Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq. Nucleic Acids Research. 2009, 37 (16): e106-10.1093/nar/gkp507.PubMed CentralView ArticlePubMedGoogle Scholar
- Bentley D, et al: Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry. Nature. 2008, 456: 53-59. 10.1038/nature07517.PubMed CentralView ArticlePubMedGoogle Scholar
- Kozarewa I, Ning Z, Quail M, Sanders M, Berriman M, Turner D: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nature Methods. 2009, 6 (4): 291-295. 10.1038/nmeth.1311.PubMed CentralView ArticlePubMedGoogle Scholar
- Srivastava N, Srivastava P: Modeling the Repertoire of True Tumor-Specific MHC I Epitopes in a Human Tumor. PLoS ONE. 2009, 4 (7): e6094-10.1371/journal.pone.0006094.PubMed CentralView ArticlePubMedGoogle Scholar
- Nicolae M, Mangul S, Mandoiu I, Zelikovsky A: Estimation of alternative splicing isoform frequencies from RNA-Seq data. Proc 10th Workshop on Algorithms in Bioinformatics. Edited by: Singh M, Moulton V. 2010, Lecture Notes in Computer Science, 202-214.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.