Comparing reference-based RNA-Seq mapping methods for non-human primate data
© Benjamin et al.; licensee BioMed Central Ltd. 2014
Received: 4 November 2013
Accepted: 25 June 2014
Published: 7 July 2014
The application of next-generation sequencing technology to gene expression quantification analysis, namely, RNA-Sequencing, has transformed the way in which gene expression studies are conducted and analyzed. These advances are of particular interest to researchers studying organisms with missing or incomplete genomes, as the need for knowledge of sequence information is overcome. De novo assembly methods have gained widespread acceptance in the RNA-Seq community for organisms with no true reference genome or transcriptome. While such methods have tremendous utility, computational cost is still a significant challenge for organisms with large and complex genomes.
In this manuscript, we present a comparison of four reference-based mapping methods for non-human primate data. We utilize TopHat2 and GSNAP for mapping to the human genome, and Bowtie2 and Stampy for mapping to the human genome and transcriptome for a total of six mapping approaches. For each of these methods, we explore mapping rates and locations, number of detected genes, correlations between computed expression values, and the utility of the resulting data for differential expression analysis.
We show that reference-based mapping methods indeed have utility in RNA-Seq analysis of mammalian data with no true reference, and the details of mapping methods should be carefully considered when doing so. Critical algorithm features include short seed sequences, the allowance of mismatches, and the allowance of gapped alignments in addition to splice junction gaps. Such features facilitate sensitive alignment of non-human primate RNA-Seq data to a human reference.
KeywordsRNA-Sequencing Genomics Mapping
For the past decade, microarray gene expression data has revolutionized all areas of life science - allowing quantification of thousands of genes in several samples simultaneously, and paving the way for countless research studies. With the advent of RNA-Seq data, researchers now have the ability to perform untargeted gene expression analysis via next generation sequencing (NGS) technology, obtaining qualitative sequence information as well as quantitative gene expression data. RNA-Seq provides a comprehensive gene expression profile of each sample – with the potential to quantify and annotate all genes and isoforms. This untargeted approach proves particularly useful when quantifying gene expression in polymorphic cell lines and in organisms with a nonexistent or provisional reference genome where the sequence of features to be quantified is unknown [1, 2]. Due limited availability of genomic resources for under-characterized species, RNA-Seq is a popular method of choice for transcriptome analyses. We present a comparison of reference-based mapping methods for RNA-Seq data originating from Non-Human Primates (NHPs), and their implications in downstream differential expression analysis.
Mapping and assembly
RNA-Seq obtains gene expression estimates by assigning next-generation sequencing (NGS) reads to transcripts, either by mapping to a reference sequence, or assembling into contiguous stretches of sequence data (contigs) utilizing overlapping sequence amongst the reads themselves. These methods are termed reference-based alignment approaches, and de novo assembly approaches, respectively. Reference-based alignment and de novo assembly each have advantages and disadvantages. The optimal strategy likely depends on the experimental design and available genomic and computational resources. Reference-based approaches are far less computationally intensive than de novo approaches, and are well-suited for detection of low abundance transcripts . However, reference-based mapping relies on the availability and accuracy of a reference sequence. Mapping approaches must also handle reads with more than one potential mapping location. Such uncertainty can arise from paralogous gene families, repetitive sequences, and shared exons of alternatively spliced transcripts . Conversely, De novo assemblers have the advantage of not requiring a reference sequence. This allows discovery of transcripts not present in a reference genome or transcriptome. De novo assemblers overcome challenges caused by mapping uncertainty, as well as long introns that may be missed by reference-based mapping methods. While the freedom from a reference sequence is appealing, de novo assemblers require large amounts of computational resources and assembly time. In addition, deeper sequencing is needed for adequate de novo transcript assembly than is required by reference-based approaches . De novo assembly can be very useful for organisms with no reference genome, but due to its computational demands is most commonly used in cases of small genomes such as bacteria, archaea, and lower eukaryotes . In the case of large and complex transcriptomes, such as plants and mammals, reference-based mapping methods can overcome the computational and isoform-resolving challenges faced by de novo assemblers . In general, reference-based mapping approaches are the appropriate choice when a reliable reference genome exists. The choice becomes less clear when working with complex mammalian species with little to no genomic resources, such as under-characterized NHPs.
We propose the use of a human reference genome (or transcriptome) for reference-based mapping and expression quantification of NHP RNA-Seq data. An elegant study by Hornett et al.  evaluated the utility of divergent species gene sets for annotation of de novo assembly. When utilizing reference gene sets from divergent species, the authors found little bias in expression levels and strong correlation in gene expression up to approximately a 100 million year window. More divergent species (greater than 100 million years apart) suffered from incorrect assignment of assembled contigs to genes. The authors found little difference in the number of genes having contigs assigned, when using chimpanzee, orangutan, macaque, or marmoset vs. human. In addition, the authors compared the use of de novo assembled transcriptomes to mapping directly to the reference predicted gene set for quantifying gene expression. When comparing the mapping of the reads directly to the predicted gene set vs. the de novo assemblies, the authors found that mapping to the gene set recovered expression data for more genes, and that expression levels were strongly correlated within gene sets detected by both methods .
In this manuscript, we examine the mapping efficacy and utility for differential expression analyses of various reference-based approaches when the reference sequence originates from a closely related species. Specifically, we utilize the human genome and transcriptome as a reference for non-human primate RNA-Seq data from yellow baboons, Papio cynocephalus. We compare different reference-based mapping approaches – one representative method from four different mapping method categories  to map to the reference genome, and two of these methods to map to the reference transcriptome as well.
Reference-based mapping methods
Reference-based mapping methods overview - summary of the four reference-based mapping method categories compared in this study
Index reference sequence, Rapidly look up candidate mapping loci.
Typically faster and less sensitive than Seed Methods.
Align short subsequences of reads to find candidate mapping loci,
Narrow candidates by extending alignments. Typically slower and
more sensitive than Burrows-Wheeler Transform Methods.
Exon First Method
Align whole reads with Unspliced Aligners, Search for spliced
alignments in remaining reads. Typically faster and less sensitive
than Seed and Extend Methods.
Seed and Extend Method
Align short subsequences of reads to find candidate mapping loci,
Narrow candidates by extending alignments. Typically slower and
more sensitive than Exon First Methods.
The first major category has been referred to as “unspliced read aligners”, which align reads to a reference without allowing large gaps such as those arising from reads spanning exon boundaries, or splice junctions [6–12]. Unspliced aligners may be used to align reads to a reference transcriptome or a reference genome. Aligning to a reference transcriptome alleviates the need to handle splice junctions, but is limited to the analysis of known transcripts. When utilizing unspliced read aligners to map to a reference genome, reads may be mapped to potentially novel exons, however reads spanning splice junctions are likely to remain unmapped. Unspliced read aligners are generally divided into two subcategories based on their methodology -“seed methods” and “Burrows-Wheeler transform (BWT) methods” . Seed methods align short subsequences, or seeds, from each read to a reference, requiring a perfect match in the seed subsequence. More sensitive alignment methods are then used to eliminate candidate regions where seeds cannot be extended to full read alignments. The unspliced seed method we chose to test is Stampy . BWT methods create a Burrows-Wheeler index of the reference sequence and efficiently search for perfect matches. Mismatches may be allowed with an exponential increase in computational complexity. In general, Burrows-Wheeler transform methods are faster than seed methods, but seed methods provide increased sensitivity . The unspliced BWT method we chose to test is Bowtie2 . A similar analysis reported that when the true reference transcriptome is available BWT methods are faster with minimal differences in alignment specificity. When the reference transcriptome of a distant species is available, seed methods result in large increases in sensitivity . These increases in sensitivity have also been observed when aligning reads to polymorphic regions .
Methods in the unspliced read aligner category are either limited by their inability to handle reads spanning splice junctions when mapping to a reference genome, or limited to known exons and splice sites when mapping to a reference transcriptome. The second major category of mapping methods, “spliced aligners”, align reads to the whole genome, with intron-spanning reads requiring large gaps. Spliced aligners accommodate junction-spanning reads by splitting them up into smaller segments and determining the best match based on alignment scores and known di-nucleotide splice signals [14–18]. The spliced aligners also fall into two major categories based on their methodology - “exon-first” methods and “seed-and-extend” methods. Exon-first methods begin by mapping whole reads to the genome using unspliced read aligners, and then search for spliced alignments with the remaining reads. Exon-first approaches are efficient, but may miss true spliced alignments when an unspliced alignment is available in a pseudogene . Seed-and-extend methods break reads into seeds which are mapped to the genome, and much like seed-based unspliced aligners, candidate mapping locations are examined with more sensitive alignment methods. Iterative extension and merging of initial seeds is performed to determine the spliced alignment. As with unspliced aligners, seed-and-extend methods are slower, but more sensitive, and show improved performance when mapping reads from polymorphic samples . The exon-first and seed-and-extend methods we chose to test are TopHat2 , and GSNAP , respectively.
After mapping to a reference genome, a transcriptome reconstruction step is required to appropriately assign reads to transcipts. Aligned reads spanning splice junctions are “connected”, and read counts to various isoforms of each gene are reported. This is accomplished by building a graph to represent all possible isoforms of an expressed feature. Different paths through the graph represent individial isoforms [20, 21].
We present a comparison of reference-based mapping methods for RNA-Seq data having no true reference, but that of a closely related model organism. We assess the various methods using mapping rates, mapping locations, correlation of gene expression, as well as the utility of the data for differential expression analyses. To better understand the differences between the methods, we examine more closely the default behaviors of the four mapping methods used.
Bowtie2 is a BWT-based unspliced aligner, with the recent addition of supported gapped alignments. This method may actually be considered a combination BWT and seed unspliced mapper. Bowtie2 extracts multiple substrings or seeds from each read and aligns them using a BWT approach with no gaps, then extends alignments using a Smith-Waterman-like scoring scheme. By default, seeds are 22 bp and no mismatches are allowed within the seed. Base call quality scores are incorporated by assigning more severe mismatch penalties at high-quality read positions. Gap initiation and extension penalties are also utilized, while the number and lengths of gaps within extended alignments are not restricted. Bowtie2 does not guarantee that the alignment reported is the best in terms of alignment score, and when there is more than one potential mapping location of equal score, one reported location is selected at random.
Stampy is a seed-based unspliced aligner that uses a hash table to store the locations of 15-mers in the reference sequence. For each read, candidate alignment locations are identified with a hash lookup of the 15-mers in the read, allowing for one mismatch. Candidate mapping locations are screened for sequence similarity with the read, and then full alignments are attempted at each remaining candidate location. As with Bowtie2, Stampy also considers base quality calls, and allows gaps in this alignment step. Stampy also allows the use of BWT as a “pre-mapping” step to increase speed, however the manual does not recommend this for paired-end data. For this reason, and for method-comparison purposes, we did not use the BWT option. Stampy uses a Bayesian probabilistic model to represent mapping quality, and reports the single best alignment location.
TopHat2 is an exon-first spliced read aligner that uses Bowtie2 as a base algorithm. Like the original TopHat, potential splice sites are detected within candidate alignment locations, and used in a subsequent step to align reads spanning exon-junctions. TopHat2 first maps to the transcriptome with Bowtie2. Remaining whole reads are then mapped to the reference genome, and then spliced alignments are attempted. Most of the default Bowtie2 parameters when run within Tophat2 are the same as the default standalone Bowtie2 parameters, with the exception of seed length and intervals between seeds. TopHat2 seeds within Bowtie2 are 20 bp, and the interval between seeds is longer. TopHat2 reduces alignment to pseudogenes by aligning reads preferentially to genes within provided annotation. This use of annotation by TopHat2 has been shown to increase sensitivity and accuracy of mapping. We provided TopHat2 with annotation information for this purpose. In addition to gapped alignment in the Bowtie2 step, TopHat2 also allows indels in the spliced alignment detection step.
GSNAP is a seed-extend spliced aligner that allows for long and even chromosome spanning gaps, likely resulting from gene-fusion events. GSNAP uses all 12-mers in the read to identify candidate mapping locations, not favoring positions within short reads. Alignments are extended at candidate loci, requiring that the read has a consecutive stretch of 14 nucleotides exactly matching the reference sequence. GSNAP allows multiple mismatches and long indels, but only allows one splice or indel per read. Splicing is identified in two different ways - searching surrounding sequence for splice signals, or a user-provided set of known exon-intron boundaries.
Also worth noting is each methods’ default handling of reads mapping to multiple locations. With the default parameters, TopHat2 reports the single alignment having the best alignment score. If there are multiple locations with the same optimal alignment score, TopHat2 will report up to 20 of these results. Bowtie2 by default performs a non-exhaustive search for distinct valid alignments, and then reports the single best result of the identified alignments. If there are multiple alignments with the same alignment score, a single result is selected at random. Because the search is not exhaustive, Bowtie2 does not guarantee that the alignment reported is the absolute best in terms of alignment score. Similar to Bowtie2, Stampy reports the single best alignment, selecting one result at random when more than one alignment shares the best mapping quality. GSNAP reports up to 100 valid mapping loci based on match, mismatch, and gap thresholds. As with the matching parameters, all of the reporting parameters of these mapping tools may be adjusted.
Results and discussion
Study participants and RNA sequencing read pairs in millions before, and after pre-mapping quality control
Bacterial dose (CFU)
Summary of mapping metrics results for the compared reference-based methods
Base Mismatch Rate
Intragenic Mapping Rate
Intergenic Mapping Rate
rRNA Mapping Rate
Correlation Between Baseline Samples
Differentially Expressed Genes
We also observe higher base mismatch rates when mapping to the reference transcriptome. This may be partly due to the fact that the default settings of Bowtie2 and Stampy are more sensitive, and so reads may be successfully mapped to more divergent regions. Another possible contributing factor is the presence of splice variants in baboon not present in human. The human reference transcripts may contain additional or missing exons with respect to baboon, causing mapped reads that span the true exon junction to have a high mismatch rate for a short stretch of sequence. Unique, duplication, and rRNA rates are all similar across the mapping methods.
We also examined the mapping locations - intergenic and intragenic (exonic and intronic). Mapping to the reference transcriptome obtains an intragenic mapping rate of 1, simply due to the nature of the mapping procedure. When mapping to the reference genome, we see that Bowtie2, Stampy, and GSNAP all obtain a significantly higher intergenic and intronic mapping rates than TopHat2, ranging from 0.1015 to 0.186 (Bowtie2-G), 0.0917 to 0.2011 (Stampy-G), 0.0856 to 0.1537 (GSNAP), and 0.0248 to 0.0654 (TopHat2). Conversely, TopHat2 obtains a significantly higher exonic mapping rates ranging from 0.8705 to 0.9258 compared to 0.6519 to 0.7722 for Bowtie2-G, 0.5927 to 0.7285 for Stampy, and 0.7023 to 0.7938 for GSNAP. This is likely due to TopHat2’s preferential mapping to the reference transcriptome prior to exploration of genomic locations.
Correlation of gene expression
We also computed the Pearson correlation between the different methods for identical samples. This is illustrated by the blocks off the diagonal in Panel A of Figure 3. Bowtie2-G and Stampy-G had the highest correlations, ranging from 0.9805 to 0.9877. GSNAP and Stampy had the lowest correlations, ranging from 0.8690 to 0.9047. As performed in the above analysis,the Spearman correlation coefficients were also computed. When examining correlations between different methods for identical samples we also observed similar results with slightly lower correlation coefficients overall. These results are shown in Additional file 1: Figure S1.To further illustrate the concordance in gene expression estimates obtained with each mapping method, we constructed a dendrogram, computing the Euclidean distance in gene expression between methods for each of the baseline (0 hours) samples, and then averaging the distances. Examining the dendrogram in Panel C of Figure 3, we observe TopHat2 and Bowtie2 have the shortest distance between gene expression estimates, followed by GSNAP, Bowtie2-G, and Stampy-G, and finally Stampy. These results are in accordance with the correlations shown in Panel A of Figure 3. The weakest correlations are seen between the most and least sensitive methods (with respect to detecting human genes), Stampy and GSNAP, respectively. This difference is less pronounced for methods of comparable sensitivity such as Bowtie2 and TopHat2. For the less sensitive methods, it is possible that reads from divergent regions of transcripts are not successfully mapped, affecting expression estimates.
Differential expression analysis
Read counts by evolutionary distance
Summary of comparison results
Transcriptome Mapping with Bowtie2
Transcriptome Mapping with Stampy
Genome Mapping with Bowtie2
Genome Mapping with Stampy
Genome Mapping with TopHat2
Genome Mapping with GSNAP
We obtained data from 12 adult male colony-bred yellow baboons (Papio cynocephalus), inoculated with five different levels of Streptococcus pneumoniae, a bacterial pathogen causing pneumonia . Animals were housed in the Duke University Vivarium (Durham, NC) and handled in accordance with American Association for Accreditation of Laboratory Animal Care guidelines. The experimental protocol was approved by the Duke University Institutional Animal Care and Use Committee.
The bacterial doses administered to the participants included 10 9 colony forming units (CFU) (n = 4), 10 8 CFU (n = 3), 10 7 CFU (n = 1), 10 6CFU (n = 1), and 0 CFU (n = 3). Peripheral blood samples for gene expression analysis were taken at five different time points - immediately before inoculation, and 6, 24, 48, and 168 hours following inoculation. Antibiotics were administered immediately following the 48 hour time point. Each participant was evaluated to determine clinical pneumonia status using pre-determined criteria. A participant was classified as developing clinical pneumonia if each of the three following conditions were met:
A white blood cell count of greater than 15,400, less than 3,400, a 2–fold change from baseline measurement, or greater than or equal to 90% neutrophilia at 24 or 48 hours.
A positive culture of Streptococcus pneumoniae from Bronchoalveolar Lavage (BAL) or blood samples at 48 hours.
Any one or more of the following at 24 or 48 hours:
Heart rate of greater than 100 bpm
A 25% or greater increase in respiratory rate from baseline
Positive indication of infiltrate on a chest X-ray
Decreased Food Intake
A fever of 38.2 degrees Celsius or greater
Three of the 60 samples did not meet RNA quality standards. Table 2 shows the participants meeting clinical criteria at 48 hours. Sequence data at baseline for all participants has been deposited in NCBI’s Gene Expression Omnibus  and are accessible through GEO Series accession number GSE57485 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57485).
RNA isolation and library preparation
Samples were collected in PreAnalytiX PAXgene Blood RNA collection tubes, and total RNA was isolated using the PreAnalytiX PAXgene Blood RNA miRNA isolation kit. RNA quality was assessed using Agilent Bioanalyzer and Nanodrop spectrophotometry, and samples with RNA integrity number greater than or equal to 7, and greater than or equal to 1 microgram of RNA were deemed sufficient for RNA-Seq. Abundant globin transcripts were depleted with the GlobinClear Globin RNA Reduction for RNA-Seq protocol. The fragment library was prepared with the Illumina TruSeq RNA Seq protocol, and Illumina HiSeq RNA Sequencing was performed, run in 6-plex per flow cell lane, obtaining 50 bp paired-end reads.
Read quality control and trimming
Read quality analysis was performed on the raw data using FastQC version 0.10.1 . Quality trimming and adapter clipping were performed using Trimmomatic version 0.25 , trimming trailing bases below quality 20, clipping Illumina adapters, and discarding clipped reads shorter than 25 bp. FastQC was used to re-assess the integrity of the clipped reads prior to subsequent mapping and analysis. Reads whose mates were discarded due to quality trimming and length constraints were removed from the fastq files used for mapping. Table 2 shows the number of read pairs in millions for each sample, before and after quality trimming.
The Illumina iGenomes UCSC hg19 human reference genome and annotation was used as a reference, downloaded March 2013. To generate a fasta file of transcripts for a reference transcriptome, the BEDTools “getfasta” utility version 2.17.0  was used to extract the transcript sequences specified by coordinates in the Ensembl GTF file, from the reference genome sequence. Both the Ensemble reference annotation and genomic sequence files were those included in the iGenomes download. Clipped reads were mapped to the hg19 genome and transcriptome using Bowtie2 version 2.0.6. , and Stampy version 1.0.17. . Clipped reads were mapped to the hg19 human genome using Tophat version 2.0.7 , and GMAP (GSNAP) version 2013-03-12 . Default parameter settings were used for all methods. SAM/BAM conversions, sorting, indexing, and marking of PCR duplicates were performed with SAMtools version 0.1.18  and Picard version 1.83 .
Quantification and normalization
Read counts for each transcript were obtained with HTSeq , specifically the intersection-nonempty mode of htseq-count. When quantifying expression for differential expression analysis, only the primary alignments for each read were counted by HTSeq. For the purpose of this manuscript, we investigated the default sensitivity of each mapping method when mapping NHP RNA-seq data to a human reference, and so the reporting of suboptimal alignments was not critical. Conditional Quantile Normalization (CQN) was used to obtain normalized gene expression estimates .
To compare the reference-based mapping methods, we examined several mapping metrics including mapping rates, mapping locations, transcripts detected, mate pair concordance, and coverage over transcripts. All mapping metrics were computed with RNA-SeQC version 1.1.7 , for the entire gene set, as well as subsets of genes in Gene Ontology  functional groupings, and evolutionary distance groupings. See the corresponding methods sections for more details on the generation of these gene lists.
To assess the number of detected genes within different functional groupings, we took the sets of genes present within each of the 23 top-level biological process ontology terms to generate 23 functional group gene lists. BEDtools version 2.17.0  was used to select reads mapping to each gene list from each of the BAM files, which were then analyzed with RNA-SeQC to determine the number of detected genes.
Evolutionary distance groups
To assess the number of detected genes at various evolutionary distances, we obtained the mRNA sequence of all olive baboon (Papio anubis) RefSeq genes from the UCSC genome browser, and determined the human ortholog and evolutionary distance of each reference baboon gene. Orthologs were found by performing a BLASTN  search against the human transcriptome, and taking the top hit by percent identity. Jukes-Cantor distances  were then computed between the orthologous sequences. Five subset gene lists were created using evolutionary distance: greater than or equal to 0.1 (n = 19), less than 0.1 and greater than or equal to 0.075 (n = 20), less than 0.075 and greater than or equal to 0.05 (n = 62), less than 0.05 and greater than or equal to 0.025 (n = 202), and less than 0.025 (n = 155). As with the functional gene lists, BEDtools version 2.17.0  was used to select reads mapping to each gene list from each of the BAM files.
Differential expression analysis
For each of the four mapping methods, we determined the differentially expressed transcripts using edgeR . The CQN package was used to compute gene-specific correction factors, or offsets, and these offsets were provided to the glm functions of edgeR to correct for GC content and transcript length biases. This procedure is detailed in the cqn manual . Across methods, we report the number of shared genes exhibiting significant differential expression (p <0.05 after Bonferonni correction) between healthy participants, and participants with clinical pneumonia at maximal symptoms (48 hours). A 4-way venn diagram web tool was used to generate figures of overlapping DE genes .
All read mapping and quality control analyses were performed on the Duke Shared Computing Resource, a cluster of Intel x86 compute notes running Linux. Each sample using each mapping method was run on a single node with 16 GB of memory.
Colony forming unit
Next generation sequencing
We would like to acknowledge the support of the NIH CTSA (Clinical and Translational Science Award) 1UL1RR024128-01, and the Gates Foundation, number OPP1017554 (G.S.G.). We would also like to acknowledege BE, LH, CW, KW, BK, BN, CM, MB, and TV for helpful discussions.
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63.PubMed CentralPubMedView ArticleGoogle Scholar
- Hornett EA, Wheat CW: Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a scaffold and the utility of evolutionary divergent genomic reference species. BMC Genomics. 2012, 13: 361-PubMed CentralPubMedView ArticleGoogle Scholar
- Martin JA, Wang Z: Next-generation transcriptome assembly. Nat Rev Genet. 2011, 12 (10): 671-682.PubMedView ArticleGoogle Scholar
- Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010, 26 (4): 493-500.PubMed CentralPubMedView ArticleGoogle Scholar
- Garber M, Grabherr MG, Guttman M, Trapnell C: Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011, 8 (6): 469-477.PubMedView ArticleGoogle Scholar
- Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24 (5): 713-714.PubMedView ArticleGoogle Scholar
- Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (15): 1966-1967.PubMedView ArticleGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760.PubMed CentralPubMedView ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-PubMed CentralPubMedView ArticleGoogle Scholar
- Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M: SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol. 2009, 5 (5): e1000386-PubMed CentralPubMedView ArticleGoogle Scholar
- Lunter G, Goodson M: Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2011, 21 (6): 936-939.PubMed CentralPubMedView ArticleGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9 (4): 357-359.PubMed CentralPubMedView ArticleGoogle Scholar
- Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK: Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics. 2009, 25 (24): 3207-3212.PubMed CentralPubMedView ArticleGoogle Scholar
- De Bona F, Ossowski S, Schneeberger K, Rätsch G: Optimal spliced alignments of short sequence reads. Bioinformatics. 2008, 24 (16): i174—i180-PubMedView ArticleGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111.PubMed CentralPubMedView ArticleGoogle Scholar
- Au KF, Jiang H, Lin L, Xing Y, Wong WH: Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 2010, 38 (14): 4570-4578.PubMed CentralPubMedView ArticleGoogle Scholar
- Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, Liu J: MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010, 38 (18): e178-PubMed CentralPubMedView ArticleGoogle Scholar
- Wu TD, Nacu S: Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010, 26 (7): 873-881.PubMed CentralPubMedView ArticleGoogle Scholar
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14 (4): R36-PubMed CentralPubMedView ArticleGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511-515.PubMed CentralPubMedView ArticleGoogle Scholar
- Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C, Rinn JL, Lander ES, Regev A: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010, 28 (5): 503-510.PubMed CentralPubMedView ArticleGoogle Scholar
- DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire MD, Williams C, Reich M, Winckler W, Getz G: RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics. 2012, 28 (11): 1530-1532.PubMed CentralPubMedView ArticleGoogle Scholar
- Weber APM, Weber KL, Carr K, Wilkerson C, Ohlrogge JB: Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol. 2007, 144: 32-42.PubMed CentralPubMedView ArticleGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29.PubMed CentralPubMedView ArticleGoogle Scholar
- Kraft BD, Piantadosi CA, Benjamin AM, Lucas JE, Zaas AK, Betancourt-Quiroz M, Woods CW, Chang AL, Roggli VL, Marshall CD, Ginsburg GS, Welty-Wolf K: Development of a novel pre-clinical model of pneumococcal pneumonia in non-human primates. Am J Respir Cell Mol Biol. 2014, 5: 995-1004.View ArticleGoogle Scholar
- Edgar R, Domrachev M, Lash A: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207-210.PubMed CentralPubMedView ArticleGoogle Scholar
- Andrews S: FASTQC. A quality control tool for high throughput sequence data. 2010, [http://www.bioinformatics.babraham.ac.uk/projects/fastqc]Google Scholar
- Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B: RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res. 2012, 40 (W1): W622—W627-PubMed CentralPubMedView ArticleGoogle Scholar
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842.PubMed CentralPubMedView ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079.PubMed CentralPubMedView ArticleGoogle Scholar
- Picard software package (version 1.83). [http://picard.sourceforge.net]
- Planet E, Attolini CSO, Reina O, Flores O, Rossell D: htSeqTools: high-throughput sequencing quality control, processing and visualization in R. Bioinformatics. 2012, 28 (4): 589-590.PubMedView ArticleGoogle Scholar
- Hansen KD, Irizarry RA, Wu Z: Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012, 13 (2): 204-216.PubMed CentralPubMedView ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.PubMedView ArticleGoogle Scholar
- Matsubara H, Jukes TH, Cantor CR: Structural and evolutionary relationships of ferredoxins. Brookhaven Symp Biol. 1968, 21: 201-216.PubMedGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139-140.PubMed CentralPubMedView ArticleGoogle Scholar
- Hansen KD, Wu Z: CQN (Conditional quantile normalization). 2012, [http://www.bioconductor.org/packages/release/bioc/vignettes/cqn/inst/doc/cqn.pdf]Google Scholar
- Oliveros JC: VENNY. An interactive tool for comparing lists with Venn diagrams. 2007, [http://bioinfogp.cnb.csic.es/tools/venny/index.html]Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.