Limitations of alignment-free tools in total RNA-seq quantification

Background Alignment-free RNA quantification tools have significantly increased the speed of RNA-seq analysis. However, it is unclear whether these state-of-the-art RNA-seq analysis pipelines can quantify small RNAs as accurately as they do with long RNAs in the context of total RNA quantification. Result We comprehensively tested and compared four RNA-seq pipelines for accuracy of gene quantification and fold-change estimation. We used a novel total RNA benchmarking dataset in which small non-coding RNAs are highly represented along with other long RNAs. The four RNA-seq pipelines consisted of two commonly-used alignment-free pipelines and two variants of alignment-based pipelines. We found that all pipelines showed high accuracy for quantifying the expression of long and highly-abundant genes. However, alignment-free pipelines showed systematically poorer performance in quantifying lowly-abundant and small RNAs. Conclusion We have shown that alignment-free and traditional alignment-based quantification methods perform similarly for common gene targets, such as protein-coding genes. However, we have identified a potential pitfall in analyzing and quantifying lowly-expressed genes and small RNAs with alignment-free pipelines, especially when these small RNAs contain biological variations. Electronic supplementary material The online version of this article (10.1186/s12864-018-4869-5) contains supplementary material, which is available to authorized users.


Background
RNA-seq continues to pose great computational and statistical challenges. These challenges range from accurately aligning sequencing reads to accurate inference of gene expression levels [1,2]. The central computational problem in RNA-seq remains the efficient and accurate assignment of short sequencing reads to the transcripts they originated from and using this information to infer gene expression [3][4][5][6]. Conventionally, read assignment is carried out by aligning sequencing reads to a reference genome, such that relative gene expression levels can be inferred by the alignments at annotated gene loci [2,7]. These alignment-based methods are conceptually simple, but the read-alignment step can be timeconsuming and computationally intensive despite recent advancements in fast read aligners [4,8,9]. Recently, several novel tools introduced alignment-free transcript quantification utilizing k-mer-based counting algorithms [4][5][6]. These alignment-free pipelines are orders of magnitude faster than alignment-based pipelines, and they work by breaking sequencing reads into k-mers and then performing fast matches to pre-indexed transcript databases [4]. To achieve fast transcript quantification without compromising quantification accuracy, different sophisticated algorithms have been implemented in addition to kmer counting, such as pseudoalignments (Kallisto [5]) or quasi-mapping along with GC-and sequence-bias corrections (Salmon [6]). Given the wide variety of choices in RNA-seq tools, several studies have benchmarked subsets of read aligners and quantification software. These studies generally suggest that most of the current RNA-seq tools display comparable accuracy [10][11][12].
However, the existing benchmarking studies were generally carried out on either simulated RNA-seq datasets [12] or RNA-seq datasets that focused only on long RNAs, such as messenger RNAs (mRNAs) and long noncoding RNAs (lncRNAs) [10,11,13,14]. Consequently, they did not evaluate whether these tools are suitable for total RNA quantification in datasets that include small RNAs, such as transfer RNAs (tRNAs) and small nucleolar RNAs (snoRNAs). To some extent, the lack of a comprehensive comparison between small and long RNA quantification may be due to the inability of most current RNA-seq methods to efficiently recover these small RNAs [15]. Recently, however, a novel method has overcome this problem by using a thermostable group II intron reverse transcriptase (TGIRT) during RNA-seq library construction [15]. This method enables more comprehensive profiling of full-length structured small non-coding RNAs (sncRNA) along with long RNAs in a single RNAseq library workflow [15][16][17]. Thus, it is now possible to benchmark RNA-seq quantification tools on structured small non-coding RNAs.
To address whether current RNA-seq tools can quantify small RNAs as accurately as they do with long RNAs, we tested four gene quantification pipelines on a previously sequenced TGIRT RNA-seq (TGIRT-seq) dataset [15] obtained from the well-studied microarray/sequencing quality control consortium (MAQC) sample set [18,19]. Of the four tested pipelines, two are alignment-based and two are alignment-free. We found that all four pipelines are mostly concordant in quantifying common differentially-expressed gene targets, such as mRNAs and mRNA-like spike-ins. However, with respect to quantifying small or lowly-expressed genes, we found that the alignment-based pipelines significantly outperformed the alignment-free pipelines.

Results
Study design. We tested four RNA-seq quantification pipelines, including two alignment-free and two alignment-based pipelines ( Fig. 1): (A) Kallisto, a kmer counting software that uses pseudoalignments for reducing quantification error and improving speed [5]; (B) Salmon, another k-mer counting software that learns and corrects sequence-specific and GC biases on-the-fly, in addition to using quasi-mapping for further improvement in transcript quantification [6]; (C) HISAT2+featureCounts, a conventional alignment-based pipeline aligning sequencing reads to human genome by a splice-aware aligner, HISAT2 [9], and quantifying genes by featureCounts [3]; and (D) TGIRT-map, a customized alignment-based pipeline using an iterative genome-mapping procedure (Additional file 1).
The benchmarking dataset we used here consists of TGIRT-seq libraries for four well-defined samples (samples A-D) from the microarray/sequencing quality control consortium (MAQC [18,19]), each obtained in triplicate [15]. The MAQC samples A and B represent universal human reference total RNA and human brain reference total RNA, respectively, that are mixed with corresponding External RNA Controls Consortium (ERCC) spike-in transcripts. Samples C and D are mixtures of samples A and B at different ratios [18,19]. The known mix ratios in samples C and D allow for the calculation of expected fold-changes between samples C and D from the measured fold-changes between samples A and B for each gene [18,19]. Mapping statistics for all pipelines are summarized in Additional file 2: Supplementary Tables S1-4.

Gene detection and quantification.
To test if all four pipelines produced concordant gene quantifications, we first compared the numbers of detected genes across all methods. We considered a gene as detected if it was assigned with a transcripts per million (TPM) value > 0.1. We found that the numbers of detected genes were similar among all tested pipelines (Fig. 2a). Moreover, by comparing the identities of the detected genes, we found that the vast majority of the genes were detected by all tested pipelines (Additional file 3). However, we also found that Salmon and TGIRT-map consistently detected more unique genes compared to the other two pipelines (Additional file 3; Friedman test, p = 4 × 10 −7 ;Paired Wilcoxon test, p < 0.05 for all pairwise pipeline comparisons).
Even though Salmon and TGIRT-map both detected more unique genes than did the other two pipelines, the additional genes detected were different. Salmon primarily recovered more long RNAs (labeled as antisense, other ncRNAs, and protein-coding genes; Additional file 3). This enrichment in long RNAs could be pipeline-typespecific (when compared to alignment-based pipelines) or algorithm-specific (when compared to Kallisto). Pipelinetype-specific differences could be due to the probabilistic gene quantification methods of Salmon [6]. While Salmon can assign a single fragment to multiple genes, each fragment can only be assigned to a single gene by the alignment-based pipelines under our parameters. In terms of algorithm, the result might be due to how Salmon corrects for GC and sequencing biases or how it handles equivalent classes (i.e. multiply-mapped reads) relative to Kallisto [6]. On the other hand, TGIRT-map recovered more miRNAs (likely to be mis-counting of fragmented exons or unannotated exons in these libraries), some long non-coding RNAs (annotated as other ncRNAs), and small non-coding RNAs (annotated as other sncRNA) (Additional file 3). These enrichments under TGIRTmap could be pipeline-type-specific when compared to alignment-free pipelines, which may be affected by the choice of k-mer size. The differences between TGIRT-map and HISAT2+featureCounts were possibly the result of an additional local-alignment step (BOWTIE2) [20] after the spliced-read mapping step (HISAT2) [9] in TGIRT-map (Additional file 1).

Fig. 1
Analysis pipelines and experimental design. We used two pipelines each for the alignment-based and alignment-free approach. The alignment-based pipelines consisted of a HISAT2+featureCounts pipeline using HISAT2 [9] for aligning reads to the human genome and using featureCounts [3] for gene counting, and TGIRT-map, a customized pipeline for analyzing TGIRT-seq data. Further details regarding the custom TGIRT-map pipeline are provided in Methods and in Additional file 1. Two alignment-free tools, Kallisto [5] and Salmon [6], were used for quantifying transcripts. For alignment-free tools, gene-level abundances were summarized by Tximport [33]. All differentially-expressed gene tests were done by DESeq2 [21] To evaluate if gene expression level estimates were concordant among the tested pipelines, we made pairwise comparisons of these estimates between pipelines (Additional file 4). The gene expression level estimates were generally highly correlated, with Pearson's correlations ranging from 0.68-0.99 (Additional file 4). The Pearson's correlation coefficients were consistently very high for pairwise comparisons between alignment-free pipelines (Kallisto vs Salmon; 0.98-0.99) or between alignment-based pipelines (HISAT2+featureCounts vs TGIRT-map; 0.95-0.96). By contrast, any pairwise correlation between an alignment-free tool and an alignmentbased pipeline was generally lower (0.68-0.72; Additional file 4).
Further analyses revealed that different gene types showed distinct variations in gene expression level correlations for every pairwise pipeline comparison (Additional file 4). For instance, ERCC spike-ins, in vitro transcripts that mimic protein-coding transcripts, were recovered with very high correlations for all pairwise pipeline comparisons (Pearson's correlations: 0.99-1; Additional file 4). In comparison to the true abundances that were spiked into the RNA samples [15], the relative expression levels of these ERCC spike-ins were estimated as they were designed and tightly correlated to their true concentrations. We observed a near-perfect linear relationship between inferred TPM values and true concentrations (both log-transformed) for all pipelines ( Fig. 2b; R 2 > 0.94 for every sample and pipeline; Kruskal-Wallis-test p = 0.472). By contrast to ERCC transcripts, the gene expression level estimates of other common gene targets (antisense, protein-coding genes, etc.) were not as highly correlated among tested pipelines (Additional file 4). To identify the source of this discrepancy, we divided genes into quantile groups of gene lengths or gene expression levels, and found that the abundance estimation inconsistencies among pipelines were largely caused by short gene lengths and low expression levels (Additional file 5), as suggested previously [13].

Differential expression measurements of long genes.
The most popular application of RNA-seq is the detection of differentially-expressed genes. To compare the accuracy of differential expression inference among pipelines, we plotted the deviation of measured log2 fold-changes to the expected log2 fold-changes between samples A and B for every ERCC spike-in transcript (Fig 3a; 23 ERCC transcripts in each expected differentially-expressed group). For all pipelines, fold-changes between samples A and B were mostly underestimated (negative log2 fold-changes; Fig. 3a), which may be the result of DESeq2 fold change shrinkage [21]. To quantify the accuracy of fold-change detection for each method, we computed root mean square errors (RMSE) for each ERCC group (Additional file 2). Comparisons of the RMSE values for each ERCC group among pipelines indicated that alignment-free pipelines had comparable performance to alignment-based pipelines in estimating differential expression of ERCC spike-ins (Friedman-test p = 0.016; two-sided paired Wilcoxon-test p > 0.12 for all pairwise comparisons). To test whether the four pipelines provided reliable gene expression level estimates for calling differentiallyexpressed genes, we called differentially expressed ERCC spike-in transcripts using different p-value cutoffs, compared each call to the known spike-in concentration differences, and plotted the results as receiver operating characteristic (ROC) curves (Fig. 3b). By design, there are 23 spike-ins with same concentrations and 69 spike-ins with different concentrations between samples A and B [22]. Using the areas under the curves (AUCs) of the ROC curves, we found that alignment-free tools (Kallisto and Salmon) performed slightly better than alignment-based pipelines in accurately calling differentially-expressed spike-in transcripts (AUC: 0.65, 0.66, 0.71, and 0.68 Receiver operating characteristic curves for calling ERCC spike-ins as differentially-expressed. Areas under the curve were computed using p-values assigned by the differential expression caller (DESeq2 [21]) on abundance estimations of each ERCC transcript from each pipeline for HISAT2+featureCounts, TGIRT-map, Kallisto, and Salmon respectively; one non-differentially-expressed and two differentially-expressed ERCC spike-ins had TPM = 0 in all pipelines). In addition to analyzing in vitro ERCC transcripts, we also verified that all pipelines performed nearly identically in terms of quantifying in vivo transcripts, by comparing estimated expression levels to TaqMan qRT-PCR results published previously (Additional file 6) [18].

Whole transcriptome differential expression analysis.
To benchmark the suitability of different gene quantification pipelines for differential expression analysis of all RNA types, we used the known sample mix ratios and the fold-change measurements between samples A and B to construct the expected fold-changes between samples C and D for every gene [18]. By comparing the measured to the expected fold-changes between samples C and D, we found that both alignmentbased pipelines showed superior performance over alignment-free pipelines (R 2 = 0.63, 0.62, 0.47, and 0.45 for HISAT2+featureCounts, TGIRT-map, Kallisto, and Salmon, respectively; Fig. 4a). Further analyses showed that low correlations between measured and expected fold-changes were due to lowly-expressed genes (Fig. 4a), as suggested previously [15,19]. However, to our surprise, Kallisto and Salmon had exceptionally poor fits to the corresponding models for these lowly-expressed genes ( Fig. 4a; R 2 = 0.44, 0.43, 0.17, and 0.13 for the lowest 75% expressed genes from HISAT2+featureCounts, TGIRT-map, Kallisto, and Salmon, respectively).
In addition to lowly-expressed genes, we also found that short gene lengths greatly decreased the accuracies in fold-change analyses, particularly for alignment-free R 2 values were computed from the expected and measured log2 fold-change values between samples C and D for each pipeline using different gene sets grouped by average gene expression levels. The first gene set, labeled "Total RNA", includes all genes. The subsequent gene sets include only the genes with the top 1%, top 10%, top 25%, or bottom 75% expression levels, as indicated. Bars are color-coded by pipelines. b Gene lengths influenced accuracies of fold-change estimation. Genes from each pipeline were grouped by their gene lengths into four quantile groups. For each quantile group, R 2 values were computed from the expected and measured log2 fold-change values between samples C and D. Bars are color-coded by pipelines. Coefficients of determination (R 2 ) were computed by R2 function from R caret package [35]. Negative R 2 values indicate exceptionally bad fold-change predictions [23] from the software as illustrated in Additional files 7 and 8, where the fold-change prediction do not fit well to the samples mix-ratio pipelines ( Fig. 4b; R 2 = 0.29, 0.33, − 0.16, and − 0.17 for the shortest 25% genes from HISAT2+featureCounts, TGIRT-map, Kallisto, and Salmon, respectively). As we are testing whether the measured (i.e., observed) foldchanges fit into the expected fold-changes (i.e., the model) constructed by the fold-changes between samples A and B and the known mix-ratios of samples A and B in samples C and D, R 2 values served as a metric of quantifying the measurement errors relative to the known model. Here, we used where y i is the predicted and f i the expected log 2 fold change for gene i, and y is the mean log 2 fold change across genes. Thus, a negative R 2 indicates highly discordant measurements relative to the expected fold-changes predicted by the known mixratios and the measured fold-changes between samples A and B [23], as illustrated in Additional files 7 and 8. These discordant measurements reflect the difficulty the quantification tools have in making accurate fold-change predictions either between samples A and B (f i ) or between samples C and D (y i ) for small genes (Additional file 7).
To evaluate whether the deficiency in small gene quantification was specific to certain gene types, we computed the root mean square errors (RMSE) for the expected versus measured log2 fold-changes between samples C and D for genes grouped by their gene types. Consistent with the gene quantification results, Salmon and Kallisto showed slightly better fold-change estimation performance on ERCC spike-ins while alignment-based pipelines showed slightly better fold-change estimation on protein-coding genes ( Fig. 5a and Table 1). Furthermore, alignment-based and alignment-free pipelines produced similar results for the majority of small RNA types, such as snoR-NAs. For snoRNAs and other small non-coding RNAs (labeled as other sncRNA), Salmon and TGIRT-map recovered more genes than did HISAT2+featureCounts and Kallisto, but Salmon and TGIRT-map also showed similar RMSE values that were comparable to the RMSE values from HISAT2+featureCounts and Kallisto (Fig. 5a and Table 1). We also found that Salmon quantified ribosomal RNAs (rRNA) better than other pipelines in these rRNA-depleted libraries ( Table 1). The reason may be that Salmon handled multiply-mapped reads for these highly repetitive rRNA differently (comparing to alignment-based pipelines) or used bias corrections (comparing to Kallisto). However, for tRNAs, alignment-based pipelines detected higher numbers of tRNA isoacceptors compared to alignment-free pipelines and showed advantages in detecting differential tRNA expression, yielding lower RMSE values (Fig. 5b and Table 1).
Since we found performance differences in tRNA foldchange analysis between alignment-free and alignmentbased pipelines and a performance decrease in small Numbers of genes tested by DESeq2 are shown in parentheses. RNA-type annotations were generated from Ensembl [30]. "Other ncRNA" represents the following RNA types: sense intronic, 3prime overlapping ncRNA, processed transcript, sense overlapping, lincRNA, and all pseudogenes. "Other sncRNA" represents the following RNA types: misc RNA, snRNA, scaRNA, sRNA, scRNA. Mt represents all mitochondrial genes, including mitochondrial-encoded tRNAs. vaultRNA represents any Ensembl gene names with Vault or VTRNA gene quantification for alignment-free pipelines, we anticipated the comparison in tRNA fold-change predictions among the different pipelines would give us insights into the deficiency of alignment-free pipelines in small gene quantification. We hypothesized that the deficiency of alignment-free pipelines in small gene quantification was possibly due to the choice of a long k-mer size (31-mer) relative to the sizes of the small RNA transcripts, such as ∼ 75 nt for tRNAs. To investigate whether the choice of k-mer size had any effect on small RNA quantification, we tested Salmon with four different k-mer sizes, ranging from 11 to 31 (11,15,21, and 31; default is 31 for both Salmon and Kallisto). We found that at k = 21, higher R 2 values were observed for total RNA (Additional file 9).
Using tRNA as a model for comparisons in small gene quantifications, we also found a performance improvement (detected higher number of tRNA and lower RMSE) using k = 21. However, Salmon with this k-mer size still yielded two-fold higher RMSE relative to TGIRT-map for tRNA fold-change prediction (Additional files 2 and 9).
To gain a better understanding to the problem of small gene quantification in alignment-free pipelines, we further inspected the tRNAs that were undetected by both Salmon and Kallisto. Using TGIRT-map alignment results, we found that the read alignments mapped to these tRNAs displayed a high abundance of non-reference bases (Additional file 10). These non-reference bases may be caused by post-transcriptionally modified RNA bases that could introduce reverse-transcription errors by changing base-pairing interfaces [24,25], as it is known that tRNAs are post-transcriptionally-modified and have abundant base modifications [15,17,26]. Thus, we predict that these abundant non-reference bases in small RNAs, tRNA in this case, may have prevented k-mer-based algorithms from successfully counting these reads.
To further test our hypothesis of k-mer assignment being the problem for quantifying small genes, we took advantage of the alignment-based mode of Salmon and compared it to the k-mer-based algorithm implemented in Salmon [6]. Salmon has two different algorithm implementations that accept different input types, either raw sequencing reads (k-mer-based mode) or alignments of sequencing reads pre-assigned to transcripts (alignmentbased mode). While the k-mer-based mode performs read assignment and transcript quantification simultaneously, the alignment-based mode only perform transcript quantification on pre-aligned reads [6]. The difference between the two modes allowed us to isolate k-mer matching from transcript quantification in this comparison. Using Salmon alignment-based mode to perform gene quantification on Bowtie2 [20] alignments against the transcriptome, we found that Salmon alignment-based mode is more accurate in quantifying short genes than k-merbased mode. We observed a higher R 2 value for short gene quantification in Salmon alignment-based method than that in k-mer-based methods with any k value that we have tested (Additional file 9). In additional to short genes, using Salmon alignment-based mode also resulted in a more comparable performance in tRNA quantification relative to TGIRT-map (Additional file 9). These results further supported our hypothesis that k-mer matching algorithms were the sources of inaccurate small RNA quantification.

Discussion
We have performed an in-depth comparison among four RNA-seq pipelines, including two alignment-based and two alignment-free pipelines, to determine the relative performance of these tools for simultaneous quantification of long and small RNAs. The two alignment-based pipelines that we have tested were a widely-adopted alignand-count pipeline (HISAT2+featureCounts) [3,9] and a custom gene-counting procedure with multi-step iterative alignments (TGIRT-map; Additional file 1); the two tested alignment-free pipelines were k-mer counting tools with and without bias corrections (Salmon [6] and Kallisto [5]). We have tested these four pipelines on quantification of both long and small RNAs in a novel benchmarking dataset generated by a thermostable group II intron reverse transcriptase (TGIRT) [15]. This dataset is unique in that small non-coding RNAs are highly represented, while the quality of long RNA quantification is comparable to that of other widely-adopted RNA-seq methods [15]. Using this dataset, we have found that while all four pipelines perform similarly on long and highly-expressed RNAs, alignment-free tools have clear limitations for small and lowly-expressed RNA quantification.
For long gene quantification, we have found that all four pipelines quantify common gene targets (e.g. ERCC spikeins and protein-coding genes) with similar results, confirming a previously benchmark study on poly-A selected RNA-seq [13]. Generally, gene quantification tends to be more similar between pipelines of the same type (i.e., HISAT2+featureCounts and TGIRT-map, or Kallisto and Salmon) than pipelines of the other type (alignmentbased versus alignment-free). This result further supports a previous finding which showed more similar transcript isoforms quantifications were found among alignment-free or among alignment-based pipelines than among pipelines of the different types [14], suggesting that alignment-based pipelines may have somewhat different quantification biases than do the probabilistic models of alignment-free pipelines. Regardless of the difference in gene quantification, our results on differentially-expressed gene detection for long genes showed that all four pipelines performed comparably for ERCC spike-ins and protein-coding genes when compared to their expected fold-changes (ERCC spike-ins) [22] or MAQC TaqMan assay measurements (protein-coding genes) [18]. This result further confirmed previous benchmark studies where alignment-free and alignment-based tools gave similar results for differentially-expressed gene detection [13,14].
For total gene quantification and differential expression analysis, all tested pipelines generally have performed similarly, with most disagreements occurring between pipeline comparisons of different pipeline types (i.e. an alignment-based pipeline vs an alignment-free pipeline). In the analyses of genes that were inconsistently quantified among pipelines, our results have confirmed that both high gene expression and long gene length were crucial to consistent abundance estimation, as suggested previously [13,27]. Using fold-change analyses for comparisons in quantification accuracies, we have found that alignmentbased tools were more accurate in quantifying lowlyexpressed or small genes. This result likely reflects the nature of probabilistic assignments of k-mers and the inferences of TPM values. Although we have found that alignment-free pipelines were unreliable for quantifying extremely small RNAs (with shorter gene lengths) in total transcriptome analysis, alignment-free tools performed comparably to alignment-based tools for most of the small RNA types, such as snoRNAs or other sncR-NAs, in differential expression analyses. This disagreement between small non-coding RNAs and small mRNAs is possibly due to their differences in secondary structures and their different sensitivities to RNA fragmentation prior to RNA-seq library construction, such that overlyfragmented mRNA fragments may be too short to be usable in alignment-free quantifications.
Even though we have found that all pipelines performed similarly on the majority of small non-coding RNAs, our results have revealed that pipelines involving readalignment steps were superior to alignment-free tools in quantifying tRNAs specifically. We initially hypothesized the differences in performance were due to the choice of k-mer size in the alignment-free pipelines and we have found an improvement in small gene quantifications when a moderately smaller k was chosen (k = 21). However, we have also found that the performance of alignment-free quantification at this optimal k was still not comparable to alignment-based pipelines for small genes. Using tRNA as a model, our results suggested that this performance difference was likely due to a combinatorial effect of the choice of k-mer size and misincorporations introduced by post-transcriptionally-modified RNA bases during reverse transcriptions. We reason that a relatively large k-mer size and erroneous sequencing reads likely impede matching to the indexed transcriptome even when these sequencing reads are shredded into k-mers, since all k-mers inherit the same errors or mismatches. Since mismatches on sequencing reads can either be reverse transcription errors or biological variations on small RNAs [28], we expect the same phenomenon may occur if the small RNAs contain single-nucleotide polymorphisms or other biological variations.

Conclusions
In summary, we have shown that different types of pipelines performed similarly for common differentiallyexpressed gene targets such as protein-coding genes. However, accurate quantification of lowly-expressed or small RNA is difficult to achieve with alignment-free pipelines. Using tRNA as a model, we have also found that k-mer counting algorithms are not compatible for quantifying small genes with abundant biological variations regardless of the choice of k-mer size.

Methods
Data and reference preparation. Raw sequencing reads for TGIRT-seq data generated from the well-studied MAQC samples were downloaded from the NCBI Sequence Read Archive, accession number SRP066009 [15]. In brief, this dataset includes triplicates of four different human total RNA samples (called A-D) spiked with External RNA Control Consortium (ERCC) transcripts. ERCC spike-in transcripts are 92 in vitro transcripts with 250-2000-nt long [22]. Two ERCC spike-in mixes (mixes 1 and 2) with different concentrations for each transcript were spiked into RNA samples A (universal human RNA) and B (human total brain RNA), respectively, to provide known fold-changes of these spike-in transcripts between these two samples [18,22]. By design, these two spikein mixes establish four different differentially-expressed gene groups with relative ratios of 0.67:1, 1:1, 1:2, and 1:4 between samples A and B. Samples C and D are mixes of samples A and B in ratios of 3:1 and 1:3, respectively. For detailed library preparations, please refer to Nottingham, et al. [15].
Adaptor trimmings. Raw reads were adapter-and quality-trimmed by atropos v1.1.16 [31] via: HISAT2 mapping. Reads were aligned to the genome using HISAT2 v2.1.0 [9] via: HISAT2+featureCounts. Trimmed reads were aligned using HISAT2 mapping. Gene counts from HISAT2 mapping were generated by featureCounts v1.5.3 [3] via: TGIRT-map. Our custom pipleine TGIRT-map first filters out all trimmed reads that can be aligned to tRNA or rRNA references to reduce multiply-mapped reads (Additional file 1). The remaining unaligned reads are then sequentially aligned to the human genome using a splice-aware aligner (HISAT2) [9] and a sensitive local aligner (BOWTIE2) [20]. A single alignment locus is picked from the multiply-mapped fragments by asserting the assumptions that (A) RNA-seq fragments are small (smallest insert size), (B) ribosomal genes are abundant in the genome (ribosomal gene loci), and (C) fragments are unlikely to be originated from haplotype or patch sequences (as defined by Ensembl annotations) of the artificially assembled genome. Finally, gene quantification is done on genomic loci, except for tRNA and rRNA which require an additional step of re-aligning and counting.
Trimmed reads were first aligned to rRNA (GenBank accession numbers: X12811.1 and U13369.1) and tRNA sequences with BOWTIE2 v2.3.3.1 [20] via: We define the mapped reads as tRNA and rRNA premapped reads. Unmapped reads were then aligned to the genome with HISAT2 as described above in HISAT2 mapping [9]. Unaligned reads from HISAT2 mapping were then rescued by re-aligning locally to the genome with BOWTIE2 [20] via: All alignment pairs with > 10 nucleotides of soft-clipped bases on either reads or were discordant pairs were discarded. Multiply-mapped reads from the two genomemapping steps were grouped and a pair of best alignment was chosen by the following ordered criteria: the alignment pair (A) had the smallest insert size, (B) was mapped to ribosomal gene locus, or (C) was mapped to chromosome 1-23, X or Y. If none of these criteria was met by a single alignment pair, a random alignment pair was chosen from the multiply-mapped loci. These filtered alignments were then merged with the uniquely-mapped alignments. All reads that mapped to tRNA or rRNA loci were extracted by BEDtools v2.26 using pairtobed command with options -s -f 0.01 [32], combined with tRNA and rRNA pre-mapped reads and re-aligned to tRNA and rRNA references. Counts were generated from the aligned BAM file. Counts for each anticodon were aggregated. Other gene counts were calculated by converting the genome alignments to fragment coordinates in a BED file using the BEDtools bamtobed command and counted using the BEDtools coverage command [32]. TGIRT-map pipeline is available at: https://github. com/wckdouglas/tgirt_map.
We called Salmon with the following command-line arguments:  [21] was used for all differential expression analyses.
Because DESeq2 does not accept TPM values as input, transcript TPM values from Salmon and Kallisto were converted to gene-level counts using Tximport v1.4.0 [33] before any differential-expression analyses.

Predicted log 2 fold change between samples C and D.
Predicted fold changes between samples C and D were computed using the following equation as described in Su, et al. [19]: where C D indicates predicted fold change between samples C and D, A B indicates measured fold change between samples A and B, k 1 = 3z 3z+1 , and k 2 = z z+3 . In this work, z = 1.43 was used as suggested previously [19]. m 2 G26: N2-methylguanosine at position 26; I34: Inosine at position 34; m 1 A58: 1-methyladenosine at position 58; see also [15,17,36]). Unannotated mismatches from both ends can be untrimmed adapters or tRNA-precusor sequences. We observe that on these tRNAs, the errors induced by RNA-base-modification occurred at positions ∼ 20, 30, or 50 on ∼ 75 nucleotide long tRNA transcripts. Therefore, we expect that every k-mer originating from an erroneous sequencing read inherits at least one of these errors, and therefore k-mers cannot be matched accurately to the transcript database. The effect is worse when a large k-mer size is selected, such as the default k = 31. (PNG 337 kb) Abbreviations AUC: area under the curve; ERCC: External RNA controls consortium; MAQC: microarray quality control consortium; RMSE: root mean square error; ROC: receiver operating characteristic; TGIRT: Thermostable group II intron reverse transcriptase; TPM: transcripts per million