DNA-protein alignment attains similar performance to using a transcriptome reference
We first demonstrate, under ideal conditions, the soundness of our idea of aligning RNA-seq reads to a proteome reference for differential gene expression analysis, by comparing our performance to that of the traditional case of using an established transcriptome reference.
We aligned the fruit fly simulated RNA-seq reads (described earlier in the Implementation section) to the UniProt D. melanogaster proteome UP0000 00803, which contains 1 representative protein sequence per gene, and fed the counts obtained by our method to DESeq2 [31] for differential analysis. To serve as baseline for comparison, we ran a typical pipeline consisting of Bowtie2 [39] for read alignment to the D. melanogaster transcriptome, followed by RSEM for transcript quantification, tximport [40] for gene-level aggregation of counts, and finally DESeq2 [31] for differential analysis at the gene level. Details of the two pipelines are provided in the Supplementary Material.
We evaluated the two approaches based on their recall and precision in predicting differentially expressed (DE) genes. Recall is the proportion of actual DE genes that were correctly predicted to be DE, and precision is the proportion of predicted DE genes that were actually DE. We require that the direction of fold-change (up/down) match between the ground truth and prediction to be classified as a correct prediction. To compute recall and precision of our method, we mapped our predictions from the set of proteins to the set of genes using the mapping between UniProt protein ID and FlyBase gene ID obtained from Ensembl.
Figure 1(left) shows the Precision-Recall curves obtained by varying the false discovery rate (FDR) threshold of DESeq2, and Fig. 1(right) shows the distribution of the estimated log fold-change at an FDR threshold of 0.1. There is almost no difference between the two approaches in the ability to detect DE genes and in the trend of under- or over-estimation of true fold change. This result demonstrates that for differential gene expression analysis, there is no performance degradation when aligning RNA-seq reads to a proteome reference, even though we are effectively only using reads from the coding region of transcripts.
In the absence of a close reference, our method outperforms a typical assembly-based approach
Next we simulated the scenario faced in the case of non-model organisms by pretending that the D. melanogaster reference sequences are not available, and that the closest species with a well-annotated reference proteome is a distant relative, the mosquito Anopheles gambiae. The two species are of the same order Diptera with their lineages thought to have separated roughly 250 million years ago [42]. The evaluation process described below uses the mosquito proteome as reference; but to calibrate the effect of the degree of evolutionary and sequence divergence, we repeated the process with reference proteomes of closer relatives of D. melanogaster: D. ananassae and D. grimshawi.
We compared our performance to that of a typical assembly-based pipeline consisting of: Trinity [43] for de-novo transcriptome assembly, followed by Bowtie2 for mapping the reads to the assembly, RSEM for counting, tximport for gene-level aggregation using the gene-to-transcript mapping provided by Trinity, and finally DESeq2 for differential analysis. We used the Dammit pipeline [44] to annotate the assembly against the mosquito proteome. Details of the two pipelines are provided in the Supplementary Material.
To facilitate the comparison, we obtained a pre-computed orthology map between A. gambiae and D. melanogaster, from the website of InParanoid [45]. Consider a D. melanogaster protein-coding gene g, and let Fg be the set of protein products of g. For a D. melanogaster protein f, let Of be the set of mosquito proteins in the same ortholog group as f. We associate with g the set Mg of mosquito proteins m such that m∈Of for some f∈Fg.
We computed recall and precision of our method as follows. An actual up-regulated (down-regulated) D. melanogaster DE gene g was defined as correctly predicted if there was at least one protein in Mg that was predicted to be up-regulated (down-regulated). Recall was defined as the number of correctly predicted DE genes divided by the number of actual DE genes. Precision was defined as the number of correctly predicted DE genes divided by the number of genes g for which at least one protein in Mg was predicted to be DE.
We computed recall and precision of the assembly-based approach as follows. For a Trinity gene t, let Dt be the set of mosquito proteins that Dammit assigned to the isoforms of t (if there were multiple alignments for an isoform, we kept only one with the lowest E-value). With an actual D. melanogaster gene g, we associated a set Tg of Trinity genes, where t∈Tg if Dt∩Mg≠∅. An actual up-regulated (down-regulated) DE D. melanogaster gene g was defined to be correctly predicted if there was at least one up-regulated (down-regulated) Trinity gene in Tg. Recall was defined as the number of correctly predicted DE genes divided by the number of actual DE genes. Precision was defined as the number of correctly predicted genes divided by the number of genes g for which at least one gene in Tg was predicted to be DE.
The definitions of recall and precision are necessarily slightly different for the two approaches. Our hope is that they convey a similar meaning – that an actual D. melanogaster DE gene is represented by a set of orthologous mosquito proteins (in our method) or by a set of Trinity genes for which there was an annotation to an orthologous mosquito protein (in the assembly-based method), and that the gene is considered to be correctly predicted if at least one of the representatives are predicted to be DE.
Figure 2 shows the precision and recall of our method and the assembly-based approach when using the mosquito reference proteome. It also contains the PR-curves when using the D. ananassae and D. grimshawi reference proteomes. The curves were obtained by varying the FDR threshold of DESeq2. When using the two Drosophila reference proteomes, the performance of our method varied slightly, but in both cases, outperformed the assembly-based approach. When using the A. agambiae reference, recall was lower for both methods, mainly because the orthology map contains only 60% of the fruit fly proteins – there were 7341 ortholog clusters involving 7863 fruit fly proteins and 8090 mosquito proteins.
Overall, across any setting of FDR threshold or any choice of a reference proteome, our approach outperformed the assembly-based approach.
So far, to compute recall and precision of the assembly-based approach, we used all the alignments reported by the Dammit pipeline, even including many short local alignments. It is not uncommon in practice to filter short alignments. We repeated the analysis by keeping only those alignments predicted by the Dammit pipeline that covered at least 50% of a contig. The precision-recall curves for this cases is shown in Fig. 3, which shows a significant drop in recall of the assembly-based approach.
With real data too, our method outperforms the assembly-based approach
We applied our pipeline and the assembly-based approach to a recently published real RNA-seq dataset ArrayEx- press E-MTAB-8090 ERR3393437–42. The dataset contains RNA-seq reads of the hemocyte tissue of D. melanogaster samples with and without injury, with 3 replicates for each condition. After cleaning and trimming low-quality reads using fastp [46], there were roughly 110 million pairs of reads.
Continuing with the assumption that no reference sequences are available for D. melanogaster, we applied our pipeline and the assembly-based pipeline as in the previous section using the mosquito and two Drosophila reference proteomes. Since we do not know the ground truth for this dataset, to serve as baseline, we additionally ran the Bowtie2-RSEM-DESeq2 pipeline using D. melanogaster reference transcriptome. To be able to compare the DE call sets, we mapped our predicted DE genes to the D. melanogaster gene names using the same Inparanoid orthology maps as before.
Intersection of the sets of DE genes obtained from the two approaches and the baseline are shown in Fig. 4. At FDR threshold of 0.01, there were 104 genes identified as DE by the baseline method. Based on the observation from Fig. 1 that the Bowtie2-RSEM-DESeq2 pipeline has high precision at FDR 0.01, let us assume that all of these baseline calls are correct and that they constitute the empirical ground truth. At the same FDR threshold, when using the D. ananassae reference proteome, our method was slightly more sensitive than the assembly-based approach, being able to predict 68 out of the 104 baseline DE genes, compared to 58 by the assembly-based approach. Our method was also slightly more precise, with the 68 calls corresponding to roughly 78% of the calls, compared to 74% for the assembly-based method. This is in line with observation from Fig. 2 that our method has slightly better sensitivity and precision than the assembly-based approach. Similar results were obtained when using the D. grimshawi reference proteome.
When using the A. gambiae proteome, there is a significant decrease in the size of the overlaps between the baseline and the two approaches, consistent with the drop in sensitivity observed in Fig. 2. The two approaches are similarly sensitive (25 calls by our approach vs. 27 by assembly-based) while our method is more precise (25 out of 34 calls by our approach vs. 27 out of 46 calls by assembly-based).
Avoiding assembly dramatically reduces running time and memory usage
All the experiments above were carried out on a system with Intel Xeon Silver 4114 Processor with 10 cores and 20 threads. For the real dataset E-MTAB-8090 which contains roughly 110 million pairs of cleaned reads, de-novo assembly alone took more than 24 hours. In contrast, DNA-protein alignment, which is the most compute-intensive part of our pipeline, takes less than 20 minutes per sample containing roughly 20 million pairs of reads, using 20 threads. While the de-novo assembly had a massive peak memory usage of ∼ 65 Gbytes, the memory requirement of our method is dominated by the size of the proteome index, which was just ∼ 33 Mbytes for the mosquito proteome. In general, the index size is roughly 5×n bytes, where n is the length of the proteome.