Fig. 3 | BMC Genomics

Fig. 3

From: A robust (re-)annotation approach to generate unbiased mapping references for RNA-seq-based analyses of differential expression across closely related species

Schematic representation of length bias in inter-species differential expression analysis and our reciprocal re-annotation strategy to correct it. a Length bias in the analysis of a non-differentially expressed gene. Coloured rectangles represent the part of the transcript that is included in the reference for the RNA-seq reads to map to, while unfilled rectangles are regions of the transcript that are omitted and to which RNA-seq reads cannot be mapped. Red “N”s represent sequencing errors that prevent the complete annotation of a transcript. Mapped reads are shown as thin black lines and the number below indicates the total number of reads mapped. (upper panel) If a transcript is shorter in one of the references than its orthologs, fewer reads will map to it at the same expression level. This can result in false positives in the analysis of differential expression. (lower panel) Our strategy to correct this bias is to shorten the orthologs in the other references to match the length of the shorter sequence. b Pipeline of the reciprocal transcriptome re-annotation method. Black numbers in white circles represent genome annotation steps using the “est2genome” command of Exonerate [58]. Grey numbers in grey circles represent conversion of the resulting GFF file into a new transcript set. Filled horizontal bars represent the annotated set of transcripts; non-filled horizontal bars at the start/end of the transcripts represent parts of the transcript that cannot be correctly annotated in one reference and are therefore eliminated from the transcript set. The boxes with a red frame indicate the transcript sets that will be used as references for RNA-seq read mapping (after confirmation by reciprocal BLAST). Step 1: the transcript set of the best-annotated genome (D. melanogaster in our study) is used to annotate one of the other genomes (D. simulans in our study) and generate a new transcript set for this species. Due to sequencing errors, some transcripts will be shorter. Step 2: the new transcript set from D. simulans is used to annotate the last genome (D. mauritiana in our study). The gene set generated contains shorter transcripts due to sequencing errors in D. mauritiana but also in D. simulans. Step 3: the transcript set from D. mauritiana is used to re-annotate the previously generated set from D. simulans to integrate the information from the D. mauritiana assembly. Step 4: the second transcript set from D. simulans is used to annotate the D. melanogaster set in order to integrate the information from D. simulans and D. mauritiana.
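
The length bias in panel a can be illustrated with a small numerical sketch. It is not the authors' code: the annotated lengths, the expression level, and the helper names `expected_reads` and `trim_to_shortest` are hypothetical, and the model simply assumes that the expected read count scales linearly with the annotated length of a transcript. The actual method trims orthologs to matched annotated regions rather than to a bare length, so this only shows the arithmetic behind the bias.

```python
def expected_reads(reads_per_kb, annotated_length_bp):
    """Expected read count under a simple model: reads scale with annotated length."""
    return reads_per_kb * annotated_length_bp / 1000


def trim_to_shortest(lengths):
    """Shorten every ortholog to the length of the shortest annotated copy."""
    shortest = min(lengths.values())
    return {species: shortest for species in lengths}


# Hypothetical annotated lengths (bp) of one orthologous transcript.
# The D. mauritiana copy is truncated, e.g. by sequencing errors in the assembly.
annotated = {
    "D. melanogaster": 2000,
    "D. simulans": 2000,
    "D. mauritiana": 1200,
}

# Hypothetical expression level, identical in all three species
# (the gene is NOT differentially expressed).
reads_per_kb = 50.0

# Upper panel: unequal reference lengths give unequal read counts.
before = {sp: expected_reads(reads_per_kb, n) for sp, n in annotated.items()}

# Lower panel: after trimming to the shortest copy, counts are comparable.
trimmed = trim_to_shortest(annotated)
after = {sp: expected_reads(reads_per_kb, n) for sp, n in trimmed.items()}

for sp in annotated:
    print(f"{sp}: before={before[sp]:.0f} reads, after={after[sp]:.0f} reads")
```

In this toy case the truncated reference collects fewer reads than its orthologs before trimming, which a differential expression test could misread as lower expression; after trimming, all three references yield the same expected count.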
