Proteotranscriptomics assisted gene annotation and spatial proteomics of Bombyx mori BmN4 cell line

Background: Identifying all coding regions in a genome is crucial for any molecular biology study, from single-gene cloning to genome-wide measurements using RNA-seq or mass spectrometry. While satisfactory annotation has been made feasible for well-studied model organisms through the sustained efforts of large consortia, for most systems such annotation is either absent or insufficiently precise.

Results: Combining in-depth transcriptome sequencing and high-resolution mass spectrometry, we use proteotranscriptomics to improve the annotation of protein-coding genes in the Bombyx mori cell line BmN4, which is an increasingly used tool for the analysis of piRNA biogenesis and function. Using this approach, we provide the exact coding sequence and protein-level evidence for more than 6200 genes. Furthermore, using spatial proteomics, we establish the subcellular localization of thousands of these proteins. We show that our approach outperforms current Bombyx mori annotation attempts in terms of accuracy and coverage.

Conclusions: We show that proteotranscriptomics is an efficient, cost-effective and accurate approach to improve previous annotations or generate new gene models. As this technique is based on de novo transcriptome assembly, it makes it possible to study any species, even in the absence of genome sequence information, where proteogenomics would be impossible.

Read representation statistics of the Trinity assembly

Supplemental Figure S1: Distribution of transcript lengths across RNA expression level bins. Trinity transcript lengths (nt) peak at around the 80th percentile of expression levels of all individual transcripts (indicated by the blue dotted line).

Supplemental Figure S2: Comparison of TransRate assembly scores to other publicly available assemblies. TransRate assembly scores of 255 assemblies analyzed by Smith-Unna et al. (2016). Blue dotted horizontal lines mark the 70th percentile of the assembly or optimal assembly scores of all assemblies analyzed. Red dotted horizontal lines indicate the assembly or optimal assembly score of our Trinity assembly.

Supplemental Figure S3: Overall, MS-detected transcripts show improved assembly features. (A) Bar plot of hit percentage coverage bins for all TransDecoder predictions and for predictions detected by mass spectrometry (MS), compared to current Bombyx mori annotations. (B) Box plot of transcript lengths of all TransDecoder predictions (grey) and of predictions detected by MS (red). (C) Bar plot of mass spectrometric identification enrichment metrics across TransRate score bins. Contigs with high TransRate scores also tend to be detected by MS.

[Figure legend: shorter than SilkBase; longer than SilkBase]

Supplemental Figure S6: Intra-cluster distances of SOM clusters can be used to filter out clusters that have high variability within the cluster (see Figure 4A). Box plot summarizing the intra-cluster distances between fractionation profiles. For final analyses, only clusters with mean intra-cluster distances (red dots and value shown) below the 75th percentile of all intra-cluster distances were kept (blue boxes). All others were combined into one cluster of uncategorized profiles (colored in gray in Fig. 4A).
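The filtering rule described for Figure S6 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Euclidean distance metric, the pooling of all pairwise distances to obtain the 75th percentile, the treatment of single-profile clusters, and all names (`filter_clusters`, `UNCATEGORIZED`, the toy profiles) are assumptions made here for illustration only.

```python
from itertools import combinations
from math import dist
from statistics import mean, quantiles

UNCATEGORIZED = "uncategorized"  # hypothetical label for the merged cluster


def pairwise_distances(profiles):
    """Euclidean distances between all pairs of profiles (metric is an assumption)."""
    return [dist(a, b) for a, b in combinations(profiles, 2)]


def filter_clusters(clusters):
    """Keep clusters whose mean intra-cluster distance is below the pooled 75th percentile.

    clusters: dict mapping a cluster id to a list of equal-length profiles.
    Returns the filtered dict and the threshold that was applied.
    """
    distances = {cid: pairwise_distances(p) for cid, p in clusters.items()}
    pooled = [d for ds in distances.values() for d in ds]
    # Third quartile cut point, i.e. the 75th percentile of all pooled distances.
    threshold = quantiles(pooled, n=4, method="inclusive")[2]

    kept, merged = {}, []
    for cid, profiles in clusters.items():
        ds = distances[cid]
        # Assumption: clusters with a single profile have no pairwise
        # distance and are treated as uncategorized.
        if ds and mean(ds) < threshold:
            kept[cid] = profiles
        else:
            merged.extend(profiles)
    if merged:
        kept[UNCATEGORIZED] = merged
    return kept, threshold


# Toy fractionation profiles: two tight clusters and one highly variable cluster.
example = {
    "A": [(0.0, 0.0, 1.0), (0.0, 0.1, 1.0), (0.1, 0.0, 1.0)],
    "B": [(1.0, 0.0, 0.0), (1.0, 0.1, 0.0), (0.9, 0.0, 0.0)],
    "C": [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (10.0, 0.0, 0.0)],
}
kept, threshold = filter_clusters(example)
print(sorted(kept), round(threshold, 2))
```

In this toy input, the variable cluster "C" exceeds the threshold and its profiles are moved into the uncategorized group, while "A" and "B" are retained unchanged.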

[Plot axis labels: ExN50 (median length in bases of transcripts per expression percentile bin) versus expression percentile]