Skip to main content

A homology-guided, genome-based proteome for improved proteomics in the alloploid Nicotiana benthamiana



Nicotiana benthamiana is an important model organism of the Solanaceae (Nightshade) family. Several draft assemblies of the N. benthamiana genome have been generated, but many of the gene-models in these draft assemblies appear incorrect.


Here we present an improved proteome based on the Niben1.0.1 draft genome assembly guided by gene models from other Nicotiana species. Due to the fragmented nature of the Niben1.0.1 draft genome, many protein-encoding genes are missing or partial. We complement these missing proteins by similarly annotating other draft genome assemblies. This approach overcomes problems caused by mis-annotated exon-intron boundaries and mis-assigned short read transcripts to homeologs in polyploid genomes. With an estimated 98.1% completeness; only 53,411 protein-encoding genes; and improved protein lengths and functional annotations, this new predicted proteome is better in assigning spectra than the preceding proteome annotations. This dataset is more sensitive and accurate in proteomics applications, clarifying the detection by activity-based proteomics of proteins that were previously predicted to be inactive. Phylogenetic analysis of the subtilase family of hydrolases reveal inactivation of likely homeologs, associated with a contraction of the functional genome in this alloploid plant species. Finally, we use this new proteome annotation to characterize the extracellular proteome as compared to a total leaf proteome, which highlights the enrichment of hydrolases in the apoplast.


This proteome annotation provides the community working with Nicotiana benthamiana with an important new resource for functional proteomics.


Nicotiana benthamiana has risen to prominence as a model organism for several major reasons. First, N. benthamiana is highly susceptible to viruses, resulting in highly efficient virus-induced gene-silencing (VIGS) for rapid reverse genetic screens [42]. This hypersusceptibility to viruses is due to an ancient disruptive mutation in the RNA-dependent RNA polymerase 1 gene (Rdr1), present in the lineage of N. benthamiana which is used in laboratories around the world [1]. Reverse genetics using N. benthamiana have confirmed many genes important for disease resistance [43, 58]. Second, N. benthamiana is highly amenable to the generation of stable transgenic lines [6, 48] and to transient expression of transgenes [17]. This easy manipulation has facilitated rapid forward genetic screens and has established N. benthamiana as the plant bioreactor of choice for the production of biopharmaceuticals [50]. Finally, N. benthamiana is a member of the Solanaceae (Nightshade) family which includes important crops such as potato (Solanum tuberosum), tomato (Solanum lycopersicum), eggplant (Solanum melongena), and pepper (Capsicum ssp.), as well as tobacco (Nicotiana tabacum) and petunia (Petunia ssp.).

N. benthamiana belongs to the Suaveolentes section of the Nicotiana genus, and has an ancient allopolyploid origin (~6Mya, [5]) accompanied by chromosomal re-arrangements resulting in a complex genome with 19 chromosomes per haploid genome, a reduced number when compared to the ancestral allotetraploid 24 chromosomes. The estimated haploid genome size is ~ 3.1Gb [17, 27, 56]. There are four independent draft assemblies of the N. benthamiana genome [3, 32], as well as a de-novo transcriptome generated from short-read RNAseq [33]. These datasets have greatly facilitated research in N. benthamiana, allowing for efficient prediction of off-targets of VIGS [14] and genome editing using CRISPR/Cas9 [29], as well as RNAseq and proteomics studies [18].

During our research using proteomics and reverse genetic approaches, we realized that many of the gene models in these draft assemblies are incorrect, and that putative truncated gene-products are often annotated as functional protein-encoding genes. This is at least partly because these draft assemblies are highly fragmented and because N. benthamiana is an paleotetraploid. Mapping short read sequences onto polyploid draft genomes frequently results in mis-assigned transcripts and mis-annotated gene models [53]. The de-novo transcriptome assembly also has a high proportion of chimeric transcripts. Because of incorrect annotations, extensive bioinformatics analysis is required to select target genes for reverse genetic approaches such as gene silencing and editing, or for phylogenetic analysis of gene families.

We realized that the genome annotations of several other species in the Nicotiana genus (N. sylvestris, N. tomentosiformis, N. tabacum, and N. attenuata) generated using the NCBI Eukaryotic Genome Annotation Pipeline provided more accurate gene models, and we therefore decided to re-annotate the proteome of available N. benthamiana draft genomes using these Nicotiana gene models as a template. We generated a full proteome annotation on the Niben1.0.1 draft genome assembly and extracted proteins missing from this annotation from the other draft genome assemblies. Here we show that this dataset explains activity-based proteomics datasets better, is more accurate and sensitive for proteomics, and that this annotation greatly facilitates phylogenetic analysis of gene families.

Results and discussion

Generation of the genome-based proteomes in N. benthamiana genome assemblies

The observation that predicted proteins from other Nicotiana species were more accurate for several gene families inspired us to use the corresponding coding sequences to re-annotate the proteome of N. benthamiana. We used Scipio [23] to transfer the gene models from Nicotiana species to N. benthamiana. Scipio refines the transcription start-site, exon-intron boundaries, and the stop-codon positions of protein-encoding sequences aligned to the genome using BLAT [23]. When input protein-encoding sequences are well-annotated, this method is more accurate and sensitive than other gene prediction methods [22].

To generate the input dataset, we selected the predicted protein sequences from recently sequenced Nicotiana species [44, 45, 59] generated using the NCBI Eukaryotic Genome Annotation Pipeline (Fig. 1) and used CD-HIT at a 95% identity cut-off to reduce the redundancy and remove partial sequences (Fig. 1, Step 1). The resulting dataset, Nicotiana_db95, contains 85,453 protein-encoding sequences from various Nicotiana species. We then used this Nicotiana_db95 dataset as an input dataset to annotate gene-models in the Niben1.0.1 draft genome assembly using Scipio (Fig. 1, Step 2 Additional files 5, 8 and 9).

Fig. 1
figure 1

Bioinformatics pipeline for improved Nicotiana benthamiana proteome annotation. The predicted proteins of Nicotiana species generated by the NCBI Eukaryotic Genome Annotation Pipeline were retrieved from Genbank and clustered at 95% identity threshold to reduce redundancy (Step 1), and used to annotate the Niben1.0.1 genome assembly (Step 2). Only those proteins with an alignment coverage ≥60% to the Nicotiana predicted proteins as determined by BLASTP were retained (Step 3) to produce the NbD core dataset. Similarly, the other draft genome assemblies were annotated (Step 4), and only those proteins with an alignment coverage ≥90% to the Nicotiana predicted proteins as determined by BLASTP were retained (Step 5). CD-HIT-2D was used at 100% sequence identity to retain proteins missing in NbD dataset (Step 6), resulting in supplemental dataset NbE. NbD and NbE can be combined (NbDE) to maximise the spectra annotation for proteomics experiments

The Scipio-generated dataset contains a majority of full-length protein-encoding sequences, but also partial sequences caused by fragmentation in the Niben1.0.1 dataset. In addition, we noticed many incomplete gene products caused by premature stop-codons and frameshift mutations. To remove these partial sequences from the protein dataset, we used BLASTP against the NCBI Nicotiana reference proteomes and retained only protein-encoding sequences with ≥60% coverage. This resulted in 53,411 protein-encoding gene annotations (NbD dataset, Fig. 1, Step 3).

To complement the NbD dataset, we used the same approach to re-annotate the Nbv0.5, Niben0.4.4 and Nbv0.3 draft assemblies using Scipio and the Nicotiana_db95 dataset as template (Step 4). We retained proteins with ≥90% BLASTP coverage with the Nicotiana reference proteomes in NCBI (Fig. 1, Step 5) and we used CD-HIT-2D at a 100% threshold to select coding sequences not present in the NbD dataset (Step 6). The resulting supplemental NbE dataset contains 21,391 additional protein-encoding sequences (Fig. 1, Additional files 6 and 7) . Besides new gene models, this NbE dataset contains sequences that are nearly identical to those in the NbD dataset. These proteins may be derived from homeologs caused by polyploidisation, but can also be caused by sequence polymorphisms between different sequenced plants and by sequencing and assembly errors. For protein annotation in mass-spectrometry (MS) experiments, however, the combination of the NbD and NbE datasets (the NbDE dataset, Additional files 10, 11, 12, 13 and 14) will maximise the annotation of the spectra.

The NbDE proteome is more complete, sensitive, accurate, and relatively small

We next compared the predicted proteome to the published predicted proteomes. The NbD and NbDE datasets have relatively few entries when compared to the other datasets (Fig. 2a). The preceding datasets include the predicted proteomes from the Niben0.4.4 (76,379 entries, [3])) and Niben1.0.1 (57,140 entries, [3])). We also included a previously described curated dataset in which gene-models from Niben1.0.1 were corrected using RNAseq reads (74,093 entries, [18])), and the predicted proteome (Nbv5.1*) derived from a de-novo transcriptome (191,039 entries, [33]). We also included the Nicotiana_db95 dataset (85,453 entries) in this comparison.

Fig. 2
figure 2

Increased lengths, coverage and annotation of N. benthamiana proteins. a NbD/NbDE datasets have relatively few entries when compared to preceding datasets. b NbD/NbDE datasets contain nearly all benchmark genes as full-length genes, according to Benchmarking Universal Single-Copy Orthologs (BUSCO) of embryophyta. c The NbD/NbDE datasets have higher number of annotated PFAM domains. d NbD/NbDE datasets have relatively longer protein lengths. Violin and boxplot graph of log10 protein length distribution of each dataset. Jittered dots show the raw underlying data. e NbD/NbDE annotated proteins have a higher percentage coverage to the tomato proteins as determined by BLASTP

To determine the completeness of the NbD and NbDE datasets, we analysed the presence of BUSCOs (Benchmarking Universal Single-Copy Orthologs [47, 57]). We used the embryophyta BUSCO set, which contains 1440 highly conserved plant genes that are predominantly found as single-copy genes [47]. Nicotiana_db95 has one fragmented and nine missing BUSCOs, indicating that this dataset contains 99.3% of the Nicotiana genes (Fig. 2a). In the NbD dataset, 98.1% of the BUSCO proteins were identified as full-length proteins, and only 0.6% of the BUSCOs is fragmented and 1.3% is missing (Fig. 2a). The 98.1% full length BUSCO proteins in the NbD dataset consists of 45% single-copy genes and 53.1% duplicated genes, consistent with the allotetraploid nature of N. benthamiana. The combined NbDE dataset contains 99.0% of the BUSCOs and contains 71.1% duplicated full length BUSCOs (Fig. 2a), but part of this increased duplication may be due to small sequence variations between the different genome assemblies rather than from genes missing in the NbD annotation. The BUSCO scores of the NbD and NbDE datasets are superior when compared to previously published annotations. The best previously predicted proteome is from dataset Nbv5.1, which has 96.9% of the BUSCOs as full-length proteins, whereas 0.8% is missing and 2.2% is incomplete (Fig. 2a). However, Nbv5.1 contains nearly five times more protein coding sequences than NbD and 69.7% of BUSCOs are duplicated.

Second, we investigated the number of unique PFAM identifiers found with each entry in each proteome, because mis-annotated sequences and fragmented gene products are less likely to receive a PFAM annotation [53]. Indeed, over 80% of the proteins in the NbD and NbDE datasets get at least one PFAM identifier whereas this match is lower with the preceding datasets, indicating that proteins in NbD and NbDE are better annotated (Fig. 2b).

Third, we mapped the length distributions of the predicted proteins in comparison with the Nicotiana_db95 dataset. The proteins in the NbD and NbDE datasets are significantly longer than those in the previously predicted proteomes except for the Nbv5.1 primary + alternate proteome (Fig. 2c). This is probably because the Niben0.4.4, Niben1.0.1 and curated datasets contain many partial genes, while the Nbv5.1 primary + alternative proteome has a high proportion of chimeric sequences, which are biased towards long transcripts due to the assembly of short sequencing reads.

Fourth, to verify that the increased protein lengths is not due to incorrect annotation, we performed BLASTP searches against the Solanum lycopersicum RefSeq proteins (annotation v103), using the N. benthamiana annotated proteins as the query and the S. lycopersicum RefSeq proteins as the target. 70% of the NbD and NbDE proteins show a full coverage of tomato proteins, whereas in the other datasets less than 60% of the proteins show full coverage (Fig. 2d). This indicates that in addition to a more complete annotation with fewer missing genes as determined by BUSCO, the NbD and NbDE datasets are also more accurate in the annotation of gene models.

Finally, since phylogenetic analysis of gene families in closely related species often relies on gene-annotations, we compared our NbD proteome annotation against the predicted proteomes of Solanaceae species for which genomes have been sequenced (Additional file 1: Figure S1). Our NbD proteome compare well to the predicted proteomes of other sequences Solanaceae species generated by the NCBI Eukaryotic Genome Annotation Pipeline. In addition, since the predicted proteomes of some of these species were not annotated using the NCBI pipeline, they miss a relatively high proportion of genes (up to 28.5% of genes missing or fragmented). Therefore, care must be taken to not over-interpret results derived from phylogenetic analysis using these datasets.

Improved annotation of spectra in proteomics experiments

We next tested the annotation of MS spectra from four biological replicates of total leaf extracts using the different datasets. Both the NbD and NbDE datasets outperform the preceding datasets (Fig. 3a). This is notable because the NbD dataset has the fewest entries and yet works well in spectra annotation. More specifically, the NbD dataset also identifies the highest number of unique peptides per protein, consistent with having the fewest entries and increased length (Fig. 3b). These metrics show that the new NbD and NbDE datasets are more sensitive and accurate for proteomics than the currently available datasets.

Fig. 3
figure 3

NbD/NbDE datasets outperform the annotation of spectra in proteomics. a Percentage of annotated MS/MS spectra in total leaf extract samples. b Average number of unique peptides assigned per protein in the different datasets. a and b Means and standard error of the mean are shown for four biological replicates of total leaf extracts. c Mis-annotations of papain-like Cys proteases (PLCPs) detected by activity-based protein profiling [54]. Leaf extracts were labelled with activity-based probes for PLCPs and labelled proteins were purified and analysed by MS. Shown are the protein annotations found in the NbDE (top) Niben1.0.1 (middle) and curated datasets (bottom), highlighting mis-annotations (red) caused by partial transcripts, mis-annotation of exon-intron boundaries, and mis-assemblies

We independently validated the NbD and NbDE annotations on an independent dataset where we re-analysed a previously published apoplastic proteome of agro-infiltrated N. benthamiana as compared to non-infiltrated N. benthamiana (PRIDE repository PXD006708, [18]). Both the NbD and NbDE annotations are more sensitive and accurate than the curated dataset on this experiment (18,059 and 18,352 vs 17,954 peptides detected; 21.85 ± 3.0 and 22.2% ±3.1% vs 21.7 ± 3.0% spectra identified). This independently confirms the high performance of the NbD and NbDE datasets.

To confirm that our new annotations are correct and biologically relevant we examined the annotation of 14 papain-like Cys proteases (PLCPs) that we identified recently from agroinfiltrated N. benthamiana leaves [54]. These proteases were active proteins because they reacted with an activity-based probe (DCG-04, a biotinylated PLCP inhibitor) to facilitate the purification and detection of these proteins. Of the 14 detected PLCPs, eight were identical between the datasets. However, six PLCPs were mis-annotated in the Niben1.0.1 and curated datasets (Fig. 3c). Two mis-annotated PLCPs lacked catalytic residues; four PLCPs lacked a Cys residue that is crucial for PLCP stability; two PLCPs lacked large parts of the sequence, one PLCP (ALP) carried a C-terminal tandem fusion, and one lacked the signal peptide required for targeting to the endomembrane system. Importantly, none of these mis-annotated proteins could have been active proteases, confirming that their annotation is incorrect. By contrast, the NbDE annotation would predict functional, active proteases that could react with activity-based probes and hence explain the purification of these proteins in activity profiling experiments.

Two examples of improved subtilase annotations

To further illustrate gene annotations in the different datasets and show the relevance of adding missing genes from different genome assemblies with the NbE dataset, we analysed the gene-models of two subtilases which are missing in the Niben1.0.1 draft genome assembly. Both subtilases are encoded by a single-exon gene-models in the NbDE dataset (Fig. 4). The sequence of NbE05066806.1 is present on three non-overlapping contigs in the Niben1.0.1 dataset, one of which contains three single nucleotide polymorphisms (SNPs, Fig. 4a). Two of these SNPs are also present in the Niben0.4.4 dataset, but this annotation contains an additional 132 bp insertion which is annotated as coding. The coding sequence of this subtilase is incomplete in the Nbv0.5 dataset and we identified 13 sequences corresponding to partial or chimeric variants of this subtilase in the Nb5.1 primary + alternate predicted proteome.

Fig. 4
figure 4

Examples of subtilase mis-annotations in the different genome assemblies. a Gene-models corresponding to subtilase NbE05066806 and the corresponding annotations in the various datasets. This subtilase gene is fragmented in Niben1.0.1; truncated in Nbv0.5; and carries two SNPs and an extra sequence in Niben0.4.4. b Gene-models corresponding to subtilase NbE03059263 and the corresponding annotations in the various datasets. This subtilase has an inactivated homeolog (dark grey) that was not retained in the NbDE dataset as it encodes a protein with < 60% coverage because it contains premature stop codons. The truncated proteins caused mis-assembly in the Niben1.0.1 dataset, resulting in a hybrid sequence. Mis-annotated exon-intron boundaries also effected gene models in Niben1.0.1, Niben0.4.4 and Nbv5.1. Peptides matched to the different gene models are indicated below the gene models

A more complicated situation exists for subtilase NbE03059263, which we detected in the apoplast by proteomics. This subtilase has a close paralog which contains several inactivating mutations which was not retained in the NbDE annotation because it contains a premature stop codon and the coverage of the encoded protein is therefore below 60% (Fig. 4b). Consequently, only NbE03059263 is detected with 17 unique peptides. In Niben1.0.1, however, sequences of these two homologous subtilases are misassembled, resulting in hybrids. In addition, several exon-intron boundaries have been misannotated, so the predicted proteins lack several crucial sequences. Consequently, only 12 of the peptides match the annotated subtilases in the Niben1.0.1 dataset, of which seven are unique. In Niben0.4.4, one subtilase gene carries a 133 bp insertion that is annotated as an intron, but this insertion causes an amino acid substitution that prevents a match with one of the peptides. The paralog in Niben0.4.4 is mis-annotated as protein-encoding and consequently only 10 of the 16 peptides are unique. Finally, this subtilase is represented by four transcripts in the Nbv5.1 dataset, two of which encode a C-terminally truncated protein that lacks a match with three peptides, and none of the peptides are unique. Transcript sequences encoding the truncated and inactivated proteins are absent from this Nbv5.1 dataset. These two examples illustrate the different issues that occur with the gene annotations in the different datasets.

Inactivated subtilases are consistent with a contracting functional genome

To study inactivation of homeologs further, we investigated the subtilase-encoding gene family. Several subtilases are implicated in immunity, most notably the tomato P69 clade of subtilases [51]. Our NbDE dataset contains 64 complete subtilase genes, and one partial gene. Sixteen of the subtilase genes are likely duplicated, and an additional 18 are inactive subtilases with > 60% coverage to the NCBI Nicotiana RefSeq proteins. Three subtilases are missing in the NbD dataset, highlighting the utility of combining the NbD and NbE datasets to increase coverage of the proteome. Remarkably, no SBT3 clade family members were identified in N. benthamiana (Table 1, Additional file 1: Figures S2 and S3). Three N. benthamiana subtilases may possess phytaspase activity based on the presence of an aspartic acid residue at the pro-domain junction as well as a histidine residue in the S1 pocket which is thought to bind to P1 aspartic acid (Additional file 1: Figures S2 and S3, [38]).

Table 1 Subtilase gene number according to familya

In comparison, the Niben1.0.1 genome annotation predicts 103 different subtilase gene products. However, 38 of these genes encode inactive subtilases and 16 subtilase genes are absent from Niben1.0.1 (Additional file 2: Table S1 for a comprehensive comparison). Importantly, none of the remaining 49 subtilase-encoding genes in the Niben1.0.1 dataset are correctly annotated to encode a functional product. Furthermore, the predicted proteome from the Nbv5.1 primary + alternate transcriptome contains more than 400 subtilase gene products, which can be explained by chimeric sequences.

By searching the Niben1.0.1 and Niben0.4.4 genome assemblies using BLASTN, we identified 43 subtilase genes with internal stop-codons and/or frame-shift mutations and are therefore likely non-coding. Phylogenetic analysis shows that probable homeologs are often inactivated (Fig. 5). This pattern of inactivation in the subtilase gene family is consistent with a contracting functional genome upon polyploidization, where for many functional protein-encoding genes there is a corresponding homeolog that is inactivated (Fig. 5, Table 1, and Additional file 1: Figure S2). In conclusion, the new genome annotation represents a significant improvement over previous annotations and facilitates more accurate and meaningful phylogenetic analysis of gene families in N. benthamiana.

Fig. 5
figure 5

Birth and death of subtilase paralogs in N. benthamiana. The evolutionary history of the subtilase gene family was inferred by using the Maximum Likelihood method based on the Whelan and Goldman model. The bootstrap consensus tree inferred from 500 replicates is taken to represent the evolutionary history of the taxa analysed. Non-functional subtilases are indicated in grey. Subtilases identified in apoplastic fluid (AF) and/or total extract (TE) are indicated with yellow and green dots, respectively. Naming of subtilase clades is according to [51]. Additional file 1: Figure S2 includes the individual names

Hydrolases are enriched in the leaf secretome of N. benthamiana

Finally, we used the NbDE dataset to analyse the extracellular protein repertoire (secretome) of the N. benthamiana apoplast. The plant apoplast is the primary interface in plant-pathogen interactions [12, 31] and apoplastic proteins include many enzymes that may act in plant-pathogen interactions. To identify apoplastic proteins, we performed MS analysis of apoplastic fluids (AFs) and total extracts (TEs) isolated from the same leaves in four biological replicates. Annotation using the NbDE dataset show that these AF and TE proteomes are clearly distinct (Fig. 6a). We assigned 510 proteins to the apoplast because they were only detected in AF samples or highly enriched in the AF samples over TE samples (log2FC ≥ 1.5, p ≤ 0.01, BH-adjusted moderated t-test, Fig. 6b). Similarly, we assigned 1042 proteins as intracellular because they were only detected in the TE samples or enriched in TE samples over AF samples (log2FC ≤ − 1.5, p ≤ 0.01, Fig. 6b). The remaining 833 proteins were considered both apoplastic and intracellular. As expected, the apoplastic proteome is significantly enriched for proteins containing a SignalP-predicted signal peptide, while the intracellular proteins and proteins present both in the apoplast and intracellular are significantly enriched for proteins lacking a signal peptide (BH-adjusted hypergeometric test, p < 0.001).

Fig. 6
figure 6

Annotation of the N. benthamiana apoplastic proteome. a Correlation matrix heat map of the log2 transformed LFQ intensity of protein groups in the four biological replicates of apoplastic fluid (AF) and total extract (TE) samples. Biological replicates are clustered on similarity. b A volcano plot is shown plotting log2 fold difference (log2FC) of AF/TE over –log10 BH-adjusted moderated p-values. Proteins with log2FC ≥ 1.5 and p ≤ 0.01 were considered apoplastic, as well as those only found in AF. Conversely, proteins with a log2FC ≤ 1.5 and p ≤ 0.01 were considered intracellular, as well as those found only in TE. c Percentage of proteins in each fraction annotated with biological process-associated GO-SLIM terms. d Percentage of proteins in each fraction annotated with molecular function-associated GO-SLIM terms. c and d GO-SLIM annotations are shown when significantly enriched or depleted (BH-adjusted hypergeometric test, p < 0.05) in at least one of the fractions (AF, TE, or both). Each bubble indicates the percentage of genes containing that specific GO-SLIM annotation in that compartment. Colours indicate whether the GO-SLIM annotations are enriched or depleted in that fraction (p < 0.05, n.s., non-significant).

The apoplastic proteome is enriched for proteins acting in catabolic processes and carbohydrate and lipid metabolic processes (Fig. 6c), which is reflected in the enrichment of peptidases, glycosidases and other hydrolases (Fig. 6d, Additional file 3: Table S2 for a full list). Proteins considered predominantly intracellular are enriched for GO-SLIM terms associated with translation, photosynthesis and transport as biological processes (Fig. 6c), and a similar pattern is seen for GO-SLIM terms describing molecular functions (Fig. 6d). Proteins present both in TE and in AF are enriched for GO-SLIM terms associated with biosynthetic processes and homeostasis (Fig. 6c). These processes are usually performed by proteins acting at multiple subcellular localizations.

To specify which peptidases are enriched in the apoplast, we assigned the PFAM-annotations to MEROPS peptidase identifiers [40]. Three of the 15 different families of peptidases detected in the apoplast have significantly more members enriched in the AF as compared to TE: the subtilase family (S08; 13 members, p < 0.001), serine carboxypeptidase-like family (S10; 8 members, p < 0.01), and aspartic peptidase family (A01; 17 members, p < 0.001). By contrast, the proteasome is enriched in the intracellular fraction (T01; 26 members, p < 0.001) (BH-adjusted hypergeometric test, Additional file 4: Table S3 for a full list).


Homology-guided re-annotation of the Nicotiana benthamiana proteins resulted in an improved proteome for the alloploid model plant Nicotiana benthamiana. This approach identifies many genes missing from previous annotations, and improves the annotation of exon-intron boundaries and overcomes problems associated with mis-assignment of short reads from inactivated homeologs. By removing sequences with < 60% coverage to well-annotated Nicotiana proteins, we removed partial sequences caused by frameshift mutations and premature stop codons. Besides the core NbD dataset containing 53,411 coding sequences, we provide a supplemental NbE dataset with 21,391 coding sequences, including homeologs. Both the core NbD dataset and the combined NbDE datasets have longer protein sequences with increased BUSCO scores and higher coverage to the tomato proteome. These datasets also have improved frequency of PFAM annotations while maintaining a relatively low number of entries. Both datasets outperform the preceding datasets in the annotation of spectra during proteomics experiments. These datasets provide the research community with improved capacity to annotate spectra during proteomics experiments. These datasets will provide a valuable basis for further genome annotation. Further improvements of the genome annotation of N. benthamiana will involve long read genome sequencing and mapping of RNA sequencing datasets using a single plant lineage. This may remove remaining misannotated intron-exon boundaries and more accurately describe the expression of splice variants and predict the accumulation of truncated proteins. However, until such full genome annotation is available, the NbDE proteomics dataset presented with this work will be valuable for the research community.


Sequence retrieval

Table 2 summarizes the genomes and genome annotations used in this study.

Table 2 Genomes and gene annotations useda


In order to extract gene-models from the published N. benthamiana draft genomes we combined NCBI Nicotiana RefSeq protein sequences in one dataset, removed all partial proteins and those containing undetermined residues, added 110 genes which we had previously been manually curated and 565 full-length N. benthamiana proteins from GenBank leading to a dataset containing 226,543 protein sequences. We used CD-HIT (v4.6.8) [16] to cluster these sequences at a 95% identity threshold and reduce the redundancy in our dataset while removing partials (Nicotiana_db95; 85,453 sequences). This dataset was used to annotate the gene-models in the different N. benthamiana genome builds using Scipio version 1.4.1 [23] which was run to allow for duplicated genes. After running Scipio we used Augustus (v3.3) [49] to extract complete and partial gene models. Transdecoder (v5.5.0) [19] was used to retrieve the single-best ORF containing homology to the Nicotiana_db95 dataset as determined by BLASTP for gene models containing internal stop codons. The predicted proteins were aligned to the NCBI Nicotiana RefSeq protein sequences using BLASTP (NCBI-BLAST v2.8.1+;), and only those with a coverage ≥60% in the Niben1.0.1 genome assembly, or ≥ 90% in the Niben0.4.4, Nbv0.3, and Nbv0.5 genome assemblies were maintained. A custom R script was used to convert extracted Niben1.0.1 gene models in a full genome annotation, which were manually inspected resulting in NbD. CD-HIT-2D was used at 100% identity to identify gene models missing in NbD, but present in the extracted gene models from the other genome assemblies (the NbE set). The addition of these sequences to NbD resulted in the NbDE annotation. This dataset was annotated using SignalP (v4.1) [35], PFAM (v32) [13] and annotated the with GO terms and UniProt identifiers using Sma3s v2 [4].

We ran BUSCO (v3.0.2; dependencies: NCBI-BLAST v2.8.1+; HMMER v3.1; Augustus v3.3) [47, 57] on the different N. benthamiana predicted proteomes using the plants set (Embryophyta_odb9), to validate their completeness. Additionally, we used BLASTP to compare these proteins with the NCBI Solanum lycopersicum RefSeq protein sequences.

Sample preparation for proteomics and definition of biological replicates

Four-week old N. benthamiana plants were used. The AF was extracted by vacuum infiltrating N. benthamiana leaves with ice-cold MilliQ. Leaves were dried to remove excess liquid, and apoplastic fluid was extracted by centrifugation of the leaves in a 20 ml syringe barrel (without needle or plunger) in a 50 ml falcon tube at 2000×g, 4 °C for 25 min. Samples were snap-frozen in liquid nitrogen and stored at − 80 °C prior to use. TE was collected by removing the central vein and snap-freezing the leaves in liquid nitrogen followed by grinding in a pestle and mortar and addition of three volumes of phosphate-buffered saline (PBS) (w/v). One biological replicate was defined as a sample, AF or TE, consisting of one leaf from three independent plants (3 leaves total). Four independent biological replicates were taken for AF and TE.

Protein digestion and sample clean-up

AF and TE sample corresponding to 15 μg of protein was taken for each sample (based on Bradford assay). Dithiothreitol (DTT) was added to a concentration of 40 mM, and the volume adjusted to 250 μl with MS-grade water (Sigma). Proteins were precipitated by the addition of 4 volumes of ice-cold acetone, followed by a 1 h incubation at − 20 °C and subsequent centrifugation at 18,000×g, 4 °C for 20 min. The pellet was dried at room temperature (RT) for 5 min and resuspended in 25 μL 8 M urea, followed by a second chloroform/methanol precipitation. The pellet was dried at RT for 5 min and resuspended in 25 μL 8 M urea. Protein reduction and alkylation was achieved by sequential incubation with DTT (final 5 mM, 30 min, RT) and iodoacetamide (IAM; final 20 mM, 30 min, RT, dark). Non-reacted IAM was quenched by raising the DTT concentration to 25 mM. Protein digestion was started by addition of 1000 ng LysC (Wako Chemicals GmbH) and incubation for 3 h at 37 °C while gently shaking (800 rpm). The samples were then diluted with ammoniumbicarbonate (final concentration 80 mM) to a final urea concentration of 1 M. 1000 ng Sequencing grade Trypsin (Promega) was added and the samples were incubated overnight at 37 °C while gently shaking (800 rpm). Protein digestion was stopped by addition of formic acid (FA, final 5% v/v). Tryptic digests were desalted on home-made C18 StageTips [39] by passing the solution over 2 disc StageTips in 150 μL aliquots by centrifugation (600–1200×g). Bound peptides were washed with 0.1% FA and subsequently eluted with 80% Acetonitrile (ACN). Using a vacuum concentrator (Eppendorf) samples were dried, and the peptides were resuspended in 20 μL 0.1% FA solution.


The samples were analysed as in [18]. Briefly, samples were run on an Orbitrap Elite instrument (Thermo) [30] coupled to an EASY-nLC 1000 liquid chromatography (LC) system (Thermo) operated in the one-column mode. Peptides were directly loaded on a fused silica capillary (75 μm × 30 cm) with an integrated PicoFrit emitter (New Objective) analytical column packed in-house with Reprosil-Pur 120 C18-AQ 1.9 μm resin (Dr. Maisch), taking care to not exceed the set pressure limit of 980 bar (usually around 0.5–0.8 μl/min). The analytical column was encased by a column oven (Sonation; 45 °C during data acquisition) and attached to a nanospray flex ion source (Thermo). Peptides were separated on the analytical column by running a 140 min gradient of solvent A (0.1% FA in water;; Ultra-Performance Liquid Chromatography (UPLC) grade) and solvent B (0.1% FA in ACN; UPLC grade) at a flow rate of 300 nl/min (gradient: start with 7% B; gradient 7 to 35% B for 120 min; gradient 35 to 100% B for 10 min and 100% B for 10 min) at a flow rate of 300 nl/min.). The mass spectrometer was operated using Xcalibur software (version 2.2 SP1.48) in positive ion mode. Precursor ion scanning was performed in the Orbitrap analyzer (FTMS; Fourier Transform Mass Spectrometry) in the scan range of m/z 300–1800 and at a resolution of 60,000 with the internal lock mass option turned on (lock mass was 445.120025 m/z, polysiloxane) [34]. Product ion spectra were recorded in a data-dependent manner in the ion trap (ITMS) in a variable scan range and at a rapid scan rate. The ionization potential was set to 1.8 kV. Peptides were analysed by a repeating cycle of a full precursor ion scan (1.0 × 106 ions or 50 ms) followed by 15 product ion scans (1.0 × 104 ions or 50 ms). Peptides exceeding a threshold of 500 counts were selected for tandem mass (MS2) spectrum generation. Collision induced dissociation (CID) energy was set to 35% for the generation of MS2 spectra. Dynamic ion exclusion was set to 60 s with a maximum list of excluded ions consisting of 500 members and a repeat count of one. Ion injection time prediction, preview mode for the Fourier transform mass spectrometer (FTMS, the orbitrap), monoisotopic precursor selection and charge state screening were enabled. Only charge states higher than 1 were considered for fragmentation.

Peptide and protein identification

Peptide spectra were searched in MaxQuant (version using the Andromeda search engine [11] with default settings and label-free quantification and match-between-runs activated [9, 10] against the datasets specified in the text including a known contaminants dataset. Included modifications were carbamidomethylation (static) and oxidation and N-terminal acetylation (dynamic). Precursor mass tolerance was set to ±20 ppm (first search) and ± 4.5 ppm (main search), while the MS/MS match tolerance was set to ±0.5 Da. The peptide spectrum match FDR and the protein FDR were set to 0.01 (based on a target-decoy approach) and the minimum peptide length was set to 7 amino acids. Protein quantification was performed in MaxQuant [52], based on unique and razor peptides including all modifications.

Proteomics processing in R

Identified protein groups were filtered for reverse and contaminants proteins and those only identified by matching, and only those protein groups identified in 3 out of 4 biological replicates of either AF or TE were selected. The LFQ values were log2 transformed, and missing values were imputed using a minimal distribution as implemented in imputeLCMD (v2.0) [26]. A moderated t-test was used as implemented in Limma (v3.34.3) [36, 41] and adjusted using Benjamini–Hochberg (BH) adjustment to identify protein groups significantly differing between AF and TE. Bonafide apoplastic protein groups were those only detected in AF and those significantly (p ≤ 0.01) log2 fold change ≥1.5 in AF samples. Protein groups only detected in TE and those significantly (p ≤ 0.01) log2 fold change ≤ − 1.5 depleted in AF samples were considered intracellular. The remainder was considered both apoplastic and intra-cellular. Majority proteins were annotated with SignalP, PFAM [16], MEROPS (v12) [40], GO, and UniProt keywords identifiers. A BH-adjusted Hypergeometric test was used to identify those terms that were either depleted or enriched (p ≤ 0.05) in the bonafide AF protein groups as compared to bonafide AF depleted proteins or protein groups present both in the AF and TE.

Phylogenetic analysis

Predicted proteomes were annotated with PFAM identifiers, and all sequences containing a Peptidase S8 (PF00082) domain were extracted from the different datasets. Additionally, we manually curated the subtilase gene-family in the Niben1.0.1 draft genome, identifying truncated proteins that were annotated as protein-encoding genes, as well as missing genes and incorrect gene models or genes in which the reference sequence was absent in Niben1.0.1. Tomato subtilases were retrieved from Solgenomics, and other previously characterized subtilases [51] were retrieved from NCBI. Clustal Omega [28, 46] was used to align these sequences. The truncated-gene products sequences were substituted with the best blast hit in NCBI to visualize subtilase inactivation in the alignment and phylogenetic tree. Determining the best model for maximum likelihood phylogenetic analysis and the phylogenetic analysis was performed in MEGA X [25]. The evolutionary history was inferred by using the Maximum Likelihood method based on the Whelan and Goldman model. A discrete Gamma distribution was used to model evolutionary rate differences among sites, and the rate variation model allowed for some sites to be evolutionarily invariable. All positions with less than 80% site coverage were eliminated. NbD005386.1 was used to root the phylogenetic trees.

Availability of data and materials

The NbD and NbE datasets are available as supplemental information and can also be downloaded from Oxford Research Archive: This new annotation presented in this manuscript is an improved version when compared to the NbC dataset that was previously posted on bioRxiv in 2018 ( The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE [55] partner repository ( with the data set identifier PXD010435. Samples FGH01–04 represent AF and FGH05–08 represent TE.





Apoplastic fluid


Aleurain-like protease


Basic local alignment search tool


Benchmarking universal single copy orthologs


Cluster database at high identity with tolerance


Collision induced dissociation




Formic acid


False discovery rate


Fourier transform mass spectrometry




Liquid chromatography


Liquid Chromography - Tandem Mass Spectrometry


Mass spectrometry


National center for biotechnology information


Database of protein families


Papain-like cysteine protease


Room temperature




Total extract


Ultra-performance liquid chromatography


Virus-induced gene silencing


  1. Bally J, Nakasugi K, Jia F, Jung H, Ho SYW, Wong M, Paul CM, Naim F, Wood CC, Crowhurst RN, Hellens RP, Dale JL, Waterhouse PM. The extremophile Nicotiana benthamiana has traded viral defence for early vigour. Nat Plants. 2015;1:15165.

    Article  CAS  Google Scholar 

  2. Bombarely A, Moser M, Amrad A, Bao M, Bapaume L, Barry CS, Bliek M, Boersma MR, Borghi L, Bruggmann R, Bucher M, D’Agostino N, Davies K, Druege U, Dudareva N, Egea-Cortines M, Delledonne M, Fernandez-Pozo N, Franken P, Grandont L, Heslop-Harrison JS, Hintzsche J, Johns M, Koes R, Lv X, Lyons E, Malla D, Martinoia E, Mattson NS, Morel P, Mueller LA, Muhlemann J, Nouri E, Passeri V, Pezzotti M, Qi Q, Reinhardt D, Rich M, Richert-Pöggeler KR, Robbins TP, Schatz MC, Schranz ME, Schuurink RC, Schwarzacher T, Spelt K, Tang H, Urbanus SL, Vandenbussche M, Vijverberg K, Villarino GH, Warner RM, Weiss J, Yue Z, Zethof J, Quattrocchio F, Sims TL, Kuhlemeier C. Insight into the evolution of the Solanaceae from the parental genomes of Petunia hybrida. Nat Plants. 2016;2:16074.

    Article  CAS  Google Scholar 

  3. Bombarely A, Rosli HG, Vrebalov J, Moffett P, Mueller LA, Martin GB. A draft genome sequence of Nicotiana benthamiana to enhance molecular plant-microbe biology research. Mol Plant-Microbe Interact. 2012;25:1523–30.

    Article  CAS  Google Scholar 

  4. Casimiro-Soriguer CS, Muñoz-Mérida A, Pérez-Pulido AJ. Sma3s: a universal tool for easy functional annotation of proteomes and transcriptomes. Proteomics. 2017;17.

  5. Clarkson JJ, Dodsworth S, Chase MW. Time-calibrated phylogenetic trees establish a lag between polyploidisation and diversification in Nicotiana (Solanaceae). Plant Syst Evol. 2017.

  6. Clemente T. Nicotiana (Nicotiana tobaccum, Nicotiana benthamiana). In: Wang K, editor. Agrobacterium protocols. Methods in molecular biology. Clifton: Humana Press; 2006. p. 143–54. Available at: Accessed 4 Jan 2015.

    Chapter  Google Scholar 

  7. Consortium, T.P.G.S. Genome sequence and analysis of the tuber crop potato. Nature. 2011;475:189–95.

    Article  Google Scholar 

  8. Consortium, T.T.G. The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012;485:635–41.

    Article  Google Scholar 

  9. Cox J, Hein MY, Luber CA, Paron I, Nagaraj N, Mann M. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol Cell Proteomics. 2014;13:2513–26.

    Article  CAS  Google Scholar 

  10. Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotech. 2008;26:1367–72.

    Article  CAS  Google Scholar 

  11. Cox J, Neuhauser N, Michalski A, Scheltema RA, Olsen JV, Mann M. Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res. 2011;10:1794–805.

    Article  CAS  Google Scholar 

  12. Doehlemann G, Hemetsberger C. Apoplastic immunity and its suppression by filamentous plant pathogens. New Phytol. 2013;198:1001–16.

    Article  CAS  Google Scholar 

  13. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, Sonnhammer ELL, Hirsh L, Paladin L, Piovesan D, Tosatto SCE, Finn RD. The Pfam protein families database in 2019. Nucl Acids Res. 2019;47:D427–32.

    Article  CAS  Google Scholar 

  14. Fernandez-Pozo N, Rosli HG, Martin GB, Mueller LA. The SGN VIGS tool: user-friendly software to design virus-induced gene silencing (VIGS) constructs for functional genomics. Mol Plant. 2015;8:486–8.

    Article  CAS  Google Scholar 

  15. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A. The Pfam protein families database: towards a more sustainable future. Nucl Acids Res. 2016;44:D279–85.

    Article  CAS  Google Scholar 

  16. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–2.

    Article  CAS  Google Scholar 

  17. Goodin MM, Zaitlin D, Naidu RA, Lommel SA. Nicotiana benthamiana: its history and future as a model for plant–pathogen interactions. Mol Plant-Microbe Interact. 2008;21:1015–26.

    Article  CAS  Google Scholar 

  18. Grosse-Holz FM, Kelly S, Blaskowski S, Kaschani F, Kaiser M, van der Hoorn RAL. The transcriptome, extracellular proteome and active secretome of agroinfiltrated Nicotiana benthamiana uncover a large, diverse protease repertoire. Plant Biotechnol J. 2018;16:1068–84.

    Article  CAS  Google Scholar 

  19. Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, MacManes MD, Ott M, Orvis J, Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, Henschel R, LeDuc RD, Friedman N, Regev A. De novo transcript sequence reconstruction from RNA-seq using the trinity platform for reference generation and analysis. Nat Protocols. 2013;8:1494–512.

    Article  CAS  Google Scholar 

  20. Hirakawa H, Shirasawa K, Miyatake K, Nunome T, Negoro S, Ohyama A, Yamaguchi H, Sato S, Isobe S, Tabata S, Fukuoka H. Draft genome sequence of eggplant (Solanum melongena L.): the representative Solanum species indigenous to the old world. DNA Res. 2014;21:649–60.

    Article  CAS  Google Scholar 

  21. Iorizzo M, Ellison S, Senalik D, Zeng P, Satapoomin P, Huang J, Bowman M, Iovene M, Sanseverino W, Cavagnaro P, Yildiz M, Macko-Podgórni A, Moranska E, Grzebelus E, Grzebelus D, Ashrafi H, Zheng Z, Cheng S, Spooner D, Van Deynze A, Simon P. A high-quality carrot genome assembly provides new insights into carotenoid accumulation and asterid genome evolution. Nat Genet. 2016;48:657–66.

    Article  CAS  Google Scholar 

  22. Keller O, Kollmar M, Stanke M, Waack S. A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics. 2011;27:757–63.

    Article  CAS  Google Scholar 

  23. Keller O, Odronitz F, Stanke M, Kollmar M, Waack S. Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species. BMC Bioinformatics. 2008;9:278.

    Article  Google Scholar 

  24. Kim S, Park M, Yeom S-I, Kim Y-M, Lee JM, Lee H-A, Seo E, Choi J, Cheong K, Kim K-T, Jung K, Lee G-W, Oh S-K, Bae C, Kim S-B, Lee H-Y, Kim S-Y, Kim M-S, Kang B-C, Jo YD, Yang H-B, Jeong H-J, Kang W-H, Kwon J-K, Shin C, Lim JY, Park JH, Huh JH, Kim J-S, Kim B-D, Cohen O, Paran I, Suh MC, Lee SB, Kim Y-K, Shin Y, Noh S-J, Park J, Seo YS, Kwon S-Y, Kim HA, Park JM, Kim H-J, Choi S-B, Bosland PW, Reeves G, Jo S-H, Lee B-W, Cho H-T, Choi H-S, Lee M-S, Yu Y, Do Choi Y, Park B-S, van Deynze A, Ashrafi H, Hill T, Kim WT, Pai H-S, Ahn HK, Yeam I, Giovannoni JJ, Rose JKC, Sørensen I, Lee S-J, Kim RW, Choi I-Y, Choi B-S, Lim J-S, Lee Y-H, Choi D. Genome sequence of the hot pepper provides insights into the evolution of pungency in Capsicum species. Nat Genet. 2014;46:270–8.

    Article  CAS  Google Scholar 

  25. Kumar S, Stecher G, Li M, Knyaz C, Tamura K, Battistuzzi FU. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol Biol Evol. 2018;35:1547–9.

    Article  CAS  Google Scholar 

  26. Lazar, C. (2015) imputeLCMD: a collection of methods for left-censored missing data imputation, Available at: Accessed 25 Feb 2018.

    Google Scholar 

  27. Leitch IJ, Hanson L, Lim KY, Kovarik A, Chase MW, Clarkson JJ, Leitch AR. The ups and downs of genome size evolution in polyploid species of Nicotiana (Solanaceae). Ann Bot. 2008;101:805–14.

    Article  CAS  Google Scholar 

  28. Li W, Cowley A, Uludag M, Gur T, McWilliam H, Squizzato S, Park YM, Buso N, Lopez R. The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res. 2015;43:W580–4.

    Article  CAS  Google Scholar 

  29. Liu H, Ding Y, Zhou Y, Jin W, Xie K, Chen L-L. CRISPR-P 2.0: an improved CRISPR-Cas9 tool for genome editing in plants. Mol Plant. 2017;10:530–2.

    Article  CAS  Google Scholar 

  30. Michalski A, Damoc E, Hauschild J-P, Lange O, Wieghaus A, Makarov A, Nagaraj N, Cox J, Mann M, Horning S. Mass spectrometry-based proteomics using Q Exactive, a high-performance benchtop quadrupole Orbitrap mass spectrometer. Mol Cell Proteomics. 2011;10:M111.011015.

    Article  Google Scholar 

  31. Misas-Villamil JC, van der Hoorn RAL. Enzyme–inhibitor interactions at the plant–pathogen interface. Curr Opin Plant Biol. 2008;11:380–8.

    Article  CAS  Google Scholar 

  32. Naim F, Nakasugi K, Crowhurst RN, Hilario E, Zwart AB, Hellens RP, Taylor JM, Waterhouse PM, Wood CC. Advanced engineering of lipid metabolism in Nicotiana benthamiana using a draft genome and the V2 viral silencing-suppressor protein. PLoS One. 2012;7:e52717.

    Article  CAS  Google Scholar 

  33. Nakasugi K, Crowhurst R, Bally J, Waterhouse P. Combining transcriptome assemblies from multiple de novo assemblers in the Allo-tetraploid plant Nicotiana benthamiana. PLoS One. 2014;9:e91776.

    Article  Google Scholar 

  34. Olsen JV, de Godoy LMF, Li G, Macek B, Mortensen P, Pesch R, Makarov A, Lange O, Horning S, Mann M. Parts per million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol Cell Proteomics. 2005;4:2010–21.

    Article  CAS  Google Scholar 

  35. Petersen TN, Brunak S, von Heijne G, Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011;8:785–6.

    Article  CAS  Google Scholar 

  36. Phipson B, Lee S, Majewski IJ, Alexander WS, Smyth GK. Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Ann Appl Stat. 2016;10:946–63.

    Article  Google Scholar 

  37. Qin C, Yu C, Shen Y, Fang X, Chen L, Min J, Cheng J, Zhao S, Xu M, Luo Y, Yang Y, Wu Z, Mao L, Wu H, Ling-Hu C, Zhou H, Lin H, González-Morales S, Trejo-Saavedra DL, Tian H, Tang X, Zhao M, Huang Z, Zhou A, Yao X, Cui J, Li W, Chen Z, Feng Y, Niu Y, Bi S, Yang X, Li W, Cai H, Luo X, Montes-Hernández S, Leyva-González MA, Xiong Z, He X, Bai L, Tan S, Tang X, Liu D, Liu J, Zhang S, Chen M, Zhang L, Zhang L, Zhang Y, Liao W, Zhang Y, Wang M, Lv X, Wen B, Liu H, Luan H, Zhang Y, Yang S, Wang X, Xu J, Li X, Li S, Wang J, Palloix A, Bosland PW, Li Y, Krogh A, Rivera-Bustamante RF, Herrera-Estrella L, Yin Y, Yu J, Hu K, Zhang Z. Whole-genome sequencing of cultivated and wild peppers provides insights into Capsicum domestication and specialization. Proc Natl Acad Sci U S A. 2014;111:5135–40.

    Article  CAS  Google Scholar 

  38. Reichardt S, Repper D, Tuzhikov AI, Galiullina RA, Planas-Marquès M, Chichkova NV, Vartapetian AB, Stintzi A, Schaller A. The tomato subtilase family includes several cell death-related proteinases with caspase specificity. Sci Rep. 2018;8:10531.

    Article  Google Scholar 

  39. Rappsilber J, Mann M, Ishihama Y. Protocol for micro-purification, enrichment, pre-fractionation and storage of peptides for proteomics using StageTips. Nat Protocols. 2007;2:1896–906.

    Article  CAS  Google Scholar 

  40. Rawlings ND, Barrett AJ, Thomas PD, Huang X, Bateman A, Finn RD. The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Res. 2018;46:D624–32.

    Article  CAS  Google Scholar 

  41. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucl Acids Res. 2015;43:e47.

    Article  Google Scholar 

  42. Senthil-Kumar M, Mysore KS. Tobacco rattle virus–based virus-induced gene silencing in Nicotiana benthamiana. Nat Protocols. 2014;9:1549–62.

    Article  CAS  Google Scholar 

  43. Senthil-Kumar M, Wang M, Chang J, Ramegowda V, del Pozo O, Liu Y, Doraiswamy V, Lee H-K, Ryu C-M, Wang K, Xu P, Eck JV, Chakravarthy S, Dinesh-Kumar SP, Martin GB, Mysore KS. Virus-induced gene silencing database for phenomics and functional genomics in Nicotiana benthamiana. Plant Direct. 2018;2:e00055.

    Article  Google Scholar 

  44. Sierro N, Battey JN, Ouadi S, Bovet L, Goepfert S, Bakaher N, Peitsch MC, Ivanov NV. Reference genomes and transcriptomes of Nicotiana sylvestris and Nicotiana tomentosiformis. Genome Biol. 2013;14:R60.

    Article  Google Scholar 

  45. Sierro N, Battey JND, Ouadi S, Bakaher N, Bovet L, Willig A, Goepfert S, Peitsch MC, Ivanov NV. The tobacco genome sequence and its comparison with those of tomato and potato. Nat Commun. 2014;5:3833.

    Article  CAS  Google Scholar 

  46. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Mol Systems Biol. 2011;7:539.

    Article  Google Scholar 

  47. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2.

    Article  Google Scholar 

  48. Sparkes IA, Runions J, Kearns A, Hawes C. Rapid, transient expression of fluorescent fusion proteins in tobacco plants and generation of stably transformed plants. Nat Protocols. 2006;1:2019–25.

    Article  CAS  Google Scholar 

  49. Stanke M, Schöffmann O, Morgenstern B, Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 2006;7:62.

    Article  Google Scholar 

  50. Stoger E, Fischer R, Moloney M, Ma JK-C. Plant molecular pharming for the treatment of chronic and infectious diseases. Ann Rev Plant Biol. 2014;65:743–68.

    Article  CAS  Google Scholar 

  51. Taylor A, Qiu Y-L. Evolutionary history of subtilases in land plants and their involvement in symbiotic interactions. Mol Plant Microbe Interact. 2017;30:489–501.

    Article  CAS  Google Scholar 

  52. Tyanova S, Temu T, Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protocols. 2016;11:2301–19.

    Article  CAS  Google Scholar 

  53. Vaattovaara A, Leppala J, Salojarvi J, Wrzaczek M. High-throughput sequencing data and the impact of plant gene annotation quality. J Exp Bot. 2019;70:1069–76.

    Article  Google Scholar 

  54. Jutras PV, Grosse-Holz F, Kaschani F, Kaiser M, Michaud D, Van der Hoorn RAL. Activity-based proteomics reveals nine target proteases for the recombinant protein-stabilising inhibuitor SlCYS8 in Nicotiana benthamiana. Plant Biotechn J. 2019; In press.

  55. Vizcaíno JA, Csordas A, del-Toro N, Dianes JA, Griss J, Lavidas I, Mayer G, Perez-Riverol Y, Reisinger F, Ternent T, Xu Q-W, Wang R, Hermjakob H. 2016 update of the PRIDE database and its related tools. Nucl Acids Res. 2016;44:D447–56.

    Article  Google Scholar 

  56. Wang X, Bennetzen JL. Current status and prospects for the study of Nicotiana genomics, genetics, and nicotine biosynthesis genes. Mol Gen Genomics. 2015;290:11–21.

    Article  CAS  Google Scholar 

  57. Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, Zdobnov EM. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol Biol Evol. 2018;35:543–8.

    Article  CAS  Google Scholar 

  58. Wu C-H, Abd-El-Haliem A, Bozkurt TO, Belhaj K, Terauchi R, Vossen JH, Kamoun S. NLR network mediates immunity to diverse plant pathogens. Proc Natl Acad Sci U S A. 2017;114:8113–8.

    Article  CAS  Google Scholar 

  59. Xu S, Brockmöller T, Navarro-Quezada A, Kuhl H, Gase K, Ling Z, Zhou W, Kreitzer C, Stanke M, Tang H, Lyons E, Pandey P, Pandey SP, Timmermann B, Gaquerel E, Baldwin IT. Wild tobacco genomes reveal the evolution of nicotine biosynthesis. Proc Natl Acad Sci U S A. 2017;114:6133–8.

    Article  CAS  Google Scholar 

  60. Bolger A, Scossa F, Bolger ME, Lanz C, Maumus F, Tohge T, Quesneville H, Alseekh, Sørensen I, Lichtenstein G, Fich EA, Conte M, Keller H, Schneeberger K, Schwacke R, Ofner I, Vrebalov Julia, Xu Y, Osorio Sonia, Aflitos SA, Schijlen E, Jiménez-Goméz JM, Ryngajllo M, Kimura S, Kumar R, Koenig D, Headland LR, Maloof JN, Sinha N, van Ham RCHJ, Lankhorst RK, Mao L, Vogel A, Arsova B, Panstruga R, Fei Z, Rose JKC, Zamir D, Carrari F, Giovannoni JJ, Weigel D, Usadel B, Fernie AR. The genome of the stress-tolerant wild tomato species Solanum pennellii. Nature Genetics. 2014;46(9):1034–8.

Download references


We would like to thank Philippe Varennes-Jutras and Daniela Sueldo for critically reading the manuscript and providing important suggestions for improving the manuscript.


This work has been supported by ‘The Clarendon Fund’ (JK, FH), ERC Consolidator grant 616449 ‘GreenProteases’ (RvdH, FGH) and BBSRC grants BB/R017913/1 and BB/S003193/1 (RvdH). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations



Conceptualization: JK, RALvdH; formal analysis: JK; funding acquisition: RALvdH; wetlab experiments: FMG-H; proteomics: FK, MK; programming: JK, FH; writing: JK, RALvdH. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Renier A. L. van der Hoorn.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1:

Figure S1. Comparison of Solanaceae proteomes. Figure S2. Phylogenetic analysis of the subtilase gene-family with names. Figure S3. Phylogenetic analysis of the subtilase gene-family of tomato and Arabidopsis and including other previously characterized subtilases. (PDF 518 kb)

Additional file 2:

Table S1. Gene-model comparison. (XLSX 18 kb)

Additional file 3:

Table S2. GO-SLIM term enrichment complete at p ≤ 0.05. (XLSX 23 kb)

Additional file 4:

Table S3. MEROPS family term enrichment complete. (XLSX 14 kb)

Additional file 5:

Dataset S1. New Niben1.0.1 gff3 annotation. (GFF 58963 kb)

Additional file 6:

Dataset S2. FASTA file of NbE genomic sequence ±1 kb. (FASTA 222960 kb)

Additional file 7:

Dataset S3. gff3 annotation of NbE gene-models. (GFF 33474 kb)

Additional file 8:

Dataset S4. NbD proteome. (FASTA 26317 kb)

Additional file 9:

Dataset S5. NbD transcriptome. (FASTA 71108 kb)

Additional file 10:

Dataset S6. NbDE proteome. (FASTA 36990 kb)

Additional file 11:

Dataset S7. NbDE transcriptome. (FASTA 99903 kb)

Additional file 12:

Dataset S8. Sma3s v2 annotation of NbDE. (TSV 158675 kb)

Additional file 13:

Dataset S9. PFAM32 annotation of NbDE. (TSV 18754 kb)

Additional file 14:

Dataset S10. SignalP4.1 annotation of NbDE. (TSV 5130 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Kourelis, J., Kaschani, F., Grosse-Holz, F.M. et al. A homology-guided, genome-based proteome for improved proteomics in the alloploid Nicotiana benthamiana. BMC Genomics 20, 722 (2019).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Solanaceae
  • Genome annotation
  • Nicotiana benthamiana
  • Proteomics
  • Subtilases