PNAC: a protein nucleolar association classifier
© Scott et al; licensee BioMed Central Ltd. 2011
- Received: 30 September 2010
- Accepted: 27 January 2011
- Published: 27 January 2011
Although primarily known as the site of ribosome subunit production, the nucleolus is involved in numerous and diverse cellular processes. Recent large-scale proteomics projects have identified thousands of human proteins that associate with the nucleolus. However, in most cases, we know neither the fraction of each protein pool that is nucleolus-associated nor whether their association is permanent or conditional.
To describe the dynamic localisation of proteins in the nucleolus, we investigated the extent of nucleolar association of proteins by first collating an extensively curated literature-derived dataset. This dataset then served to train a probabilistic predictor which integrates gene and protein characteristics. Unlike most previous experimental and computational studies of the nucleolar proteome that produce large static lists of nucleolar proteins regardless of their extent of nucleolar association, our predictor models the fluidity of the nucleolus by considering different classes of nucleolar-associated proteins. The new method classifies all human proteins as nucleolar-enriched, nucleolar-nucleoplasmic, nucleolar-cytoplasmic or non-nucleolar. Leave-one-out cross-validation tests reveal sensitivity values for these four classes ranging from 0.72 to 0.90 and positive predictive values ranging from 0.63 to 0.94. The overall accuracy of the classifier was measured to be 0.85 on an independent literature-based test set and 0.74 using a large independent quantitative proteomics dataset. While the three nucleolar-association groups display vastly different Gene Ontology biological process signatures and evolutionary characteristics, they collectively represent the best characterised nucleolar functions.
Our proteome-wide classification of nucleolar association provides a novel representation of the dynamic content of the nucleolus. This model of nucleolar localisation thus increases the coverage while providing accurate and specific annotations of the nucleolar proteome. It will be instrumental in better understanding the central role of the nucleolus in the cell and its interaction with other subcellular compartments.
- Gene Ontology
- Positive Predictive Value
- Reliability Index
- Nucleolar Protein
- Biological Process Term
The nucleolus was initially characterised over four decades ago and shown to be the site of ribosome subunit production . It is now known to play a role in other cellular activities, including assembly of diverse ribonucleoprotein particles (RNPs), cell cycle progression and proliferation regulation, as well as the response to numerous forms of cellular stress [2–6]. All of the proteins that are strongly enriched in the nucleolus, including marker proteins such as fibrillarin, can nonetheless cycle continually in and out of the nucleolus, as discovered by photobleaching experiments . In addition, many of the processes that occur, at least in part, in the nucleolus require the re-location of proteins to this nuclear sub-compartment. Many proteins are able to conditionally relocate between either the nucleoplasm, or other nuclear sub-compartments and the nucleolus [3, 4]. In addition to the 'part-time' nucleolar proteins which remain in the nucleus, many proteins are known to travel between the cytoplasm (including cytoplasmic organelles) and the nucleolus. These include ribosomal and non-ribosomal proteins that travel to the nucleolus for assembly into ribosome subunits and other RNPs respectively, as well as many growth factors and cell cycle regulators [2, 4, 8]. The nucleolus thus accommodates a large amount of traffic and its composition is very dynamic, which may be facilitated by its lack of a surrounding membrane .
Recent large-scale proteomics experiments have detected thousands of distinct proteins that stably co-purify with nucleoli isolated from human cells [9–11]. Although the first datasets defining the nucleolar proteome did not offer information regarding the proportion of each of these proteins in the nucleolus relative to other cellular compartments, this information has now been obtained in a high throughput manner using a combination of cellular fractionation and SILAC protocols . These data indicate that although thousands of distinct proteins are detected in the nucleolus, their degree of association with the nucleolus is variable. Some proteins are predominantly nucleolar while others, although detected in small numbers in the nucleolus and annotated as such in large databases, are present in much larger numbers in other cellular compartments. These proteomics data give a snapshot of the content of the nucleoli of a population of one cell type under specific conditions. In comparison to the first nucleolar proteome datasets [9–11], they provide a much clearer picture of the dynamic protein content of the nucleolus and its relationship with other cellular compartments. This methodology also offers the possibility of distinguishing the nucleolar-enriched proteins from the proteins which cycle between the nucleolus and other cellular locations or conditionally localise to the nucleolus. However, because only one cell type and a small number of conditions have been examined so far and because of the current limitation of the methodology, which does not yet offer full proteome coverage, the dynamic nucleolar proteome still has not been fully defined. Here, we investigate how a computational method can help fill this gap.
The prediction of eukaryotic protein subcellular localisation has been extensively investigated over the past decade using various machine learning methods and based on many diverse protein characteristics (reviewed in ). However, while many such predictors exist, most do not consider the nucleolus as a separate localisation: very few whole-cell predictors include the nucleolus in the list of cellular compartments to which they predict localisation [14–17]. Several nuclear-centric mammalian protein localisation predictors have been created to predict membership to one of at least four nuclear sub-compartments including the nucleolus [18–22]. However, proteins annotated as being in more than one subnuclear compartment are often not considered, thus substantially decreasing their actual coverage of the nuclear proteome. Because the individual nuclear subcompartments are not membrane-enclosed, it is expected that a significant proportion of nuclear proteins diffuse between these subcompartments and will be detected and annotated as present in several of these compartments. Thus these nuclear-centric predictors likely do not realistically model localisation patterns of nuclear proteins.
The prediction of nucleolar protein localisation has been investigated mainly in the context of a binary classification problem where proteins are predicted to be either associated with the nucleolus, or not. Such studies include a predicted nucleolar complex dataset, which is based on the clustering of protein-protein interactions, involving human proteins either detected experimentally in the nucleolus, or predicted to be nucleolar using a neural network . More recent studies include a naïve Bayesian classifier trained to predict yeast nucleolar proteins and ribosomal components , a sequence-based support-vector machine predictor that differentiates between nucleolar associated and non-nucleolar associated nuclear mammalian proteins  as well as a kernel canonical correlation analysis predictor based on genomic sequence and protein-protein interaction data that also differentiates between nucleolar associated and non-nucleolar associated nuclear mammalian proteins .
Architecture of the protein nucleolar association classifier
nucleolar-enriched proteins which correspond to proteins that are accumulated predominantly in the nucleolus and likely include the core nucleolar proteins.
nucleolar-nucleoplasmic proteins which can localise both to the nucleolus and to other nuclear regions. They either cycle between the nucleolus and other nuclear regions or are mainly nucleoplasmic but relocate to the nucleolus under specific conditions.
nucleolar-cytoplasmic proteins which can localise to the nucleolus and the cytoplasm (or even extracellularly). They either localise to the nucleolus for assembly into larger complexes but then function mainly in the cytoplasm, cycle between these compartments, or are predominantly cytoplasmic but relocate to the nucleolus under specific conditions.
non-nucleolar proteins which show little or no localisation in the nucleolus.
Because such detailed annotations describing dynamic characterisations of proteins are rarely available in large public databases, extensive manual curation of the nucleolus literature was required to create the datasets for this study. Additional File 1 shows experimentally determined proteins belonging to the three nucleolar-association classes (the nucleolar-enriched, nucleolar-nucleoplasmic and nucleolar-cytoplasmic classes).
Features considered in the prediction of nucleolar association
Amino acid frequency
Protein sequences from IPI 
PNAC considers the relative proportion of leucine, isoleucine, lysine and serine residues
5 bins for each distinct amino acid considered
The predicted presence of signal peptides, transmembrane domains (TMDs), mitochondrial targeting peptides and nucleolar localisation sequences (NoLSs)
9 bins detailed in the Methods
GDS596 from the Gene Expression Omnibus 
The average Pearson correlation of expression between the query protein and proteins in the nucleolar-cytoplasmic training group using expression profiles from 79 physiologically normal tissues 
EBI Gene Ontology (GO) annotations  for human
Biological process and molecular function Gene Ontology (GO) annotations for the query protein are compared to those of the training set proteins
Subcellular localisation of interactors
A nucleolar proximity score is calculated for all the interactors of the query protein
Each predictive feature is used to calculate a score for the presence of a given protein in each of the nucleolar association classes described above. For each feature, the association score for a given class is evaluated by considering the relative proportion of proteins from that class that have a specific state (i.e. that fall in a particular bin of that feature). The features are trained on manually curated nucleolar datasets (listed in Additional File 1) and a randomly generated non-nucleolar dataset. The final classification results from the product of the initial class priors and the scores derived from the individual predictive features as detailed in the Methods section. Proteins are annotated as belonging to their highest scoring class.
Statistical tests of accuracy
Cross validation analysis
Table 2. Tests of accuracy (Test set 1, Test set 2, Test set 3)
Independent literature test
The classifier was then tested on a literature-based test set which consists of proteins reported to be nucleolar-associated in the literature but that were not used to train the predictor (as described in the Methods). Once again, all classes obtain values well above what would be expected by chance. The overall accuracy of the predictor using the independent literature test is 0.85.
SILAC independent test
There are several reasons why the SILAC test accuracy values are lower than those of the other tests. Firstly, the thresholds used to define the nucleolar association class to which SILAC characterised proteins belong were determined using a very small number of proteins. As shown in Figure 2 (purple box), the intersection between the literature curated dataset (blue list) and the SILAC-derived dataset (red list), which was used to map SILAC experimental ratios into the nucleolar-association classes only consists of fourteen proteins. The thresholds most likely do not perfectly define the nucleolar association groups and will improve when the intersection between the literature-curated dataset and the SILAC-derived datasets increases in size. Secondly, the SILAC test set consists of a larger number of proteins than considered in the other tests (see Table 2, SILAC test set Counts) and is most likely not biased towards proteins that are well annotated, as the much smaller literature datasets would be. As the predictions depend in part on annotations (such as scores generated using the GO and subcellular localisation-derived features), the prediction accuracy will increase as proteins become better annotated. Finally, the SILAC test set is not characterising nucleolar association exactly in the same way as we seek to do here. As such, the two approaches are somewhat complementary in their aims. The SILAC-derived dataset investigates one cell type under normal growth conditions, thus providing a snapshot of the abundance of the proteins in each of the compartments considered under those conditions. In contrast, our method aims to classify human proteins according to their degree of nucleolar association under any possible condition and in all cell types. 
So while the SILAC test set is useful to increase the coverage of our test sets and to investigate the classifier accuracy on poorly characterised proteins, it is not a perfect match for testing our classifier. This is particularly obvious for proteins that associate with the nucleolus only under specific conditions that were not examined in the SILAC-derived dataset. This most notably concerns nucleolar-nucleoplasmic and nucleolar-cytoplasmic proteins.
Examples of disagreements between SILAC classification and our predictions
Experimental observations from literature
Non-nucleolar (highly cytoplasmic)
Non-nucleolar (highly cytoplasmic)
Ribosomal protein which accumulates in the nucleolus 
Non-nucleolar (mainly nucleoplasmic)
The Human Proteome Atlas finds it in the nucleolus, nucleus and cytoplasm 
Non-nucleolar (mainly cytoplasmic but also nucleoplasmic)
The Human Proteome Atlas finds it to be strongly nucleolar 
Annotated in Uniprot as nucleolar, cytoplasmic and centromere
Highly enriched in nucleolus
Annotated in Uniprot as predominantly perinucleolar in G1 and in later phases predominantly localised in the nuclear matrix 
- Proteins that are conditionally localised to the nucleolus: these are proteins that are generally highly non-nucleolar but translocate to the nucleolus under specific conditions. For example, several heat shock proteins including HSPA8 are usually cytoplasmic but are known to translocate to the nucleolus after heat-shock, which was not a condition considered in the SILAC analysis (see Table 3).
- Proteins that are cyclically localised to the nucleolus, often in a cell-cycle-dependent manner, for example RCC2 and KI-67 in Table 3.
- Unknown proteins for which little information is available to confirm the true localisation.
In all these cases, a disagreement between the two classification methods warrants further investigation.
Biological process annotations of nucleolar-associated proteins
Most abundant biological process GO annotations of nucleolar-associated proteins with reliability index above 10
Biological process GO term
RNA metabolic process (GO:0016070)
of which rRNA processing (GO:0006364)
tRNA processing (GO:0008033)
Transcription, DNA-dependent (GO:0006351)
Cellular component organization (GO:0016043)
Ribosome biogenesis (GO:0042254)
of which rRNA processing (GO:0006364)
Regulation of biological process (GO:0050789)
Nucleobase, nucleoside, nucleotide and nucleic acid metabolic process (GO:0006139)
of which DNA repair (GO:0006281)
DNA replication (GO:0006260)
Regulation of biological process (GO:0050789)
of which Regulation of transcription, DNA-dependent (GO:0006355)
Signal transduction (GO:0007165)
Cellular component organization (GO:0016043)
of which Chromosome organisation (GO:0051276)
Cell cycle (GO:0007049)
Multicellular organismal development (GO:0007275)
Cell proliferation (GO:0008283)
Cell death (GO:0008219)
Protein metabolic process (GO:0019538)
of which Translation (GO:0006412)
Nucleobase, nucleoside, nucleotide and nucleic acid metabolic process (GO:0006139)
Regulation of biological process (GO:0050789)
of which Signal transduction (GO:0007165)
Cellular component organisation (GO:0016043)
of which Organelle organisation (GO:0006996)
In contrast, the 513 nucleolar-nucleoplasmic and 469 nucleolar-cytoplasmic classified proteins with reliability index above 10 are annotated with a wide variety of different terms. As shown in Table 4, the biological process term annotating the largest number of nucleolar-nucleoplasmic proteins is nucleobase, nucleoside, nucleotide and nucleic acid metabolic process which includes 52 proteins involved in DNA repair and 43 in DNA replication. Other biological process terms highly populated with nucleolar-nucleoplasmic proteins include cell cycle, chromosome organisation, regulation of transcription, DNA-dependent and signal transduction, in agreement with more recently described nucleolar-associated functions. Unsurprisingly, in the case of nucleolar-cytoplasmic proteins, the most predominant biological process is protein metabolic process, which includes 106 proteins annotated with the term translation.
Analysis of the evolution of nucleolar-associated proteins
The nucleolar-cytoplasmic group, which consists largely of proteins involved in translation (see Table 4), has a very high fraction of human proteins with orthologues in other organisms. This is consistent with the slow evolution and high conservation that has previously been shown for many proteins of this group. Of the 202 nucleolar-cytoplasmic human proteins considered, the fraction with orthology to the non-mammalian organisms considered ranges from 42% (for Giardia lamblia) to 69% (for Drosophila melanogaster). In contrast, of the 12725 human proteins considered for the non-nucleolar group, between 8% (for Giardia lamblia) and 55% (for Danio rerio) have orthologues in the non-mammalian eukaryotic organisms considered. Nucleolar-enriched and nucleolar-nucleoplasmic proteins are often not as well conserved as nucleolar-cytoplasmic proteins, especially in the most distant non-mammalian eukaryotic organisms considered, but are significantly more conserved than non-nucleolar proteins.
In mammals, the nucleolar-enriched and nucleolar-nucleoplasmic groups have the highest fraction of orthologues with between 81% (in the case of human nucleolar-nucleoplasmic proteins with orthology to Rattus norvegicus) and 95% (in the case of human nucleolar-enriched proteins with orthology to Pan troglodytes) of their proteins having mammalian orthologues.
Thus many of the central processes carried out, at least in part, by nucleoli exist in all eukaryotes considered and, compared to non-nucleolar proteins, a much higher proportion of nucleolar-associated proteins are conserved amongst eukaryotic organisms.
In an effort to predict and describe the nucleolar proteome, we investigated the integration of various gene and protein features and annotations in a naïve Bayesian framework. To help differentiate between core-nucleolar proteins and proteins that associate with the nucleolus temporarily but also function in other compartments, the training set was subdivided into four groups: nucleolar-enriched, nucleolar-nucleoplasmic, nucleolar-cytoplasmic and non-nucleolar proteins. This classification scheme provides information regarding the nucleolar-association potential of all human proteins in a manner that is neither cell-type, nor condition-specific. An analysis of our proteome-wide nucleolar-association predictions reveals that these groups display widely varying evolutionary characteristics and biological process signatures. This classification provides a clearer picture of the protein content of the nucleolus as well as its numerous and central roles in the cell and its interaction with other subcellular compartments.
Human proteins experimentally detected in the nucleolus were manually curated from the literature and inserted into one of three groups depending on their degree of association with the nucleolus:
The nucleolar-enriched class consists of proteins found to be predominantly nucleolar in all cell types and conditions considered (for examples, see Additional File 1). Thirty proteins are included in the nucleolar-enriched training set.
The nucleolar-nucleoplasmic class is composed of nuclear proteins that are identified in several nuclear regions including the nucleolus. This includes proteins that cycle between the nucleolus and other nuclear regions and proteins that localise primarily to non-nucleolar nuclear regions but relocate to the nucleolus under specific conditions (for examples, see Additional File 1). Twenty-two proteins form the nucleolar-nucleoplasmic training set.
The nucleolar-cytoplasmic class consists of proteins that are mainly cytoplasmic but have been detected in the nucleolus. This includes proteins that cycle between the cytoplasm and the nucleolus, cytoplasmic proteins that are assembled into larger complexes in the nucleolus as well as cytoplasmic proteins that are detected in the nucleolus under specific conditions (for examples, see Additional File 1). The nucleolar-cytoplasmic training set consists of twenty-four proteins.
The PNAC classifier was trained on 200 randomly chosen non-nucleolar proteins as well as all proteins from the manually curated nucleolar datasets (listed in Additional File 1) whose earliest nucleolar association literature reference (according to Additional File 1) has a PubMed ID smaller than 17470000 (which corresponds approximately to the first half of 2007). While most of the nucleolar association literature references considered here describe work performed in a small-scale manner (see references in Additional File 1), three large-scale projects were included [9–11], mainly to ensure the presence of ribosomal proteins in the nucleolar-cytoplasmic dataset. Two of these projects [10, 11] were considered to generate the training set while the third was considered to generate test set 2 (see below), even though its PubMed ID is below 17470000. The training set generation scheme is depicted in Figure 2.
The accuracy of the PNAC classifier was measured using three different test sets (Figure 2):
Test set 1
A leave-one-out cross-validation test in which one training set protein is set aside for testing purposes and the classifier is trained on all the remaining training set proteins. This is repeated for all training set proteins.
Test set 2
An independent, literature-based test in which the classifier is trained on all training set proteins and then tested on the remaining literature-curated proteins whose earliest PubMed ID nucleolar association reference (according to Additional File 1) is greater than 17470000 as illustrated in Figure 2. As explained above (Training set section), some ribosomal proteins reported to be nucleolar-associated in  were also included in this test set even though its PubMed ID is below 17470000, to ensure the presence of ribosomal proteins in test set 2. These ribosomal proteins were not included in the training set.
Test set 3
An independent experimental dataset generated using SILAC (stable isotope labelling with amino acids in cell culture). This dataset consists of a list of proteins whose relative abundance has been measured by harvesting nucleolar, nucleoplasmic and cytoplasmic cellular extracts each grown in the presence of amino acids labelled with different isotopes and then by pooling together the different fractions and analysing them by mass spectrometry . Each protein is thus assigned two ratios: a nucleolar versus cytoplasmic ratio and a nucleoplasmic versus cytoplasmic ratio which define the relative abundance of the protein in these three compartments. The SILAC independent protein dataset was partitioned into five groups (nucleolar-enriched, nucleolar-nucleoplasmic, nucleolar-cytoplasmic, non-nucleolar, undefined) depending on their nucleolar versus cytoplasmic and nucleoplasmic versus cytoplasmic ratios (see Additional File 3). The thresholds used to define the five groups were determined manually by careful consideration of the proteins that are both in the independent literature based test set (test set 2) and in the independent SILAC set (test set 3) as depicted in Figure 2. These proteins that form the intersection of test set 2 and test set 3 are listed in Additional File 4 which also defines the thresholds used to decide to which nucleolar association group different SILAC characterised proteins belong. The fifth group (undefined group) corresponds to all proteins that do not fall into any of the four previously defined groups (their ratios are too different from the ratios of the proteins used to determine the thresholds). The accuracy results shown in Table 2 for the independent SILAC test exclude all proteins that were trained on or used to determine the SILAC thresholds to define the SILAC groups.
Redundancy in the training and test sets was eliminated by ensuring that all proteins are less than 25% identical over their entire sequence to any other protein in the datasets.
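The leave-one-out procedure used for test set 1 can be sketched as follows. This is a minimal illustration rather than the PNAC code: `leave_one_out`, `train_majority` and `predict_majority` are hypothetical names, and the majority-label classifier is only a stand-in so that the example runs on its own.

```python
from collections import Counter

def leave_one_out(examples, train, predict):
    """Hold out each example in turn, train on the rest, predict the held-out one.

    examples: list of (features, label) pairs.
    train:    function mapping a list of examples to a model.
    predict:  function mapping (model, features) to a predicted label.
    """
    predictions = []
    for i, (features, _label) in enumerate(examples):
        training = examples[:i] + examples[i + 1:]  # everything except example i
        model = train(training)
        predictions.append(predict(model, features))
    return predictions

# Hypothetical stand-in classifier: always predicts the most common training label.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]

def predict_majority(model, _features):
    return model

toy = [((0,), "a"), ((1,), "a"), ((2,), "a"), ((3,), "b")]
loo_predictions = leave_one_out(toy, train_majority, predict_majority)
```

In practice the `train` and `predict` arguments would be the classifier's own training and scoring routines, and the predictions would be compared with the curated labels to obtain per-class sensitivity and PPV.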
PNAC considers five distinct features to classify proteins according to their degree of nucleolar association:
1) Amino acid frequency
The frequency of most individual amino acids does not differ significantly between the proteins of the different nucleolar-association training set groups. However, in the case of serine, leucine, isoleucine and lysine, there are significant differences in their frequency between the different nucleolar-association groups. The frequency of each of these amino acids was measured for each protein considered and then the frequencies were grouped into five bins (four for lysine) using thresholds determined empirically.
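As an illustration of this feature, the sketch below computes the frequency of the four residues and maps a frequency to a bin. The function names and the threshold values are hypothetical; the thresholds actually used were determined empirically and are not reproduced here.

```python
def residue_frequencies(sequence, residues="LIKS"):
    """Fraction of the sequence made up of each residue of interest."""
    seq = sequence.upper()
    return {r: seq.count(r) / len(seq) for r in residues}

def to_bin(value, thresholds):
    """Index of the bin that `value` falls into.

    `thresholds` must be sorted in increasing order; k thresholds give k + 1 bins,
    so four thresholds give five bins and three thresholds give four bins.
    """
    return sum(value >= t for t in thresholds)

# Hypothetical thresholds, for illustration only.
EXAMPLE_THRESHOLDS = (0.04, 0.06, 0.08, 0.10)

freqs = residue_frequencies("LLKS")
serine_bin = to_bin(freqs["S"], EXAMPLE_THRESHOLDS)
```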
2) Presence of targeting motifs and transmembrane domains (TMDs)
The presence of signal peptides and number of TMDs were predicted for each protein by Phobius . In addition to that, for each protein, the presence of mitochondrial targeting peptides and of nucleolar localisation sequences (NoLSs) were predicted respectively by TargetP  and NoD . Other targeting motifs were also considered but did not offer the same predictive capability as the chosen targeting motifs.
The results of these predictions were used to define three scores that characterise targeting motifs in a protein:
Mitochondrial score sM = 1 if TargetP predicts a mitochondrial targeting peptide, sM = 0 otherwise.
Secretory-membrane score sS = 1 if Phobius predicts a signal peptide or at least one TMD, sS = 0 otherwise.
NoLS score sN = 2 if the maximum NoLS score output by NoD is ≥ 0.9.
sN = 1 if the maximum NoLS score output by NoD is between 0.8 and 0.9.
sN = 0 if the maximum NoLS score output by NoD is <0.8.
The cyto score sC is defined based on sM and sS such that sC = 2 if sS = 1 regardless of sM, sC = 1 if sM = 1 and sS = 0 and sC = 0 if sM = 0 and sS = 0.
These sC and sN scores were grouped into nine bins representing all possible combinations of their states and their distribution is plotted for each class in Additional File 5.
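The scoring rules above can be written out directly. The sketch below is an illustration only: the function names and the input representation (boolean predictions, a TMD count and a maximum NoD score) are hypothetical, and running Phobius, TargetP and NoD themselves is outside its scope.

```python
def mitochondrial_score(has_mito_peptide):
    """sM: 1 if a mitochondrial targeting peptide is predicted, else 0."""
    return 1 if has_mito_peptide else 0

def secretory_membrane_score(has_signal_peptide, n_tmds):
    """sS: 1 if a signal peptide or at least one TMD is predicted, else 0."""
    return 1 if (has_signal_peptide or n_tmds >= 1) else 0

def nols_score(max_nod_score):
    """sN: 2 for scores >= 0.9, 1 for scores in [0.8, 0.9), 0 below 0.8."""
    if max_nod_score >= 0.9:
        return 2
    if max_nod_score >= 0.8:
        return 1
    return 0

def cyto_score(s_m, s_s):
    """sC: 2 if sS = 1 (regardless of sM), 1 if only sM = 1, 0 otherwise."""
    if s_s == 1:
        return 2
    return 1 if s_m == 1 else 0

def motif_bin(s_c, s_n):
    """Map the (sC, sN) pair to one of nine bins, numbered 0 to 8."""
    return 3 * s_c + s_n

# Example: signal peptide predicted, no mitochondrial peptide, strong NoLS.
s_c = cyto_score(mitochondrial_score(False), secretory_membrane_score(True, 0))
example_bin = motif_bin(s_c, nols_score(0.93))
```

The bin numbering is arbitrary; any one-to-one mapping of the nine (sC, sN) combinations would serve the same purpose.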
3) Gene co-expression with nucleolar-cytoplasmic group
The average Pearson correlation of co-expression between the query protein and proteins in the nucleolar-cytoplasmic training group was calculated for all proteins considered, using expression profiles from 79 physiologically normal tissues . The correlation values were then grouped into four bins containing roughly (within 20%) the same number of proteins using thresholds determined empirically. Gene co-expression correlations with the other nucleolar association groups were also considered but these correlation values were not found to be predictive of nucleolar association. Thus only gene co-expression correlation with the nucleolar-cytoplasmic group was used.
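A minimal sketch of this feature is given below, assuming that each expression profile is a list of values over the same ordered set of tissues. The function names and the bin thresholds are hypothetical; the thresholds used in PNAC were chosen empirically.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mean_coexpression(query_profile, group_profiles):
    """Average correlation between the query and every profile in the group."""
    return sum(pearson(query_profile, p) for p in group_profiles) / len(group_profiles)

def coexpression_bin(value, thresholds):
    """Bin index for a correlation value; three thresholds give four bins."""
    return sum(value >= t for t in thresholds)

# Toy profiles over three tissues; the thresholds are illustrative only.
query = [1.0, 2.0, 3.0]
group = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
avg_r = mean_coexpression(query, group)
```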
4) Gene Ontology (GO) annotations
where Tp is the set of all terms that are associated with protein p, nct is the number of proteins of nucleolar association class c that are annotated with term t and nt is the total number of human proteins annotated with term t.
These ratios are then grouped into one of four bins, depending on which nucleolar association GOscore gc is highest: proteins whose nucleolar-enriched GOscore was highest, proteins whose nucleolar-nucleoplasmic GOscore is highest, proteins whose nucleolar-cytoplasmic GOscore is highest and those whose GOscores all fall below a threshold set to 0.003.
5) Subcellular localisation annotations of interactors
A nucleolar proximity score was calculated for all interactors of the query protein using HPRD  and Uniprot  subcellular localisation annotations. To do so, protein localisation annotations were grouped into four cellular regions which were each assigned a nucleolar proximity distance:
• 0.0 for the nucleolus
• 1.2 for the nucleoplasm, the nuclear speckles, the nuclear pore and the nuclear envelope
• 3.0 for the cytosol, cytoplasm, any of the cytoplasmic organelles, the plasma membrane and extracellular region
• 0.8 for the 'nuclear' annotation as this does not distinguish between nucleolar and nuclear non-nucleolar proteins.
and where I is the set of all interactors of the query protein, Dr is the distance between the nucleolus and cellular region r and Ri is the set of cellular regions to which protein i (the interactor) localises.
The protein interactions considered include all protein pairs predicted by the human protein-protein interaction predictor PIPs to interact with a posterior odds ratio above 4 [37, 38] as well as protein pairs annotated as interacting in HPRD  and IntAct .
The NPIp scores were grouped into 5 bins according to manually selected thresholds that were optimised to minimise the average class error in the leave-one-out cross-validation test. For all tests, care was taken to remove all interactors of the current test protein from consideration in calculating the NPI score of all training proteins.
Protein interaction data have been considered previously in the prediction of protein subcellular localisation, including for whole-cell protein localisation prediction [16, 40, 41] as well as by most of the nucleolar binary predictors [23, 24, 26].
Semi-naïve Bayes classifiers were used to score the likelihood of localisation to the four classes considered, based on the features described above. This learning method was chosen because of its transparency and ease of integration of highly heterogeneous data. The method was trained by counting the number of proteins from the different training classes that fall into each bin. Pseudocounts of 0.1 were added to all bins to ensure that no feature state would obtain infinite scores. The bin counts for each class were then divided by the total number of proteins that fall in the bin, regardless of their class, thus obtaining a conditional probability table for each feature considered. The five different features described above were considered independent and thus the final score for each class is calculated as the product of the initial class prior and the scores calculated for the individual features. The initial class priors were chosen to minimise the average class error in the leave-one-out cross-validation test and are set to 0.2 for the nucleolar-enriched class, 0.15 for the nucleolar-nucleoplasmic class, 0.15 for the nucleolar-cytoplasmic class and 0.5 for the non-nucleolar class. Proteins are labelled as belonging to their highest scoring class.
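The training and scoring scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the PNAC implementation: the function names and the data layout (one bin index per feature for each protein) are hypothetical, and the per-bin total used for normalisation is assumed to include the pseudocounts.

```python
CLASSES = ("nucleolar-enriched", "nucleolar-nucleoplasmic",
           "nucleolar-cytoplasmic", "non-nucleolar")
PRIORS = {"nucleolar-enriched": 0.2, "nucleolar-nucleoplasmic": 0.15,
          "nucleolar-cytoplasmic": 0.15, "non-nucleolar": 0.5}
PSEUDOCOUNT = 0.1

def train(examples, n_bins):
    """Build one table per feature: tables[feature][bin][class] -> score.

    examples: list of (bins, label), where bins holds one bin index per feature.
    n_bins:   number of bins for each feature, in the same order.
    """
    tables = []
    for f, nb in enumerate(n_bins):
        # Every (bin, class) cell starts at the pseudocount.
        counts = [{c: PSEUDOCOUNT for c in CLASSES} for _ in range(nb)]
        for bins, label in examples:
            counts[bins[f]][label] += 1
        table = []
        for bin_counts in counts:
            total = sum(bin_counts.values())  # assumption: total includes pseudocounts
            table.append({c: bin_counts[c] / total for c in CLASSES})
        tables.append(table)
    return tables

def class_scores(tables, bins):
    """Prior multiplied by the per-feature scores, for each class."""
    scores = dict(PRIORS)
    for f, b in enumerate(bins):
        for c in CLASSES:
            scores[c] *= tables[f][b][c]
    return scores

def classify(tables, bins):
    """Label of the highest scoring class."""
    scores = class_scores(tables, bins)
    return max(scores, key=scores.get)

# Toy data: a single feature with two bins.
toy = [((0,), "nucleolar-enriched")] * 3 + [((1,), "non-nucleolar")] * 3
toy_tables = train(toy, [2])
```

With real data, `n_bins` would list the bin counts of the five features, and the non-nucleolar examples would be redrawn for each of the repeated runs.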
Measures of accuracy
For each class, the sensitivity is the fraction of true positives (TP) amongst all the proteins annotated as positive for that class in a given test, i.e. TP / (TP + FN). The positive predictive value (PPV) is the fraction of true positives amongst all the proteins predicted to be positive for that class, i.e. TP / (TP + FP). FN and FP denote the false negative and false positive counts, respectively.
The overall accuracy for a given test is defined as the number of correctly predicted proteins divided by the total number of proteins in the test set.
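These measures can be computed directly from paired lists of annotated and predicted labels. The sketch below is illustrative only; the function names are hypothetical, and returning NaN when a denominator is zero is an assumption not discussed in the text.

```python
def class_metrics(true_labels, predicted_labels, cls):
    """Return (sensitivity, PPV) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    # Assumption: undefined ratios are reported as NaN.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of proteins whose predicted class matches the annotated class."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)
```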
All measures of accuracy presented in Table 2 represent averages over ten runs. All runs are individually trained on the same nucleolar-enriched, nucleolar-nucleoplasmic and nucleolar-cytoplasmic sets but differ in their non-nucleolar sets, which are randomly generated as described in the Methods Dataset section.
The reliability index (RI) of the classification is calculated as the ratio of the score of the highest scoring class to the score of the second highest scoring class.
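As a minimal sketch, the reliability index can be obtained from a dictionary of per-class scores. The function name and the input format are assumptions made for this example.

```python
def reliability_index(scores):
    """Ratio of the highest class score to the second highest class score.

    scores: mapping from class name to its (strictly positive) score.
    """
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] / ranked[1]
```

An RI close to 1 indicates that the two best classes score almost equally, whereas a large RI indicates a clear preference for the predicted class.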
GO biological process annotations of predicted nucleolar-associated proteins
Biological process GO annotations were downloaded for nucleolar-enriched, nucleolar-nucleoplasmic and nucleolar-cytoplasmic classified human proteins with reliability index greater than 10.0. Proteins can be annotated with more than one term.
For each nucleolar-association group, the number of proteins with orthologues in a given organism (as predicted by InParanoid7) was counted and compared to the total number of proteins of this group that are considered by InParanoid7. The standard deviation of these measures was estimated by a bootstrap procedure (using 1,000,000 bootstrap datasets derived using the InParanoid7 orthology predictions for each nucleolar-association group).
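The bootstrap estimate of the standard deviation can be sketched as follows, assuming each group is represented as a list of booleans recording whether a protein has a predicted orthologue in the organism of interest. The function name, the input format and the default of 1000 resamples (far fewer than the 1,000,000 used in the study, to keep the example fast) are assumptions for illustration.

```python
import random
import statistics


def bootstrap_sd(has_orthologue, n_boot=1000, seed=0):
    """Estimate the standard deviation of the orthologue fraction by resampling.

    has_orthologue: list of booleans, one per protein in the group.
    n_boot:         number of bootstrap datasets to draw.
    """
    rng = random.Random(seed)
    n = len(has_orthologue)
    fractions = []
    for _ in range(n_boot):
        # Resample the group with replacement, keeping its original size.
        sample = rng.choices(has_orthologue, k=n)
        fractions.append(sum(sample) / n)
    return statistics.stdev(fractions)
```

For a group of 100 proteins of which 60 have an orthologue, the estimate should be close to the binomial value sqrt(0.6 × 0.4 / 100) ≈ 0.049.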
We would like to thank Dr Tom Walsh for technical expertise. This work was supported by a post-doctoral fellowship from the Caledonian Research Foundation to MSS. AIL is a Wellcome Trust Principal Research Fellow. The authors acknowledge funding from Wellcome Trust WT083481 and Wellcome Trust programme grant 073980/Z/03/Z.
- Sirri V, Urcuqui-Inchima S, Roussel P, Hernandez-Verdun D: Nucleolus: the fascinating nuclear body. Histochem Cell Biol. 2008, 129 (1): 13-31. 10.1007/s00418-007-0359-6.
- Boisvert FM, van Koningsbruggen S, Navascues J, Lamond AI: The multifunctional nucleolus. Nat Rev Mol Cell Biol. 2007, 8 (7): 574-585. 10.1038/nrm2184.
- Olson MO, Dundr M, Szebeni A: The nucleolus: an old factory with unexpected capabilities. Trends Cell Biol. 2000, 10 (5): 189-196. 10.1016/S0962-8924(00)01738-4.
- Olson MO, Hingorani K, Szebeni A: Conventional and nonconventional roles of the nucleolus. Int Rev Cytol. 2002, 219: 199-266.
- Pederson T: The plurifunctional nucleolus. Nucleic Acids Res. 1998, 26 (17): 3871-3876. 10.1093/nar/26.17.3871.
- Pederson T, Tsai RY: In search of nonribosomal nucleolar protein function and regulation. J Cell Biol. 2009, 184 (6): 771-776. 10.1083/jcb.200812014.
- Phair RD, Misteli T: High mobility of proteins in the mammalian cell nucleus. Nature. 2000, 404 (6778): 604-609. 10.1038/35007077.
- Pederson T: Growth factors in the nucleolus?. J Cell Biol. 1998, 143 (2): 279-281. 10.1083/jcb.143.2.279.
- Andersen JS, Lam YW, Leung AK, Ong SE, Lyon CE, Lamond AI, Mann M: Nucleolar proteome dynamics. Nature. 2005, 433 (7021): 77-83. 10.1038/nature03207.
- Andersen JS, Lyon CE, Fox AH, Leung AK, Lam YW, Steen H, Mann M, Lamond AI: Directed proteomic analysis of the human nucleolus. Curr Biol. 2002, 12 (1): 1-11. 10.1016/S0960-9822(01)00650-9.
- Scherl A, Coute Y, Deon C, Calle A, Kindbeiter K, Sanchez JC, Greco A, Hochstrasser D, Diaz JJ: Functional proteomic analysis of human nucleolus. Mol Biol Cell. 2002, 13 (11): 4100-4109. 10.1091/mbc.E02-05-0271.
- Boisvert FM, Lam YW, Lamont D, Lamond AI: A quantitative proteomics analysis of subcellular proteome localization and changes induced by DNA damage. Mol Cell Proteomics. 2010, 9 (3): 457-470. 10.1074/mcp.M900429-MCP200.
- Donnes P, Hoglund A: Predicting protein subcellular localization: past, present, and future. Genomics Proteomics Bioinformatics. 2004, 2 (4): 209-215.
- Cai YD, Chou KC: Predicting 22 protein localizations in budding yeast. Biochem Biophys Res Commun. 2004, 323 (2): 425-428. 10.1016/j.bbrc.2004.08.113.
- Chou KC, Cai YD: Predicting protein localization in budding yeast. Bioinformatics. 2005, 21 (7): 944-950. 10.1093/bioinformatics/bti104.
- Lee K, Chuang HY, Beyer A, Sung MK, Huh WK, Lee B, Ideker T: Protein networks markedly improve prediction of subcellular localization in multiple eukaryotic species. Nucleic Acids Res. 2008, 36 (20): e136. 10.1093/nar/gkn619.
- Tung TQ, Lee D: A method to improve protein subcellular localization prediction by integrating various biological data sources. BMC Bioinformatics. 2009, 10 (Suppl 1): S43. 10.1186/1471-2105-10-S1-S43.
- Huang WL, Tung CW, Huang HL, Ho SY: Predicting protein subnuclear localization using GO-amino-acid composition features. Biosystems. 2009, 98 (2): 73-79. 10.1016/j.biosystems.2009.06.007.
- Huang WL, Tung CW, Huang HL, Hwang SF, Ho SY: ProLoc: prediction of protein subnuclear localization using SVM with automatic selection from physicochemical composition features. Biosystems. 2007, 90 (2): 573-581. 10.1016/j.biosystems.2007.01.001.
- Lei Z, Dai Y: An SVM-based system for predicting protein subnuclear localizations. BMC Bioinformatics. 2005, 6: 291. 10.1186/1471-2105-6-291.
- Lei Z, Dai Y: Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction. BMC Bioinformatics. 2006, 7: 491. 10.1186/1471-2105-7-491.
- Mohamad N, Boden M: The proteins of intra-nuclear bodies: a data-driven analysis of sequence, interaction and expression. BMC Syst Biol. 2010, 4: 44. 10.1186/1752-0509-4-44.
- Hinsby AM, Kiemer L, Karlberg EO, Lage K, Fausboll A, Juncker AS, Andersen JS, Mann M, Brunak S: A wiring of the human nucleolus. Mol Cell. 2006, 22 (2): 285-295. 10.1016/j.molcel.2006.03.012.
- Staub E, Mackowiak S, Vingron M: An inventory of yeast proteins associated with nucleolar and ribosomal components. Genome Biol. 2006, 7 (10): R98. 10.1186/gb-2006-7-10-r98.
- Bodén M: Predicting nucleolar proteins using support-vector machines. Proceedings of the 6th Asia-Pacific Bioinformatics Conference - APBC 2008. Edited by: Brazma A, Miyano S, Akutsu T. 2008, Kyoto, Japan: Imperial College Press, 19-28.
- Boden M, Teasdale RD: Determining nucleolar association from sequence by leveraging protein-protein interactions. J Comput Biol. 2008, 15 (3): 291-304. 10.1089/cmb.2007.0163.
- Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004, 4 (7): 1985-1988. 10.1002/pmic.200300721.
- O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 2005, 33 (Database issue): D476-480.
- Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004, 428 (6983): 617-624. 10.1038/nature02424.
- The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010, 38 (Database issue): D142-148.
- Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A: Human Protein Reference Database--2009 update. Nucleic Acids Res. 2009, 37 (Database issue): D767-772. 10.1093/nar/gkn892.
- Kall L, Krogh A, Sonnhammer EL: Advantages of combined transmembrane topology and signal peptide prediction--the Phobius web server. Nucleic Acids Res. 2007, 35 (Web Server issue): W429-432. 10.1093/nar/gkm256.
- Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000, 300 (4): 1005-1016. 10.1006/jmbi.2000.3903.
- Scott MS, Boisvert FM, McDowall MD, Lamond AI, Barton GJ: Characterization and prediction of protein nucleolar localization sequences. Nucleic Acids Res. 2010, 38 (21): 7388-7399. 10.1093/nar/gkq653.
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101 (16): 6062-6067. 10.1073/pnas.0400782101.
- Barrell D, Dimmer E, Huntley RP, Binns D, O'Donovan C, Apweiler R: The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res. 2009, 37 (Database issue): D396-403. 10.1093/nar/gkn803.
- Scott MS, Barton GJ: Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics. 2007, 8: 239. 10.1186/1471-2105-8-239.
- McDowall MD, Scott MS, Barton GJ: PIPs: human protein-protein interaction prediction database. Nucleic Acids Res. 2009, 37 (Database issue): D651-656. 10.1093/nar/gkn870.
- Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J: The IntAct molecular interaction database in 2010. Nucleic Acids Res. 2010, 38 (Database issue): D525-531.
- Scott MS, Calafell SJ, Thomas DY, Hallett MT: Refining protein subcellular localization. PLoS Comput Biol. 2005, 1 (6): e66. 10.1371/journal.pcbi.0010066.
- Shin CJ, Wong S, Davis MJ, Ragan MA: Protein-protein interaction as a predictor of subcellular location. BMC Syst Biol. 2009, 3: 28. 10.1186/1752-0509-3-28.
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA: NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 2009, 37 (Database issue): D885-890. 10.1093/nar/gkn764.
- Tsukahara F, Maru Y: Identification of novel nuclear export and nuclear localization-related signals in human heat shock cognate protein 70. J Biol Chem. 2004, 279 (10): 8867-8872. 10.1074/jbc.M308848200.
- Welch WJ, Mizzen LA: Characterization of the thermotolerant cell. II. Effects on the intracellular distribution of heat-shock protein 70, intermediate filaments, and small nuclear ribonucleoprotein complexes. J Cell Biol. 1988, 106 (4): 1117-1130. 10.1083/jcb.106.4.1117.
- Da Costa L, Tchernia G, Gascard P, Lo A, Meerpohl J, Niemeyer C, Chasis JA, Fixler J, Mohandas N: Nucleolar localization of RPS19 protein in normal cells and mislocalization due to mutations in the nucleolar localization signals in 2 Diamond-Blackfan anemia patients: potential insights into pathophysiology. Blood. 2003, 101 (12): 5039-5045. 10.1182/blood-2002-12-3878.
- Barbe L, Lundberg E, Oksvold P, Stenius A, Lewin E, Bjorling E, Asplund A, Ponten F, Brismar H, Uhlen M, et al: Toward a confocal subcellular atlas of the human proteome. Mol Cell Proteomics. 2008, 7 (3): 499-508.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.