GRASP [Genomic Resource Access for Stoichioproteomics]: comparative explorations of the atomic content of 12 Drosophila proteomes
BMC Genomics volume 14, Article number: 599 (2013)
“Stoichioproteomics” relates the elemental composition of proteins and proteomes to variation in the physiological and ecological environment. To help harness and explore the wealth of hypotheses made possible under this framework, we introduce GRASP (http://www.graspdb.net), a public bioinformatic knowledgebase containing information on the frequencies of 20 amino acids and atomic composition of their side chains. GRASP integrates comparative protein composition data with annotation data from multiple public databases. Currently, GRASP includes information on proteins of 12 sequenced Drosophila (fruit fly) proteomes, which will be expanded to include increasingly diverse organisms over time. In this paper we illustrate the potential of GRASP for testing stoichioproteomic hypotheses by conducting an exploratory investigation into the composition of 12 Drosophila proteomes, testing the prediction that protein atomic content is associated with species ecology and with protein expression levels.
Elements varied predictably along multivariate axes. Species were broadly similar, with the D. willistoni proteome a clear outlier. As expected, individual protein atomic content within proteomes was influenced by protein function and amino acid biochemistry. Evolution in elemental composition across the phylogeny followed less predictable patterns, but was associated with broad ecological variation in diet. Using expression data available for D. melanogaster, we found evidence consistent with selection for efficient usage of elements within the proteome: as expected, nitrogen content was reduced in highly expressed proteins in most tissues, most strongly in the gut, where nutrients are assimilated, and least strongly in the germline.
The patterns identified here using GRASP provide a foundation on which to base future research into the evolution of atomic composition in Drosophila and other taxa.
Understanding the basis of biological diversity requires integration of ecological and evolutionary information. One exciting emerging picture is that ecological variation in the availability of key elements can have evolutionary consequences even at the primary protein sequence level [1–4], a perspective known as “stoichioproteomics” (reviewed in ). Indeed, various studies, both older and newer, have detected selection for efficiency of usage in key limiting elements in amino acid side chains, both in the sequences of individual proteins [1, 6] and across entire proteomes [7–10].
To begin to explore this potentially vast source of variation within and among species, it is necessary to have reliable and comparable sequence datasets for multiple taxa. This problem applies most strikingly in multicellular eukaryotes. Although several studies have explored stoichioproteomics in prokaryotes (e.g., [7, 8, 11–15]) and eukaryotes [2, 9–11], prokaryotic species have more often been the subject of comparative analysis than eukaryotes. Comparative analyses of molecular-scale variations in the elemental compositions of proteins among plant and animal species are currently very scarce (see e.g. [2, 9]). One reason for this may be that, in these taxa, such analyses are more difficult owing to the more complicated relationships between gene, transcript, and protein (for example, through alternative splicing), which blurs the definition of “homology” and makes meaningful comparisons among proteomes more difficult to achieve. In such species, answering even simple questions about atomic composition can quickly become a daunting task that requires merging several large datasets from different research groups using multiple sequence identity codes.
To begin to address this problem, we present GRASP (Genomic Resource Access for Stoichioproteomics, URL: http://www.graspdb.net), a public web resource focused on providing a centralized and standardized resource for analyzing the elemental composition of whole eukaryotic proteomes. GRASP is intended, first and foremost, to encourage and enable researchers to conduct their own comparative stoichioproteomic analyses. Second, it is intended to simplify and greatly facilitate these analyses for eukaryotes, by providing a common, standardized repository of protein-by-protein information with easy ways to search, match, extract, and analyse composition data from groups of homologous proteins and splice variants across multiple species with sequenced genomes. Third, we seek to facilitate testing of biological hypotheses by linking protein data to other publicly available sources of biological information using standard naming conventions. GRASP does not provide new data; rather, the advance GRASP represents is one of convenience and streamlining of analyses that would otherwise be laborious, in a manner analogous to repositories of biological data such as FishBase (http://www.fishbase.org), the Tree of Life (http://www.tolweb.org) and the Global Biodiversity Information Facility (http://www.gbif.org).
In its current form, GRASP includes information on the atomic composition of proteins of all twelve fully-sequenced species of Drosophila: D. ananassae, D. erecta, D. grimshawi, D. melanogaster, D. mojavensis, D. persimilis, D. pseudoobscura, D. sechellia, D. simulans, D. virilis, D. willistoni and D. yakuba. Information on multiple splice variants is currently available for D. melanogaster. In the future, we plan to expand the database to include a diversity of multicellular and unicellular eukaryotes.
Exploring Drosophila stoichioproteomics
Combined with an almost unparalleled understanding of the biology of Drosophila from many decades of intensive research (see  and references therein), these 12 sequenced genomes have already been used to make inferences about species relationships and speciation, patterns of genome organization, e.g. , the evolution and function of gene sequences, e.g., , and rates of evolution, e.g., . However, the potential of this clade for studying variation in atomic composition has yet to be investigated.
To illustrate the potential that GRASP represents for researchers interested in testing biological and ecological hypotheses using stoichioproteomic data, and the kinds of analyses that are facilitated by GRASP, we present here the first exploratory analysis and overview of proteomic variation in atomic composition among the 12 sequenced Drosophila species. We specifically illustrate the potential of this resource by conducting preliminary tests of two stoichioproteomic hypotheses.
Stoichioproteomics derives a number of specific hypotheses from a single core precept: that limitation in an element leads to purifying selection in order to reduce the usage of amino acids needing that element in protein sequences or their expression. Limitation could occur at any of several levels. At the long-term ecological level, limitation would result in predictable changes in protein stoichiometry among species (see ). Alternatively, at the short-term ecological level, limitation would result in predictable changes in protein stoichiometry among or even within individuals (see ). Limitation may operate even at the intracellular level, whereby temporary nutrient limitation within cells due to the demands of protein expression results in predictable changes in composition among proteins with predictable expression profiles such as nutrient assimilation proteins , differentially expressed protein variants , or according to transient expression profiles . We examined two sets of predictions arising from this general elemental limitation hypothesis: first, ecological differences should lead to predictable protein stoichiometry among species, and second, highly-expressed proteins should have sequences conservative in key nutrients, specifically N.
First, we collated data for the 12 Drosophila species on key ecological traits that have been linked to organismal stoichiometry, and tested for associations between these traits and the composition of proteins (and protein subsets) across species. These traits, along with the relevant predictions with respect to Drosophila proteomes, are outlined in Table 1.
Second, we combined the data in GRASP with a public database of protein expression (FlyAtlas, http://www.flyatlas.org) in different Drosophila tissues to test for a negative association between protein expression and N content. Using tissue-specific expression allowed us to assess the predicted relationship not only in a general context but also in tissues where it would be expected to apply either strongly or not at all. On the one hand, the insect midgut is the site of nutrient uptake and assimilation; enzymes that function to assimilate nutrients have evolved to contain less of the element they assimilate , leading to the prediction that we should observe a stronger relationship between expression and N content in the midgut. On the other hand, in the testes of D. melanogaster, evidence suggests that protein synthesis is greatly reduced, which should reduce the requirement for N conservation. Protein expression during spermatogenesis in Drosophila occurs in a unique way: transcription occurs only in early meiotic divisions, which peak during the pupal stage. Post-meiosis, there is almost no transcription in Drosophila spermatids; instead, protein synthesis is achieved by retention of mRNA transcripts for relatively long periods of time . Translation also appears to be reduced: 12 ribosomal proteins are down-regulated in adult testes while none is up-regulated . In a global expression study, transcription and translation proteins were not among those differentially expressed in testes, unlike in ovaries . Thus, in contrast to other tissues, an adult testis would have no particular requirement for N conservation in its proteins, because proteins are being synthesized at a much lower rate.
Results and discussion
In GRASP, core information on individual proteins is given as protein length, the percent composition of each amino acid, the number of atoms of each constituent element in each protein (excluding invariant backbone), and “elemental content”, defined (following ) as
$$\text{elemental content} = \frac{1}{L_{\text{protein}}} \sum_{i=1}^{20} w_i \, p_i$$
where $L_{\text{protein}}$ is the protein sequence length, $w_i$ is the number of atoms of a given element in the side chain of amino acid $i$, and $p_i$ is the frequency (number of occurrences) of amino acid $i$ in the protein. Elemental content is therefore expressed in atoms per residue.
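The definition above can be sketched in a few lines of Python. This is an illustrative sketch, not GRASP's own code: the function name `elemental_content` is hypothetical, the side-chain atom counts are the standard chemical values for the 20 amino acids, and p i is assumed to be the count of amino acid i in the sequence.

```python
from collections import Counter

# Side-chain atom counts (C, N, O, S) for the 20 standard amino acids,
# excluding the invariant backbone. Standard chemistry, not GRASP output.
SIDE_CHAIN_ATOMS = {
    "A": (1, 0, 0, 0), "R": (4, 3, 0, 0), "N": (2, 1, 1, 0), "D": (2, 0, 2, 0),
    "C": (1, 0, 0, 1), "Q": (3, 1, 1, 0), "E": (3, 0, 2, 0), "G": (0, 0, 0, 0),
    "H": (4, 2, 0, 0), "I": (4, 0, 0, 0), "L": (4, 0, 0, 0), "K": (4, 1, 0, 0),
    "M": (3, 0, 0, 1), "F": (7, 0, 0, 0), "P": (3, 0, 0, 0), "S": (1, 0, 1, 0),
    "T": (2, 0, 1, 0), "W": (9, 1, 0, 0), "Y": (7, 0, 1, 0), "V": (3, 0, 0, 0),
}
ELEMENT_INDEX = {"C": 0, "N": 1, "O": 2, "S": 3}


def elemental_content(sequence, element):
    """Mean number of side-chain atoms of `element` per residue.

    Hypothetical helper: sums w_i * p_i over amino acids, then divides
    by the sequence length. Raises KeyError on non-standard residues.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    idx = ELEMENT_INDEX[element]
    counts = Counter(seq)  # p_i: occurrences of each amino acid
    total = sum(SIDE_CHAIN_ATOMS[aa][idx] * n for aa, n in counts.items())
    return total / len(seq)  # divide by protein length


# Arginine (3 side-chain N) plus lysine (1 side-chain N) over 2 residues.
print(elemental_content("RK", "N"))
```

Because the result is normalized by length, values for proteins of different sizes can be compared directly, which is what the multivariate analyses rely on.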
The multivariate analyses we present here incorporate (1) elemental content, following previous authors, e.g. [1, 2, 7, 10], (2) DNA GC content, and (3) several basic properties of amino acid sequences (protein length, proportions of hydrophobic, polar, positive, negative, and aromatic residues). We restricted our analyses to the subset of proteins that have orthologs in all 12 available Drosophila species (n = 4934). Future authors may wish to base more detailed analyses upon individual amino acid contents and raw numbers of constituent elements, or on the composition of proteins lost and gained during the evolution of this clade.
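As a rough illustration of the other sequence variables listed above, the following Python sketch computes GC content from a coding sequence and residue-class proportions from a protein sequence. The function names are hypothetical, and the assignment of amino acids to classes is an assumption made for this example; the text does not state which residues were placed in each class.

```python
# Residue classes are an illustrative assumption; the article does not
# list which amino acids belong to each class.
RESIDUE_CLASSES = {
    "hydrophobic": set("AVLIMFW"),
    "polar": set("STNQ"),
    "positive": set("KR"),
    "negative": set("DE"),
    "aromatic": set("FWY"),
}


def gc_content(cds):
    """Percentage of G and C among the unambiguous bases of a DNA sequence."""
    bases = [b for b in cds.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("no A/C/G/T bases found")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def class_proportions(protein):
    """Proportion of residues falling in each (assumed) residue class."""
    seq = protein.upper()
    if not seq:
        raise ValueError("empty sequence")
    return {
        name: sum(aa in members for aa in seq) / len(seq)
        for name, members in RESIDUE_CLASSES.items()
    }


print(gc_content("ATGC"))
print(class_proportions("KRDEFA"))
```

Note that classes may overlap (here F and W count as both hydrophobic and aromatic), so the proportions need not sum to one.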
Principal component analysis revealed multiple patterns of co-linearity among the analysed sequence variables (Figure 1).
Both species and functions differed statistically (MANOVA, species: Pillai’s trace = 0.91, df = 27, p < 0.001; function: Pillai’s trace = 0.29, p < 0.001) but the interaction of species and function did not (Pillai’s trace = 0.02, p = 0.99), indicating that protein functional groups occupied similar regions of multivariate space relative to each other within each species’ proteome. There were relatively small differences among species with the exception of D. willistoni (Figure 2) but pronounced differences among protein categories (Figure 3).
Figure 1 shows pairwise plots of variable loadings for the first eight principal component axes, which collectively explained 89% of the variance. Most of the co-linearity stemmed from inherent properties of protein sequences. Aside from fundamental associations such as those between N and O content and charge density (described by PC1), and between C content and aromaticity (described by PC2), Figure 1b also shows, for example, that S content was negatively associated with protein length. Although this partly reflects the effect of a constant initial methionine residue, PC1 and PC2 loadings did not appreciably change after excluding this initial residue from all proteins (data not shown) so this may stem from the tendency of smaller proteins to be stabilized by disulphide links while longer proteins tend to have salt bridges . Also reflecting previous findings, DNA GC content was correlated negatively with protein C content [14, 15] and also with O content .
Most species showed only small differences on all PC axes (Figure 2). D. willistoni was an outlier in many cases, notably on PC2, PC3 and PC7, stemming from its proteome’s exceptionally high O content relative to the other species (median 0.496 atoms/residue) and its genome’s well-documented low GC content (median for our dataset 46.5%; see ). Although D. willistoni is not exceptional among eukaryotes in either its GC or O content, since it falls roughly centrally among the eukaryotes plotted in Vieira-Silva & Rocha's Figure 2 (in , p. 1935), it was a clear outlier within the clade studied here.
Overall, protein functional categories differed in elemental content and sequence properties largely in line with expectations from the biochemistry of each protein category (Figure 3). For example, transcription factors and nucleic acid binding proteins had very low values on PC1, indicating high N and O content, high charge density and hydrophilicity; nucleic acid binding proteins in particular had high N content, reflecting the requirement for positive charge associated with binding to negatively charged DNA. In contrast, receptors and transporters had high values of PC1 indicating very low N and O, low charge density, and high hydrophobicity, consistent with the high proportion of hydrophobic groups required to function within a plasma membrane.
Patterns in elemental content across the whole phylogeny
To investigate evolutionary patterns rather than standing variation, we first explored species divergence in the PC axes identified above by reconstructing ancestral states for each axis in the PCA. The most striking lineage-specific patterns were in D. willistoni and in the D. persimilis/D. pseudoobscura lineage (Figure 4). D. willistoni has undergone across-the-board increases in PC2 (i.e. increased C content and decreased GC content, percentage of polar residues, and protein length) and PC3 (i.e. increased O content and negative charge, and decreased GC content, N content and positive charge) and decreases in PC4 (i.e. increased S, reduced protein length) in all functional categories of proteins, arising from divergence since its last ancestor, while D. pseudoobscura and D. persimilis have both undergone equally wholesale, but slightly less substantial, decreases in PC2 and PC3 and increases in PC4. This suggests a proteome-wide, bottom-up evolutionary pressure leading to correlated changes in GC content of DNA and of C, O and N content across proteins, instead of an ecological or physiological explanation that might be seen more strongly in some proteins than in others.
Why we should observe this pattern remains an open question. Selection acting on DNA GC content seems likely to drive the observed difference: PC2, PC3 and PC4 all carry heavy loadings for GC content, a fundamental property of DNA, and, in the PCA, extant standing variation among species tended to fall along a common line roughly parallel with variation in GC content in all cases (Figure 2). While it is beyond the scope of this study to speculate on causal relationships between genomic GC content and protein properties, which are currently unclear (for discussion, see  with respect to O content and  with respect to C content), changes in GC content in D. willistoni have been shown to correlate with changes in amino acid transition rates . If evolutionary changes in GC content indirectly drive evolution in amino acid composition and in protein properties such as O content, this change may be sufficient to account for the observed differences in PC2, PC3 and PC4 between D. willistoni, D. pseudoobscura, D. persimilis and their congeners.
We then calculated phylogenetically independent contrasts in each variable for each ortholog set across the phylogeny to look at general patterns in the evolution of proteomic properties across the clade. We amalgamated all independent contrasts into a single dataset and used PCA to identify broad patterns of evolutionary variation (following ). The main axes of variation we identified are given in Figure 5. Here we term these “Evolutionary Principal Components” (EPC) purely to distinguish them from the axes identified for standing variation among proteomes in Figure 1. Broadly speaking, these ran (EPC1) from long, polar, high-O content proteins to aromatic, high-C proteins; (EPC2) from hydrophobic, high-GC proteins to positive, high-N proteins; (EPC3) from more polar proteins to more negative, more hydrophobic proteins; and (EPC4) from GC-poor, S-rich, aromatic proteins to GC-rich, positively charged proteins. Evolutionary changes in protein and DNA properties often mirrored patterns detected within proteomes. For example, O content and hydrophobicity were strongly opposed on both PC1 (Figure 1) and EPC1 (Figure 5). As another example, evolutionary divergence in both C and O content was negatively related to divergence in GC content on EPC3, which reflected our findings within proteomes (Figure 1) and supports previous findings among whole proteomes of other taxa [8, 14, 15].
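For readers unfamiliar with independent contrasts, the following minimal Python sketch applies the standard pruning procedure to a single trait on a toy three-species tree. The tree, trait values, and the function name `independent_contrasts` are hypothetical, and this is not the software used for the analyses reported here; it only illustrates how one standardized contrast is produced per internal node.

```python
import math


def independent_contrasts(node, trait):
    """Return (value, branch_length, contrasts) for `node`.

    A tip is (name, branch_length); an internal node is
    ((left, right), branch_length). `trait` maps tip names to values.
    """
    label, length = node
    if isinstance(label, str):  # tip: use the observed trait value
        return trait[label], length, []
    left, right = label
    x1, v1, c1 = independent_contrasts(left, trait)
    x2, v2, c2 = independent_contrasts(right, trait)
    # Difference between sister lineages, scaled by its expected variance.
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Ancestral estimate: average weighted by inverse branch lengths.
    value = (x1 / v1 + x2 / v2) / (1 / v1 + 1 / v2)
    # Lengthen the branch below this node to reflect estimation error.
    extended = length + v1 * v2 / (v1 + v2)
    return value, extended, c1 + c2 + [contrast]


# Hypothetical tree ((A:1, B:1):1, C:2) with made-up trait values.
inner = ((("A", 1.0), ("B", 1.0)), 1.0)
tree = ((inner, ("C", 2.0)), 0.0)
trait = {"A": 1.0, "B": 3.0, "C": 5.0}

root_value, _, contrasts = independent_contrasts(tree, trait)
print(contrasts)
```

A tree with n tips yields n - 1 contrasts per variable. The sign of each contrast depends on the arbitrary ordering of sister lineages, so the same ordering must be used for every variable before the contrasts are pooled for PCA.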
However, evolutionary patterns were sometimes different from standing variation within proteomes. Evolution in N content and % positive charge followed patterns different from those seen in static variation. Evolutionary changes in N and % positive charge were independent of O content and % negative charge both on EPC1 and (to a lesser extent) on EPC3 (Figure 5). In contrast, within proteomes, these two variables were positively related to O content and % negative charge on PC1 (collectively describing charge density) and negatively related on PC3 (collectively describing a positive–negative continuum; see Figure 1).
Testing hypotheses across the phylogeny
Statistics for all protein subsets are given in Table 2. Note that we use PC (not EPC) axes for these analyses, because the technique we used (PGLM) takes raw species values, not contrasts, as its input data (see Methods for details). We detected extensive changes in protein stoichiometry across all axes with intron percent; associations significant at the p < 0.01 level were seen in 651, 561, 493 and 539 out of 4934 proteins on PC1, 2, 3 and 4, respectively, roughly 10 to 13 times the approximately 49 expected by chance at this threshold. However, there was no detectable bias on any of PC1-4 towards positive or negative associations with intron percent (binomial tests on effect sizes, all NS); neither were positive or negative associations biased towards any protein category (χ2 tests, all NS). This indicates an overall lack of consistent association between intron percent and protein stoichiometry. Comparably high numbers of significant associations that also had no net positive or negative bias were also detected for ovariole number (162–394 proteins across all axes) and specific development time (161–393 proteins across all axes). Future investigators may wish to explore these associations in more detail, on a protein-by-protein basis.
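The arithmetic behind the chance expectation, and the kind of sign test used to check for directional bias, can be sketched as follows. The counts 4934, 651, 561, 493 and 539 come from the text; the split of 330 positive associations out of 651 is a made-up figure for illustration only, and the function `sign_test_p` is a hypothetical helper rather than the routine used in the analyses.

```python
from math import comb

N_PROTEINS = 4934
ALPHA = 0.01
observed = {"PC1": 651, "PC2": 561, "PC3": 493, "PC4": 539}

# Number of proteins expected to pass p < 0.01 by chance alone.
expected = N_PROTEINS * ALPHA
ratios = {axis: n / expected for axis, n in observed.items()}


def sign_test_p(n_positive, n_total):
    """Exact two-sided binomial test against a 50:50 split.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one, using exact integer arithmetic.
    """
    observed_weight = comb(n_total, n_positive)
    tail = sum(
        w for w in (comb(n_total, i) for i in range(n_total + 1))
        if w <= observed_weight
    )
    return tail / 2 ** n_total


print(expected, ratios)
# Hypothetical split: 330 positive and 321 negative associations.
print(sign_test_p(330, 651))
```

A large p-value from the sign test would correspond to the "NS" results reported above: many individual proteins show significant associations, but with no consistent direction.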
For diet breadth (significant associations in 302 and 229 proteins across PC1 and PC2, respectively), the significant positive and negative associations were distributed unequally among protein functions (χ2 tests, p < 0.05). Among this subset of proteins showing significant associations, nucleic acid binding proteins and transcription factors showed predominantly positive associations between diet breadth and PC1, while transporters showed negative associations. For PC2, nucleic acid binding molecules and transcription factors showed positive associations, whilst oxidoreductases, transferases and select regulatory molecules showed mostly negative associations. Rather than being a phylogeny-wide trend, though, these patterns were driven by D. sechellia, the most resource-specialized of all the flies represented here, in which this subset of proteins had diverged somewhat from the rest of the species (Figure 6). In D. sechellia, this subset of nucleic acid binding proteins and transcription factors had the highest PC1 (i.e. were most hydrophobic, with highest S and lowest N and O) and PC2 (i.e. the least polar, with the highest C), while its transporters had the lowest PC1 and its oxidoreductases, transferases and select regulatory molecules had the lowest PC2. Although these differences appeared to drive evolutionary correlations with diet breadth, from a stoichioproteomic perspective we might, a priori, have expected the protein composition of cactus-feeders such as D. mojavensis to have been most distinctive (see e.g. ), owing to the low nutritional quality of their food resources, but this was not the case. Furthermore, D. erecta is almost as strictly specialized as D. sechellia but did not have distinctive stoichiometry in these proteins (Figure 6). It may therefore be different, species-specific selection pressures in D. sechellia, such as detoxification of host substances, that have contributed to this divergence; patterns of selection in this subset of proteins may warrant further research.
Ecological selection pressures evident at the proteomic level have been detected previously using comparative analyses across whole kingdoms (see e.g. [2, 7, 11]); the relatively few substantial findings we report here may also reflect a relatively short divergence period (compared to divergence among kingdoms), or that differences in the ecologies of Drosophila are not substantial or consistent enough to generate the selection pressures we predicted – although major differences in body composition reflect those seen among the flies’ respective substrates , these differences may not ramify into the proteins. Given the scope of the proteomic datasets, our overview-style analysis was also necessarily very broad and coarse-grained. More detailed research into the atomic content of specific proteins or protein groups using GRASP may be better able to reveal effects of nutritional limitation upon protein atomic content among Drosophila species.
Protein expression levels in D. melanogaster
Highly expressed proteins (i.e. proteins that impose substantial nutrient demands upon a cell) should theoretically evolve to be nutrient poor [2, 6] and, conversely, nutrient-rich proteins should be down-regulated in times of low nutrient availability [13, 20]. To test this hypothesis, and to illustrate the ease with which the information in GRASP can be integrated with other publicly available resources, we asked how atomic composition, specifically N content, was related to protein expression (FlyAtlas, http://www.flyatlas.org) across different tissue types in D. melanogaster.
Bragg & Wagner  outlined two hypotheses to account for how nutrient conservation in highly expressed proteins might come about. First, relief of nutrient limitation might arise mainly from changes in expression, with nutrient-rich proteins down-regulated and nutrient-poor proteins up-regulated. This scenario predicts a proteome-wide negative correlation between expression levels and content of the limiting nutrient. Second, specifically up-regulated proteins may have evolved to be nutrient-poor, resulting in a negative expression-nutrient content relationship only in up-regulated proteins [3, 20].
Expression was bimodal in all tissues, the lower distribution corresponding to low- or rarely-expressed genes (see e.g. ). To test among the three alternatives (the two predictions outlined above, plus a null hypothesis of no negative association between nutrient content and expression), we conducted analyses for each tissue separately. Specifically, we conducted piecewise regression, allowing us to separate the low expression and high expression clusters at the most likely point (corresponding to a log2 abundance of 5.5; see Methods).
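A piecewise regression of this kind can be sketched with a hinge term at a fixed breakpoint. The sketch below assumes NumPy, uses synthetic data, and fixes the breakpoint at the log2 abundance of 5.5 quoted above rather than estimating it; the function name `piecewise_fit` and the continuous two-segment form are assumptions for illustration, not a description of the exact model fitted in the Methods.

```python
import numpy as np


def piecewise_fit(x, y, breakpoint=5.5):
    """Least-squares fit of a continuous two-segment line.

    Model: y = b0 + b1 * x + b2 * max(0, x - breakpoint), so the slope
    is b1 below the breakpoint and b1 + b2 above it.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    hinge = np.maximum(0.0, x - breakpoint)
    design = np.column_stack([np.ones_like(x), x, hinge])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    intercept, low_slope, delta = coef
    return {
        "intercept": float(intercept),
        "low_slope": float(low_slope),
        "high_slope": float(low_slope + delta),
    }


# Synthetic example: flat N content at low expression, then a decline of
# 0.01 N atoms per residue for each log2 unit above the breakpoint.
x = np.linspace(2.0, 12.0, 101)
y = 0.40 - 0.01 * np.maximum(0.0, x - 5.5)

fit = piecewise_fit(x, y)
print(fit)
```

Fitting each tissue separately with this form gives one slope for the low-expression cluster and one for the high-expression cluster, which are the two quantities compared across tissues.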
Results of piecewise regressions for N content are shown in Table 3 (for context, statistical data for all tissues and all response variables [PC axes and all elements] are given in Additional file 1: Table S1). In the low expression cluster, N content was weakly and inconsistently related to expression levels. By contrast, in the high-expression cluster, N content was steeply negatively related to expression level in all tissues but the testes, where this relationship was actually positive, and ovaries, in which the slope did not differ from zero (Figure 7, Table 3).
This indicates that, specifically in the highly expressed proteins of all tissues except the germline, increased expression was associated with conservation of N in protein sequences. In the testes, upregulated proteins were actually higher in N; this was the only tissue for which this was the case. The most steeply negative expression/N relationships were seen in the midguts of adults and larvae. In these tissues, doubling expression (i.e. increasing by one log2 unit) was associated with approx. 0.01 fewer N atoms per amino acid residue. The next-steepest relationships were also all in gut-related tissues (hindgut and Malpighian tubules; Table 3).
One clear interpretation of these patterns is that high levels of protein expression place a high demand for N upon somatic cells, creating a selection pressure for conservation of N in the most highly expressed proteins . Thus, our results support the hypothesis that specifically up-regulated proteins have evolved to be nutrient-poor [3, 20], in keeping with the idea that proteins evolve to reflect the material costs of their production [1, 4]. Among eukaryotes, this specific expression/N content relationship has so far only been identified in plants [e.g., 2, 9] and is weaker or absent in animals, possibly owing to relaxed selection for efficiency of N usage in heterotrophs . Proteins involved in nutrient assimilation show strong evolutionary conservation of the element they assimilate , so we would expect a priori to see the steepest relationships between N content and expression at the sites of N assimilation, such as the gut. Accordingly, midgut tissues, the main site of nutrient uptake, showed the steepest relationships of all tissues, followed by all other gut tissues in both larvae and adults (Table 3). In contrast, sites where protein synthesis is arrested or reduced, such as the testes, are not expected to show such a pattern. In contrast to the testes, the ovary grows during adult life , but we still found a relatively shallow relationship between PC1 and expression in ovaries, suggesting they may also be under reduced selection for N conservation. As a potential hypothesis for future study, conservation of N in eggs may impair offspring performance, constraining egg proteins to be nutritionally expensive. Consistent with this, dietary protein deficiency differentially affects female fertility rather than lifespan in Drosophila.
Brain and CNS tissues, while actively growing and differentiating in larvae and adults , also showed comparatively shallow N content/expression relationships (Table 3); we hypothesize that, because the CNS is highly charge-sensitive, the intrinsic correlation between N content and protein charge may reduce the scope for N conservation in nervous tissues. However, the apparently shallow expression/N content relationship in these tissues remains an open question.
Interestingly, Elser et al.  also used D. melanogaster as a reference model organism for heterotrophs, and used it as a baseline in the comparison with autotrophs. They found that N content in Drosophila followed a U-shaped curve that actually increased with expression intensity in the most highly expressed proteins (their Figure 1b). Examination of their figure reveals that this trend is influenced by two outliers (possibly ribosomal proteins, a group with unusually high N [10, 13]); excluding these two outliers, the remainder of the points in their figure agree with our data because the non-outlier data in  follow a weakly negative trend. This result accords with the authors’ main conclusion, because this negative trend is indeed shallower than in the plants they analysed, lending weight to the idea that N conservation is indeed relaxed in animals.
Comparing the extent of N conservation we observed in D. melanogaster with the results obtained by Elser et al. , our results indicate that it is important to consider tissue-specific expression levels. For example, in the tissues with the strongest N conservation, the larval and adult midgut, the most highly expressed proteins were approx. 0.05 N atoms poorer per residue than in the least highly expressed. By contrast, in the testes there was no such pattern. These results suggest that selection for nutrient conservation in proteins may be mediated by tissue-specific expression, a possibility that requires further research.
Of course, it is difficult to be entirely confident that stoichioproteomic patterns are not a result of systematic selection on biochemical properties of amino acids or of underlying DNA rather than elemental content per se. As an alternative hypothesis, the most highly expressed proteins may require a lower charge density to allow unbinding from the machinery of translation at a fast enough rate to maintain high expression, which would explain their lower N and O content, although this requirement would most likely be of much lower importance than requirements of protein function. Future authors may wish to make preliminary steps towards elucidating these two hypotheses by conducting analyses of protein composition and expression while controlling for charge density.
We have provided a mainly descriptive account of broad-scale variation in the atomic content of Drosophila proteins across the 12 fully sequenced Drosophila species, to which GRASP provides ready access, alongside preliminary tests of some core stoichioproteomic hypotheses. Further detailed research using GRASP will provide deeper insights into the evolution of atomic composition within and among species. Subsequent releases of GRASP will be augmented with similar information on other organisms across the phylogeny, as well as with additional information about other characteristics, including known developmental regulators, life span, feeding habits, and other ecological information, resulting in a powerful bioinformatics knowledgebase for the framing and testing of stoichioproteomic hypotheses.
We found that atomic content in Drosophila was at least partially a function of DNA GC content and amino acid biochemistry, and was also predictable based upon relative amounts of other constituent elements. On top of this, however, proteins carried signatures of conservation of limiting nutrients: N content was reduced in the most highly expressed proteins in most somatic tissues, but not in testes where nutrient conservation is unnecessary. However, the predictable patterns in elemental composition that we detected within proteomes were not plainly evident in broad-scale comparisons across species, indicating a potential role for lineage-specific evolutionary changes; this phylogenetic variation can provide a testing ground for future researchers wishing to use GRASP to look into the evolution of atomic composition.
Protein atomic content can be seen as a passive emergent property of selection acting on the phenotype via a protein’s structure, but may also be a source of selection pressure in itself, through its effect on organism nutrient demand. Here we have identified patterns in atomic content ranging from associations with basic properties of DNA to evolutionary associations with ecological species differences that may represent signatures of selection for nutrient conservation. We hope that the stoichioproteomic trends we have identified here will provide multiple working hypotheses for future research aiming to investigate them in detail using these 12 Drosophila species and beyond. GRASP will provide a convenient springboard for such studies.
GRASP is organized around a central interface whereby users select the species they wish to query and then the category of proteins whose data they wish to extract. A range of data is included on the website to enable direct tests of hypotheses, together with links to outside sources of information. We have added categorical data mapping to the Gene Ontology (http://www.geneontology.org) on protein family, biological process, molecular function, and pathway, derived from FlyBase (http://www.flybase.org), Panther (http://www.pantherdb.org), and Uniprot (http://www.pir.uniprot.org). GRASP also includes the amino acid sequence itself, along with its length, plus the underlying coding DNA sequence and information about its GC content, a property that directly affects the amino acid sequence. The current sequence data in GRASP are derived from FlyBase version FB_2007_3, October 2007. When a gene gives rise to more than one protein product, each protein product is indicated with a different suffix (e.g., PA, PB). Aggregations grouped by biological process, molecular function, protein pathway and family can be selected and output to the browser or via downloadable spreadsheets and comma-separated value lists for use in other statistical software. In addition, users can create their own aggregations of proteins derived from the selected species, automatically generating a downloadable spreadsheet of aggregated amino acid and elemental counts.
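As an illustration of the kind of quantities stored for each protein, the following minimal sketch counts side-chain C, N, O and S atoms from an amino acid sequence and computes the GC content of a coding sequence. It is not the GRASP implementation: the function names are hypothetical, hydrogen is ignored, and only the 20 standard amino acids are counted. The atom counts follow from the standard molecular formulae of the amino acids with the backbone atoms removed.

```python
# Side-chain atom counts (C, N, O, S) for the 20 standard amino acids.
# Backbone atoms are excluded; hydrogen is not tracked in this sketch.
SIDE_CHAIN_ATOMS = {
    "G": (0, 0, 0, 0), "A": (1, 0, 0, 0), "V": (3, 0, 0, 0),
    "L": (4, 0, 0, 0), "I": (4, 0, 0, 0), "P": (3, 0, 0, 0),
    "F": (7, 0, 0, 0), "W": (9, 1, 0, 0), "M": (3, 0, 0, 1),
    "C": (1, 0, 0, 1), "S": (1, 0, 1, 0), "T": (2, 0, 1, 0),
    "Y": (7, 0, 1, 0), "N": (2, 1, 1, 0), "Q": (3, 1, 1, 0),
    "D": (2, 0, 2, 0), "E": (3, 0, 2, 0), "K": (4, 1, 0, 0),
    "R": (4, 3, 0, 0), "H": (4, 2, 0, 0),
}


def side_chain_content(protein_seq):
    """Return (totals, per_residue) dicts of side-chain C, N, O, S counts."""
    totals = [0, 0, 0, 0]
    n_counted = 0
    for residue in protein_seq.upper():
        atoms = SIDE_CHAIN_ATOMS.get(residue)
        if atoms is None:  # skip stop symbols, X, gaps and other codes
            continue
        n_counted += 1
        for i, count in enumerate(atoms):
            totals[i] += count
    if n_counted:
        per_residue = [t / n_counted for t in totals]
    else:
        per_residue = [0.0] * 4
    return dict(zip("CNOS", totals)), dict(zip("CNOS", per_residue))


def gc_content(cds):
    """Fraction of G and C among the unambiguous bases of a coding sequence."""
    bases = [b for b in cds.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)
```

Dividing by the number of counted residues gives a per-residue content that can be compared among proteins of different lengths.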
Exploratory analysis of Drosophila proteomes
After downloading the information from GRASP, all data analyses were carried out in R 2.13.0 using various packages as cited below. Unless stated otherwise, only proteins with orthologs in all 12 species were analysed.
First, we used principal component analysis to characterize multivariate relationships between C, O, N and S content, DNA GC content, protein length, and the proportions of hydrophobic, polar, positive, negative and aromatic residues, respectively, in the entire dataset. We then asked whether proteins from the 12 different species, and different functional categories, occupied distinct regions in multivariate space using MANOVA with the first 8 principal components as a multivariate response.
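To make the first step concrete, the sketch below extracts the leading principal component of a set of standardized variables by power iteration on their correlation matrix. It is a simplified stand-in for a full PCA (the analyses here were run in R, and several axes were retained), written with the Python standard library only; the function names are hypothetical, and power iteration is assumed to start from a vector that is not orthogonal to the leading axis.

```python
import math


def standardize(columns):
    """Centre each variable and scale it to unit (sample) variance."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        out.append([(v - mean) / sd for v in col])
    return out


def correlation_matrix(columns):
    """Pairwise Pearson correlations between variables (one list each)."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z]
            for zi in z]


def first_principal_component(columns, iterations=500):
    """Return (loadings, eigenvalue, proportion of variance) for PC1."""
    corr = correlation_matrix(columns)
    p = len(corr)
    vec = [1.0 / math.sqrt(p)] * p  # assumed not orthogonal to PC1
    for _ in range(iterations):
        new = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    # Rayleigh quotient gives the eigenvalue of the converged unit vector.
    eigenvalue = sum(
        vec[i] * sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)
    )
    return vec, eigenvalue, eigenvalue / p
```

The loadings show which variables vary together along the axis; because the variables are standardized, the eigenvalue divided by the number of variables is the proportion of total variance that the axis explains.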
Species- and clade-specific divergence in elemental content
Because species share evolutionary history to different extents, comparisons among species must account for the way characters evolve (e.g. [39–41]). For the 12 Drosophila species under consideration, a well-supported phylogeny is known based on the whole genome, with reliable divergence estimates (Figure 8). Both the inferred phylogeny and protein atomic content are closely linked to the amino acid sequence, which may lead to circular inference. We assumed this would not bias our results, i.e. we assumed that forces affecting the atomic composition of proteins were independent of the forces affecting the sequence affinity of the entire genomes on which the phylogeny was based.
First, we looked at species divergence in the PC axes identified above by reconstructing the ancestral states for each protein across the phylogeny on each axis in the PCA using maximum likelihood reconstruction in the ace() function of the ape package in R. We used this information to calculate the estimated divergence for each protein in each species since its most recent common ancestor with a sibling species. Lineage- and clade-specific evolution of atomic content could therefore be isolated from patterns shared among species.
Evolutionary patterns in elemental content and ecology
We used the method of phylogenetically independent contrasts (PIC) to calculate independent contrasts in all considered variables. To look at multivariate evolutionary change we then calculated principal components in these contrasts. Evolutionary associations among the variables were assessed using the variable loadings of the principal axes.
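The contrast calculation can be sketched for a single trait as follows. This is a minimal illustration of Felsenstein's algorithm under Brownian motion on a rooted, fully bifurcating tree, not the R code used for the analyses; the nested-tuple tree encoding and the function name are assumptions made for the example.

```python
import math


def independent_contrasts(tree, tip_values):
    """Standardized contrasts for one trait on a rooted binary tree.

    A tip is a species name (str). An internal node is a pair
    ((left_subtree, left_branch_length), (right_subtree, right_branch_length)).
    Returns (contrasts, root_estimate); n tips yield n - 1 contrasts.
    """
    contrasts = []

    def visit(node):
        # Returns (trait estimate at node, extra length for the branch above).
        if isinstance(node, str):
            return tip_values[node], 0.0
        (left, v_left), (right, v_right) = node
        x_left, extra_left = visit(left)
        x_right, extra_right = visit(right)
        # Lengthen branches to reflect uncertainty in estimated node values.
        v_left += extra_left
        v_right += extra_right
        contrasts.append((x_left - x_right) / math.sqrt(v_left + v_right))
        # Node estimate is the branch-length-weighted mean of its descendants.
        estimate = (x_left / v_left + x_right / v_right) / (
            1 / v_left + 1 / v_right
        )
        return estimate, v_left * v_right / (v_left + v_right)

    root_estimate, _ = visit(tree)
    return contrasts, root_estimate
```

For 12 species this gives 11 contrasts per variable, and it is on such sets of contrasts that principal components can then be calculated.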
Testing hypotheses across the phylogeny
To test ecological and genomic hypotheses relating to stoichioproteomics (Table 1), we asked whether any of the species-level ecological or genomic traits listed in Table 1 was, on its own, systematically related to evolutionary patterns in atomic composition and protein properties. We used phylogenetic generalized least squares (PGLM), implemented in the CAIC package, to model phylogenetic changes in the principal component axes identified above for “standing variation” against changes in the trait of interest (i.e. for each ecological trait, 4934 analyses each of n = 12), asking whether fitted lines systematically departed from zero. Under a null hypothesis we would expect 1% of 4934, or 49, analyses to be significant at the 0.01 level; we used 200, approximately 4 times this number, as an arbitrary but conservative threshold for significance. Note that PGLM differs from the method of PIC which we used to calculate the EPC axes: where PIC calculates a new dataset of phylogenetically independent contrasts, PGLM instead uses raw species values as the response variable, and incorporates phylogenetic information into the error term of the model. Thus, we performed these analyses on the "standing variation" PC axes (rather than the EPC axes). For ecological variables that were frequently associated with protein composition (intron percent, ovariole number, specific development time and diet breadth) we asked whether associations were consistently positive or negative in particular protein categories using χ2 tests; for each variable, Table 1 outlines hypotheses relating to specific subsets of proteins that might be expected to show elemental conservation in their sequences.
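The arithmetic behind the threshold can be checked with a short sketch. The binomial tail below assumes, purely for illustration, that the 4934 per-protein tests are independent; proteins are unlikely to be fully independent, so the probability should be read as a rough guide rather than as part of the analysis. Function names are hypothetical.

```python
import math


def expected_false_positives(n_tests, alpha):
    """Expected number of significant tests when every null is true."""
    return n_tests * alpha


def binomial_upper_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p), summed in log space."""
    total = 0.0
    for k in range(k_min, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p)
        )
        total += math.exp(log_pmf)  # underflows harmlessly to 0.0 far out
    return total


n_tests, alpha, threshold = 4934, 0.01, 200
expected = expected_false_positives(n_tests, alpha)  # about 49.3
ratio = threshold / expected                         # a little over 4
tail = binomial_upper_tail(n_tests, alpha, threshold)
```

Under the independence assumption, observing 200 or more significant results by chance alone would be extremely improbable, which is the sense in which the threshold is conservative.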
Protein expression in D. melanogaster
Detailed information on protein expression in D. melanogaster has recently become available in the FlyAtlas database (http://www.flyatlas.org). We used FlyAtlas to analyse protein elemental content with respect to protein expression in various tissues of D. melanogaster (see Table 3 for tissues). Nutrient conservation in proteins is expected to appear as a negative relationship between protein nutrient content and protein expression level (see [2, 3, 13]). If N conservation is brought about by wholesale adjustment of expression levels on the basis of N content, we would expect to see such a negative relationship across all proteins. On the other hand, if proteins that are constrained to be highly expressed have evolved to be low in N, we should see this negative relationship only in highly-expressed proteins.
To test between these two hypotheses, we fitted piecewise regression models to the data for each tissue, breaking the bimodal distribution at a point corresponding to a log2 abundance of 5.5 (determined by comparing AIC values of piecewise regressions using different breakpoints; data not shown).
In tissues where N conservation is expected to be weak or non-existent, however, we would expect a negative relationship in neither down- nor up-regulated proteins. Thus, we predicted that the slope of any relationship between expression and N content would be shallower for the testes than for any other tissue.
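The piecewise approach can be sketched as two ordinary least-squares lines fitted either side of a breakpoint, with candidate breakpoints compared by AIC. This is a simplified illustration and makes several assumptions: the two segments are fitted separately and are not forced to join, AIC is computed from the pooled residual sum of squares under Gaussian errors (up to an additive constant), each segment needs at least two distinct expression values, and all function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (slope, intercept, rss)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, rss


def piecewise_fit(expression, n_content, breakpoint):
    """Fit separate lines below and at/above the breakpoint.

    Returns (slope_low, slope_high, aic).
    """
    pairs = list(zip(expression, n_content))
    low = [(x, y) for x, y in pairs if x < breakpoint]
    high = [(x, y) for x, y in pairs if x >= breakpoint]
    slope_low, _, rss_low = fit_line(*zip(*low))
    slope_high, _, rss_high = fit_line(*zip(*high))
    n = len(pairs)
    k = 5  # two slopes, two intercepts, one residual variance
    aic = n * math.log((rss_low + rss_high) / n) + 2 * k
    return slope_low, slope_high, aic


def best_breakpoint(expression, n_content, candidates):
    """Candidate breakpoint with the lowest AIC."""
    return min(
        candidates, key=lambda b: piecewise_fit(expression, n_content, b)[2]
    )
```

Fitting each tissue in turn and comparing the slope above the breakpoint would then show whether the relationship in highly expressed proteins is shallower in the testes than elsewhere.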
Baudouin-Cornu P, Surdin-Kerjan Y, Marlière P, Thomas D: Molecular evolution of protein atomic composition. Science. 2001, 293 (5528): 297-300. 10.1126/science.1061052.
Elser JJ, Fagan WF, Subramanian S, Kumar S: Signatures of ecological resource availability in the animal and plant proteomes. Mol Biol Evol. 2006, 23 (10): 1946-1951. 10.1093/molbev/msl068.
Bragg JG, Wagner A: Protein carbon content evolves in response to carbon availability and may influence the fate of duplicated genes. Proc R Soc B. 2007, 274: 1063-1070. 10.1098/rspb.2006.0290.
Bragg JG, Wagner A: Protein material costs: single atoms can make an evolutionary difference. Trends Genet. 2009, 25 (1): 5-8. 10.1016/j.tig.2008.10.007.
Elser JJ, Acquisti C, Kumar S: Stoichiogenomics: the evolutionary ecology of macromolecular elemental composition. Trends Ecol Evol. 2011, 26 (1): 38-44. 10.1016/j.tree.2010.10.006.
Mazel D, Marliere P: Adaptive eradication of methionine and cysteine from cyanobacterial light-harvesting proteins. Nature. 1989, 341 (6239): 245-248. 10.1038/341245a0.
Bragg JG, Thomas D, Baudouin-Cornu P: Variation among species in proteomic sulphur content is related to environmental conditions. Proc R Soc B. 2006, 273 (1591): 1293-1300. 10.1098/rspb.2005.3441.
Vieira-Silva S, Rocha EPC: An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol Biol Evol. 2008, 25 (9): 1931-1942. 10.1093/molbev/msn142.
Acquisti C, Elser JJ, Kumar S: Ecological nitrogen limitation shapes the DNA composition of plant genomes. Mol Biol Evol. 2009, 26: 953-956. 10.1093/molbev/msp038.
Acquisti C, Kumar S, Elser JJ: From elements to biological processes: signatures of nitrogen limitation in the elemental composition of the catabolic apparatus. Proc R Soc B. 2009, 276: 2605-2610. 10.1098/rspb.2008.1960.
Acquisti C, Kleffe J, Collins S: Oxygen content of transmembrane proteins over macroevolutionary time scales. Nature. 2007, 445: 47-52. 10.1038/nature05450.
Zeldovich KB, Berezovsky IN, Shakhnovich EI: Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput Biol. 2007, 3 (1): e5. 10.1371/journal.pcbi.0030005.
Gilbert JDJ, Fagan WF: Contrasting mechanisms of proteomic nitrogen thrift in Prochlorococcus. Mol Ecol. 2011, 20: 92-104. 10.1111/j.1365-294X.2010.04914.x.
Bragg JG, Hyder CL: Nitrogen versus carbon use in prokaryotic genomes and proteomes. Proc R Soc B. 2004, 271 (Suppl 5): PC374-PC377.
Baudouin-Cornu P, Schuerer K, Marlière P, Thomas D: Intimate evolution of proteins. Proteome atomic content correlates with genome base composition. J Biol Chem. 2004, 279 (7): 5421-5428.
Markow TA, O’Grady PM: Drosophila biology in the genomic age. Genetics. 2007, 177 (3): 1269-1276. 10.1534/genetics.107.074112.
Drosophila 12 Genomes Consortium: Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007, 450: 203-218. 10.1038/nature06341.
Stage DE, Eickbush TH: Sequence variation within the rRNA gene loci of 12 Drosophila species. Genome Res. 2007, 17 (12): 1888-1897. 10.1101/gr.6376807.
Larracuente AM, Sackton TB, Greenberg AJ, Wong A, Singh ND, Sturgill D, Zhang Y, Oliver B, Clark AG: Evolution of protein-coding genes in Drosophila. Trends Genet. 2008, 24 (3): 114-123. 10.1016/j.tig.2007.12.001.
Fauchon M, Lagniel G, Aude J-C, Lombardia L, Soularue P, Petat C, Marguerie G, Sentenac A, Werner M, Labarre J: Sulfur sparing in the yeast proteome in response to sulfur demand. Mol Cell. 2002, 9 (4): 713-723. 10.1016/S1097-2765(02)00500-2.
Elser JJ, Acharya K, Kyle M, Cotner J, Makino W, Markow T, Watts T, Hobbie S, Fagan W, Schade J, Hood J, Sterner RW: Growth rate–stoichiometry couplings in diverse biota. Ecol Lett. 2003, 6 (10): 936-943. 10.1046/j.1461-0248.2003.00518.x.
Lee KP, Simpson SJ, Clissold FJ, Brooks R, Ballard JWO, Taylor PW, Soran N, Raubenheimer D: Lifespan and reproduction in Drosophila: new insights from nutritional geometry. Proc Natl Acad Sci USA. 2008, 105 (7): 2498-2503. 10.1073/pnas.0710787105.
Sterner RW, Elser JJ: Ecological stoichiometry: the biology of elements from molecules to the biosphere. 2002, USA: Princeton University Press
Raubenheimer D, Simpson SJ: Integrative models of nutrient balancing: application to insects and vertebrates. Nutr Res Rev. 1997, 10: 151-179. 10.1079/NRR19970009.
Jaenike J, Markow TA: Comparative elemental stoichiometry of ecologically diverse Drosophila. Funct Ecol. 2003, 17 (1): 115-120. 10.1046/j.1365-2435.2003.00701.x.
Chintapalli VR, Wang J, Dow JA: Using FlyAtlas to identify better Drosophila melanogaster models of human disease. Nature Genet. 2007, 39 (6): 715-720. 10.1038/ng2049.
White-Cooper H: Studying how flies make sperm—investigating gene function in Drosophila testes. Mol Cell Endocrinol. 2009, 306 (1–2): 66-74.
Mikhaylova L, Nguyen K, Nurminsky DI: Analysis of the Drosophila melanogaster testes transcriptome reveals coordinate regulation of paralogous genes. Genetics. 2008, 179: 305-315. 10.1534/genetics.107.080267.
Parisi M, Nuttall R, Edwards P, Minor J, Naiman D, Lü J, Doctolero M, Vainer M, Chan C, Malley J, Eastman S, Oliver B: A survey of ovary-, testis-, and soma-biased gene expression in Drosophila melanogaster adults. Genome Biol. 2004, 5: R40. 10.1186/gb-2004-5-6-r40.
Bastolla U, Demetrius L: Stability constraints and protein evolution: the role of chain length, composition and disulfide bonds. Protein Eng Des Sel. 2005, 18 (9): 405-415. 10.1093/protein/gzi045.
Vicario S, Moriyama EN, Powell JR: Codon usage in twelve species of Drosophila. BMC Evol Biol. 2007, 7: 226. 10.1186/1471-2148-7-226.
Albu M, Min XJ, Golding GB, Hickey D: Nucleotide substitution bias within the genus Drosophila affects the pattern of proteome evolution. Genome Biol Evol. 2009, 1: 288-293.
Clobert J, Garland T, Barbault R: The evolution of demographic tactics in lizards: a test of some hypotheses concerning life history evolution. J Evol Biol. 1998, 11: 329-364.
Markow TA, Raphael B, Dobberfuhl D, Breitmeyer CM, Elser JJ, Pfeiler E: Elemental stoichiometry of Drosophila and their hosts. Funct Ecol. 1999, 13: 78-84. 10.1046/j.1365-2435.1999.00285.x.
Meiklejohn CD, Presgraves DC: Little evidence for demasculinization of the Drosophila X chromosome among genes expressed in the male germline. Genome Biol Evol. 2012, 10.1093/gbe/evs077.
Cooper KW: Normal spermatogenesis in Drosophila. Biology of Drosophila. Edited by: Demerec M. 1950, New York: Wiley, 1-56.
Truman JW, Bate M: Spatial and temporal patterns of neurogenesis in the central nervous system of Drosophila melanogaster. Dev Biol. 1988, 125: 145-157. 10.1016/0012-1606(88)90067-X.
R Development Core Team: R: A language and environment for statistical computing. 2011, Vienna, Austria: R Foundation for Statistical Computing, ISBN 3-900051-07-0. http://www.R-project.org/
Felsenstein J: Phylogenies and the comparative method. Am Nat. 1985, 125 (1): 1-15. 10.1086/284325.
Martins EP, Hansen TF: Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. Am Nat. 1997, 149 (4): 646-667. 10.1086/286013.
Pagel M: Inferring the historical patterns of biological evolution. Nature. 1999, 401 (6756): 877-884. 10.1038/44766.
Paradis E, Claude J, Strimmer K: APE: analyses of phylogenetics and evolution in R language. Bioinformatics. 2004, 20: 289-290. 10.1093/bioinformatics/btg412.
Freckleton RP: The seven deadly sins of comparative analysis. J Evol Biol. 2009, 22: 1367-1375. 10.1111/j.1420-9101.2009.01757.x.
This work was funded by NSF grant (DBI 0548366) to WFF, JJE and SK and NIH grant (HG002096-12) to SK. The authors would like to thank B. van Emden and R.R. Tyagi for technical assistance with GRASP and J. Bragg and F.S. Gilbert for useful discussions and comments on the manuscript.
The authors declare no competing interests, financial or otherwise.
JDJG carried out the analyses and drafted the manuscript. CA participated in the design and implementation of GRASP, in the analyses and in drafting the manuscript. HMM participated in the analyses and in drafting the manuscript. JJE, SK and WFF conceived of the study, implemented and currently maintain GRASP and provided comments on the manuscript. All authors read and approved the final manuscript.