Characterization of oligopeptide patterns in large protein sets
© Bresell and Persson. 2007
Received: 28 June 2007
Accepted: 01 October 2007
Published: 01 October 2007
Recent sequencing projects and the growth of sequence data banks enable oligopeptide patterns to be characterized on a genome or kingdom level. Several studies have focused on kingdom or habitat classifications based on the abundance of short peptide patterns. There have also been efforts at local structural prediction based on short sequence motifs. Oligopeptide patterns undoubtedly carry valuable information content. Therefore, it is important to characterize these informational peptide patterns to shed light on possible new applications and the pitfalls implicit in neglecting bias in peptide patterns.
We have studied four classes of pentapeptide patterns (designated POP, NEP, ORP and URP) in the kingdoms archaea, bacteria and eukaryotes. POP are highly abundant patterns statistically not expected to exist; NEP are patterns that do not exist but are statistically expected to; ORP are patterns unique to a kingdom; and URP are patterns excluded from a kingdom. We used two data sources: the de facto standard of protein knowledge Swiss-Prot, and a set of 386 completely sequenced genomes. For each class of peptides we looked at the 100 most extreme and found both known and unknown sequence features. Most of the known sequence motifs can be explained on the basis of the protein families from which they originate.
We find an inherent bias of certain oligopeptide patterns in naturally occurring proteins that cannot be explained solely on the basis of residue distribution in single proteins, kingdoms or databases. We see three predominant categories of patterns: (i) patterns widespread in a kingdom such as those originating from respiratory chain-associated proteins and translation machinery; (ii) proteins with structurally and/or functionally favored patterns, which have not yet been ascribed this role; (iii) multicopy species-specific retrotransposons, only found in the genome set. These categories will affect the accuracy of sequence pattern algorithms that rely mainly on amino acid residue usage. Methods presented in this paper may be used to discover targets for antibiotics, as we identify numerous examples of kingdom-specific antigens among our peptide classes. The methods may also be useful for detecting coding regions of genes.
Sequencing projects have been ongoing for decades and have made enormous amounts of sequence data available. This opens up possibilities for large-scale investigations of oligopeptide pattern frequencies, both in general and on a kingdom or genome level, by relying on statistically impressive amounts of data. For example, kingdoms can be classified on the basis of tripeptide pattern abundances using only the first two principal components, and the compositional signatures can be explained by habitats. However, at the level of relative amino acid composition, one can see a connection with growth temperature. In another study, the occurrence of oligopeptides of lengths three, four and five was investigated using the NCBI non-redundant sequence database, showing that many peptide patterns did not exist. Six non-existent pentapeptides were synthesized and expressed as parts of a soluble fusion protein in reasonably high yields, suggesting that oligopeptide patterns in proteins are selected on an evolutionary basis rather than by limitations in the biosynthetic pathway. It has also been shown that short amino acid residue patterns can be useful for predicting sequence features, e.g. secondary structure prediction using pentapeptides. Furthermore, efforts at local structure prediction have been made with sequence segments of length nine using profiles based on structurally aligned regions. Among the best-known initiatives is the Prosite pattern database, which has been used for many years in protein sequence annotation for assigning function and structure via regular expressions. Consequently, it is beyond doubt that short oligopeptide patterns carry information and that many patterns are either over- or under-represented.
Many common bioinformatic methods of today, e.g. BLAST, hidden Markov models, PSI-BLAST and Prosite scans, assume that the relative amino acid residue frequency is more or less the same for all larger protein data sets. However, if a database is biased towards a certain species, a kingdom or a set of protein families, then over- and under-represented oligopeptide patterns will cause the accuracy of the result to be overestimated, an effect for which an amino acid null frequency model will not account. Besides their use in diagnostics, kingdom-specific peptide patterns can be used to find antigens and targets for antibiotics. It might also be possible to find patterns with a high risk of acting as autoantigens in eukaryotes after viral or bacterial infections.
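To illustrate what a composition-only null model predicts, the following minimal Python sketch computes the expected number of occurrences of a pattern when residues are treated as independent draws from the pooled residue frequencies. The function name `expected_count` and the toy sequences are assumptions made for illustration; this is not the reference model used in this study, which relies on per-protein randomization.

```python
from collections import Counter
from math import prod

def expected_count(pattern, sequences):
    """Expected occurrences of `pattern` if every residue were an independent
    draw from the pooled residue frequencies of `sequences`."""
    residues = Counter("".join(sequences))
    total = sum(residues.values())
    # Number of windows of the pattern's length over all sequences.
    windows = sum(max(len(s) - len(pattern) + 1, 0) for s in sequences)
    # Probability of the pattern at one window under independence.
    probability = prod(residues[a] / total for a in pattern)
    return windows * probability

toy = ["ACDAC", "CCADA"]  # hypothetical toy sequences
print(expected_count("AC", toy))
```

A pattern whose observed count lies far above or below this expectation is over- or under-represented relative to residue usage alone, which is the kind of bias such a null model cannot capture.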
In this study, we have performed a large-scale investigation of all possible combinations of five amino acid residues, pentapeptides, in order to characterize oligopeptide patterns that are over- or under-represented in general or with respect to a kingdom. We find not only sequence patterns of known and frequently-used features but also patterns due to compositional bias. In addition, we find novel patterns which might be part of features not revealed by current bioinformatic methods, forming structural building blocks or segments selectively filtered because of unfavorable properties or immune response-induced epitopes.
We have searched protein databases for pentapeptide patterns that are over- or under-represented. On the one hand, we wanted to utilize as much sequence data as is presently available. For this, we collected all protein sequences from 386 completed genomes, in the following referred to as the genome set. On the other hand, we wanted well annotated data in order to get information about the proteins. For this, we utilized the Swiss-Prot part of UniprotKB [7, 8], hereafter referred to as Swiss-Prot. We decided to use both data sets, since they complement each other. The Swiss-Prot database has high quality, is very well annotated and constitutes the current de facto standard of protein knowledge. Swiss-Prot contains many well-characterized proteins and one may suspect a bias because it is easier to characterize proteins that are easily purified and/or homologous to proteins that are already well known. The genome sequence set, on the other hand, represents a more complete and unbiased distribution with respect to different types of proteins. However, many sequences in the genome set have been predicted by unsupervised automatic high-throughput algorithms and hence might be of lower quality than those in the Swiss-Prot data set. There might also be a bias towards organisms of medical and biotechnological interest. In addition, genomes might contain duplicated genes. Consequently, the two data sets have different properties, which motivates their combined use in this investigation.
Number of different observed oligopeptides
d = 5: theoretical (20^d) 3 200 000; observed 3 021 259, 3 136 980, 3 196 081, 3 199 490
d = 6: theoretical (20^d) 64 000 000; observed 25 025 493, 34 155 965, 52 989 609, 58 435 452
The oligopeptide sets of length four, five and six overlap in informational content because an oligopeptide set of length six is a partitioning of the set of length five, and that of length five is a partitioning of the set of length four. We limited our further studies to only one length, pentapeptides, which proved to be a good compromise between informational resolution and run times for the computer calculations.
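As an illustration of how the number of different observed oligopeptides of length d can be compared with the theoretical maximum 20^d, here is a minimal Python sketch. The function names and the decision to skip windows containing non-standard residue letters are assumptions, not details given in the text.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def observed_words(sequences, d):
    """Set of distinct words of length d over the 20 standard residues."""
    words = set()
    for seq in sequences:
        for i in range(len(seq) - d + 1):
            word = seq[i:i + d]
            # Assumption: windows with ambiguous letters (e.g. X) are ignored.
            if set(word) <= STANDARD_RESIDUES:
                words.add(word)
    return words

def coverage(sequences, d):
    """Fraction of the 20**d possible words that is actually observed."""
    return len(observed_words(sequences, d)) / 20 ** d

toy = ["ACDEFG", "CDEFGX"]  # hypothetical toy sequences
print(sorted(observed_words(toy, 5)), coverage(toy, 5))
```

Storing the words in a set is adequate for pentapeptides (at most 3.2 million entries); for longer words a more compact encoding may be preferable.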
In order to investigate various biological aspects of the nature of peptide patterns, we create four categories and focus on the 100 most extreme examples in each category. The categories are: (i) POP ("positively selected peptides"), which are the most abundant peptide patterns in observed data and are found not at all or only occasionally in randomized data; (ii) NEP ("negatively selected peptides"), which are those with extremely low abundance in available protein data but with high frequencies in randomized data; (iii) ORP ("over-represented peptides") are the most frequent kingdom-specific peptide patterns; and (iv) URP ("under-represented peptides") are those with extremely low abundance in a particular kingdom. POPs are expected to contain favored structural or functional motifs and might also belong to large protein families. They are expected in low numbers in view of their amino acid compositions but are in fact over-represented and must therefore result from positive selective pressure. ORPs are unique to a kingdom and might be used as diagnostic patterns. They will cause bias in databases that do not have equal portions of proteins from the three kingdoms. NEPs are expected to result from negative selective pressure and can be explained as structurally unfavored building blocks. URPs can be parts of epitopes that are inappropriate to the kingdom or avoided for other reasons and, as for the ORPs, this will lead to bias in protein databases.
The overall relative amino acid compositions for each kingdom in the two data sets are shown in Additional file 1, ordered by their average frequencies in the respective data set. The data follow the trends of previous studies, i.e. they do not contradict a habitat-amino acid usage correlation study and are consistent with kingdom classification via principal component analysis. Only small differences are observed between the Swiss-Prot and genome data sets.
Swiss-Prot sequence features matching peptide patterns
Feature (Number of peptide patterns)
NP_BIND 25, ACT_SITE 7, REPEAT 6, MOTIF 4, BINDING 4, ZN_FING 3, METAL 2
METAL 22, ZN_FING 15, NP_BIND 7, ACT_SITE 3, DISULFID 3, CARBOHYD 2, VAR_SEQ 2, LIPID 1, SIGNAL 1, MUTAGEN 1, REPEAT 1
METAL 6, ACT_SITE 5, MOTIF 2, NP_BIND 1
METAL 33, ZN_FING 18, DISULFID 7, BINDING 6, ACT_SITE 4, TRANSMEM 2, VAR_SEQ 2, STRAND 2, REPEAT 1, SE_CYS 1, SITE 1, TURN 1, HELIX 1
METAL 10, TRANSMEM 5, DNA_BIND 5, DISULFID 3, ZN_FING 2, BINDING 1, ACT_SITE 1
TRANSMEM 10, ZN_FING 6, DISULFID 6, REPEAT 5, VAR_SEQ 4, COMPBIAS 2, METAL 2, CARBOHYD 2, VARIANT 1, BINDING 1, PROPEP 1, CONFLICT 1
TRANSMEM 5, PEPTIDE 1
TRANSMEM 7, VAR_SEQ 1, COMPBIAS 1, COILED 1
DISULFID 12, VAR_SEQ 8, REPEAT 6, TRANSMEM 5, COMPBIAS 3, METAL 2, ZN_FING 2, VARIANT 1, COILED 1, STRAND 1
TRANSMEM 3, VAR_SEQ 2, COMPBIAS 1, NP_BIND 1
TRANSMEM 4, DISULFID 4, STRAND 3, ZN_FING 1, MOTIF 1, PROPEP 1
METAL 3, BINDING 2, ACT_SITE 1, DISULFID 1, MOTIF 1, TRANSMEM 1, HELIX 1
METAL 4, BINDING 3, NP_BIND 1
TRANSMEM 17, STRAND 4, HELIX 3, METAL 2, DNA_BIND 1, ACT_SITE 1, TURN 1, ZN_FING 1, DISULFID 1, CROSSLNK 1
METAL 10, ZN_FING 9, DNA_BIND 3, NP_BIND 1, TRANSMEM 1
ZN_FING 22, DISULFID 5, COMPBIAS 2, REPEAT 2, CARBOHYD 2, TRANSMEM 2, LIPID 1, DNA_BIND 1, HELIX 1, COILED 1, VAR_SEQ 1, PROPEP 1
COMPBIAS 7, METAL 4, VAR_SEQ 3, ZN_FING 2, REPEAT 1, COILED 1
ZN_FING 49, COMPBIAS 10, TRANSMEM 6, DISULFID 6, REPEAT 3, DNA_BIND 2, SIGNAL 2, ACT_SITE 2, NP_BIND 1, METAL 1, CARBOHYD 1, VAR_SEQ 1
ZN_FING 10, METAL 8, DNA_BIND 3, NP_BIND 1, TRANSMEM 1
ZN_FING 24, DISULFID 9, COMPBIAS 2, TRANSMEM 2, HELIX 2, CARBOHYD 2, REPEAT 1, SIGNAL 1, METAL 1, TURN 1, NP_BIND 1, COILED 1, PROPEP 1, LIPID 1
NP_BIND 2, METAL 2, BINDING 1, ACT_SITE 1
TRANSMEM 19, HELIX 4, STRAND 4, METAL 2, TURN 2, DNA_BIND 1, ACT_SITE 1, ZN_FING 1, CROSSLNK 1
The POP category contains peptide patterns that are expected not to exist but are in fact found in large numbers. These peptides have intrinsically favorable properties and have undergone positive selection, presumably containing structurally or functionally important sequences. The databases contain information about properties for only a few of the patterns; for the majority, functional assignments remain to be made. Here we summarize selected findings about the most interesting patterns among the 100 most frequent peptides for each kingdom. All these patterns were found at most twice in the randomized data sets but between 28 and 1648 times in the observed data sets, so they are statistically unexpected. The patterns can be divided into three groups: a) large protein families, b) peptides with unassigned functions, and c) integrases and transposases.
More than half of the archaeal Swiss-Prot POPs share the nucleotide phosphate-binding feature, with patterns that are also found in several thousand copies in other kingdoms in the Swiss-Prot data set. For the genome set we see other dominant characters, "zinc finger" and "metal". Only 13% of bacterial Swiss-Prot POPs are associated with a known feature, "metal" and "active site" being the most frequent, with examples from glutamine amidotransferases, the methionine import ATP-binding protein metN, the GTP-binding protein lepA and elongation factor G. The bacterial POPs from the genome set show many more feature associations (67%) than their Swiss-Prot counterparts, "metal" and "zinc finger" predominating. However, the most abundant peptide pattern WCGPC (599 occurrences), which is found in almost all bacterial species in the genome data set, has the features "disulfide" and "active site" and is found in thioredoxins. In the eukaryotic Swiss-Prot POP set, frequent peptide patterns originate from cytochrome b, homeobox associated proteins, and various sodium channels.
In the archaeal genome set, the third most abundant peptide pattern, CPVCG (258 occurrences), is not part of any known feature but is found in all 31 archaeal species in the genome set. The sequence is part of various biosynthesis-related proteins. Another pattern not yet associated with a feature and found in all archaeal species is GMDKM, which is part of the archaeal chaperonin thermosome and its homologues in the eukaryotic cytosol (CCT). Presumably, these patterns form structurally or functionally important motifs in the respective protein.
In the bacterial Swiss-Prot set, the peptides GMQFD (385 occurrences) and MQFDR (375) lack feature assignments and are found in the 60 kDa chaperonin. In the eukaryotic Swiss-Prot set, the eight most abundant patterns (all with more than 1300 occurrences) are not associated with any known sequence features. These eight peptides (AMHYT, WWNFG, WIWGG, HICRD, PWGQM/QMSFW/MSFWG and EWYFL) are all found in the known conserved regions Qo, Qi and the two haem binding segments in cytochrome b, which is vital in eukaryotes as a component of the respiratory chain bc1 complex in mitochondria. We conclude that the patterns must be structurally or functionally important, since they are heavily over-represented (>1300 versus ≤ 2).
Furthermore, among the eukaryotic Swiss-Prot POPs, four sets of overlapping peptide patterns, WTTVW/TVWTD, HVWHM/VWHMP/WHMPA, GHPWG/HPWGN and PFMRW/FMRWR/MRWRD, and the single peptide patterns WNIGI and HRAMH, found approximately 430 times each, lack known sequence features except for the last-named, which is annotated with "binding" and "active site". They are all found in ribulose bisphosphate carboxylase (RuBisCO), which catalyzes the first major step in carbon fixation by the Calvin cycle. Another unfeatured set of peptides is VYPWT/YPWTQ (403/377 occurrences), found in various haemoglobin subunits.
In the eukaryotic genome set, we find two highly abundant peptide patterns that are not parts of any known features: FHWCC (283 occurrences, 29 eukaryotic species) and WCCYV (207 occurrences, 25 species). The corresponding proteins belong to the Wnt signalling pathway and form a large family of cysteine-rich secreted glycoproteins controlling development in multicellular organisms.
In the bacterial genome set, the POP peptide NCWDN is found 218 times although expected only twice. This peptide pattern is found only once in Swiss-Prot, but the description lines of the proteins in the genome set that harbor this pattern show that half of them are integrases. Integrases are usually used by viruses (e.g. HIV) to integrate genetic material into the host DNA and have been suggested as therapeutic targets. Note, however, that no virus proteins are included in our genome set, and all these hits are in prokaryotic proteins. The remaining NCWDN-containing proteins are transposases, which are involved in the transfer of transposons within a genome. Hence, as the functions of these two protein families are very similar, the NCWDN peptide pattern might be directly involved in the integrating activity.
The two most abundant peptide patterns in the eukaryotic genome set, WWDHF (569 occurrences) and WCMRH (313), are not found in Swiss-Prot at all, and nearly all protein hits (except 6 and 14, respectively) are to a putative protein retrotransposon in Oryza sativa (rice). The peptide pattern YCKWH (203) also occurs mainly in this family. Transposable elements are abundant among the POP and ORP (cf. below) categories from the genome data set, but this high abundance usually originates from only one or a few species (see below for examples). Genome projects differ in whether repeats and transposable elements are included in the main release of protein predictions of a genome, so there is no systematic way to exclude these. Interestingly, the high copy numbers of transposable elements are believed to be important in rapid speciation. Hence, they might be of great evolutionary importance but will distort the distribution of native peptide patterns in studies such as this.
The most frequent peptide pattern in bacterial ORPs from the genome set, HYNWH (216 occurrences, 9 species), matches 205 copies of transposase from Bordetella pertussis Tohama I. Similarly, 184 of 190 occurrences of the second most frequent peptide pattern, IMTWM, come from transposase copies from only one species, in this case Mycobacterium ulcerans Agy99. However, the IMTWM pattern is also found in the transpeptidase region of a penicillin binding protein in the bacterial species Nostoc sp. PCC 7120 and Anabaena variabilis ATCC 29413, which may explain why it does not occur in eukaryotes. The pattern EFWCR (109 occurrences, 8 species) is part of a multicopy transposase protein in Yersinia pestis and Salmonella enterica.
Also in the genome bacterial URP set we find motifs originating from a retrotransposon protein family, in this case MCVDY (1094), MYCAE (435) and TMYCE (412). All these are significant and primarily found in only one species, rice.
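Whether a high count reflects a widespread motif or a multicopy element in a single species can be checked by tallying occurrences per species. The following Python sketch shows one way to do this; the function name, the `(species, sequence)` input layout and the toy data are assumptions for illustration.

```python
from collections import defaultdict

def pattern_spread(proteins, pattern):
    """Return (total occurrences, number of species, per-species counts)
    for `pattern` in an iterable of (species, sequence) pairs."""
    per_species = defaultdict(int)
    for species, seq in proteins:
        # Count overlapping occurrences by restarting one position later.
        start = seq.find(pattern)
        while start != -1:
            per_species[species] += 1
            start = seq.find(pattern, start + 1)
    total = sum(per_species.values())
    return total, len(per_species), dict(per_species)

# Hypothetical toy data: one species carries all copies of the pattern.
toy = [
    ("species 1", "AAHYNWHAA"),
    ("species 1", "HYNWHHYNWH"),
    ("species 2", "AAAAAAAAA"),
]
print(pattern_spread(toy, "HYNWH"))
```

A pattern with many occurrences but a species count of one or two is a candidate for the multicopy transposable-element category rather than a kingdom-wide motif.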
Categorization of peptide sets
Class: filtering criterion; top 100 by
POP-A: ≥ 10 in A, ≤ 2 in rand. A; freq. in orig. A
POP-B: ≥ 10 in B, ≤ 2 in rand. B; freq. in orig. B
POP-E: ≥ 10 in E, ≤ 2 in rand. E; freq. in orig. E
NEP-A: ≤ 2 in A, ≥ 10 in rand. A; freq. in rand. A
NEP-B: ≤ 2 in B, ≥ 10 in rand. B; freq. in rand. B
NEP-E: ≤ 2 in E, ≥ 10 in rand. E; freq. in rand. E
ORP-A: ≥ 10 in A, ≤ 2 in B+E; freq. in orig. A
ORP-B: ≥ 10 in B, ≤ 2 in A+E; freq. in orig. B
ORP-E: ≥ 10 in E, ≤ 2 in A+B; freq. in orig. E
URP-A: ≤ 2 in A, ≥ 10 in B+E; freq. in orig. B+E
URP-B: ≤ 2 in B, ≥ 10 in A+E; freq. in orig. A+E
URP-E: ≤ 2 in E, ≥ 10 in A+B; freq. in orig. A+B
A: archaea; B: bacteria; E: eukaryotes; rand.: randomized data; orig.: original data.
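The POP and NEP criteria (at least 10 occurrences in one data set, at most 2 in the other, then the 100 most frequent) can be sketched in Python as follows. The function names and the dictionary-of-counts input are assumptions; the counts for NCWDN and RCDLM follow figures quoted in the text, while AAAAA is a hypothetical pattern added for contrast.

```python
def select_pop(observed, randomized, top=100, min_hits=10, max_hits=2):
    """Patterns frequent in observed data but (nearly) absent in randomized data,
    ranked by observed frequency."""
    hits = [p for p, n in observed.items()
            if n >= min_hits and randomized.get(p, 0) <= max_hits]
    return sorted(hits, key=lambda p: observed[p], reverse=True)[:top]

def select_nep(observed, randomized, top=100, min_hits=10, max_hits=2):
    """Patterns frequent in randomized data but (nearly) absent in observed data,
    ranked by randomized frequency."""
    hits = [p for p, n in randomized.items()
            if n >= min_hits and observed.get(p, 0) <= max_hits]
    return sorted(hits, key=lambda p: randomized[p], reverse=True)[:top]

# Toy counts for a single kingdom and data set.
observed = {"NCWDN": 218, "RCDLM": 2, "AAAAA": 500}
randomized = {"NCWDN": 2, "RCDLM": 50, "AAAAA": 450}
print(select_pop(observed, randomized))
print(select_nep(observed, randomized))
```

The hypothetical AAAAA pattern is frequent in both data sets and therefore falls into neither class, which is the intended effect of filtering against the randomized reference.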
Only 6–16% of the archaeal NEPs have known features (Table 2). For peptides with feature associations, all except one (in the Swiss-Prot data set) are of the type "transmembrane". These featured peptide patterns are rich in the hydrophobic residues leucine, isoleucine and valine. Leucine and isoleucine are much more abundant in this peptide pattern class than in the archaeal Swiss-Prot proteins overall (Figure 3). These patterns do not exist in archaea, although archaea have more isoleucine and valine residues than eukaryotes and bacteria. One may suspect an inherent restriction on how leucine and isoleucine containing proteins are able to fold into working entities in archaea. However, when archaeal NEPs in the genome data set are examined, no extreme differences in amino acid residue contents are observed. It is possible that the much smaller sample space (shorter total length, see Figure 1) of archaeal sequences in comparison to eukaryotes and bacteria masks some part of the informational pattern. Hence, caution is needed in drawing conclusions about archaeal NEPs, as NEPs are very sensitive to the size of the data set during the data filtration.
The number of patterns that have cysteine-cysteine and proline-proline dipeptides in NEP
All: all proteins from the kingdom.
For the most expected NEP pattern, CSCCC (40 occurrences in randomized data) in Swiss-Prot, one may suspect that several consecutive cysteines are unfavorable. However, considering the difference in relative residue frequency for eukaryotic NEPs in Swiss-Prot compared to the overall distribution (Figure 3), no extremes are observed for any amino acid residues, which makes this cysteine-rich peptide pattern an exception in this peptide class. Disulfide bridges connect polypeptide chains or distant segments within the same chain and are known to depend on the spacing of cysteines in the linear sequence. Apparently, consecutive cysteines are statistically expected but have been negatively selected during evolution. Eukaryotic NEPs in the genome set contain many of the rare cysteine and tryptophan residues (Figure 3). Several of the NEPs in the genome set also have cysteine-cysteine dipeptides (Table 4). One of the most expected NEPs in the genome set (RCDLM, 50 occurrences in randomized data) is found twice among eukaryotes, in an unknown protein from the plant Arabidopsis thaliana and in a novel protein from the fish Takifugu rubripes. In contrast, it is found ten times in bacteria and one may speculate that the peptide pattern is part of an immunological triggering epitope, as for bacterial stress response proteins.
ORPs are peptide patterns that are found at least 10 times in one kingdom and at most twice in the union of the other two kingdoms. They are therefore to be considered unique to one kingdom. Analogously, URPs are peptides not found at all or in only low numbers in a kingdom. As the randomized data set is not used in the filtering step it is possible that the retrieved ORPs and URPs are just a result of compositional bias. To investigate this possibility, the patterns were tested for significance (cf. Methods section).
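The ORP and URP filtering step can be sketched in the same spirit. In this minimal Python example the function names, the `kingdom -> Counter` input layout and the toy counts are assumptions; the significance test mentioned above is not reproduced here.

```python
from collections import Counter

def _others(counts, kingdom):
    """Pooled pattern counts of all kingdoms except `kingdom`."""
    pooled = Counter()
    for name, kingdom_counts in counts.items():
        if name != kingdom:
            pooled.update(kingdom_counts)
    return pooled

def select_orp(counts, kingdom, top=100, min_hits=10, max_hits=2):
    """Patterns found >= min_hits times in `kingdom` and <= max_hits times elsewhere."""
    own, others = counts[kingdom], _others(counts, kingdom)
    hits = [p for p, n in own.items() if n >= min_hits and others[p] <= max_hits]
    return sorted(hits, key=lambda p: own[p], reverse=True)[:top]

def select_urp(counts, kingdom, top=100, min_hits=10, max_hits=2):
    """Patterns found <= max_hits times in `kingdom` and >= min_hits times elsewhere."""
    own, others = counts[kingdom], _others(counts, kingdom)
    hits = [p for p, n in others.items() if n >= min_hits and own.get(p, 0) <= max_hits]
    return sorted(hits, key=lambda p: others[p], reverse=True)[:top]

# Hypothetical toy counts for archaea (A), bacteria (B) and eukaryotes (E).
counts = {
    "A": Counter({"GMDKM": 31}),
    "B": Counter({"HYNWH": 216, "GWMHD": 110}),
    "E": Counter({"HHCPW": 535, "GMDKM": 40}),
}
print(select_orp(counts, "B"), select_urp(counts, "B"))
```

Because the randomized data play no role in this filter, a pattern selected here may still reflect compositional bias, which is why a separate significance test is needed.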
In the Swiss-Prot and genome archaeal ORP sets, only 54 and 6 peptide patterns, respectively, passed the filtering step. The small number of archaeal ORPs that passed the filtering step is largely due to the much shorter total sequence length in archaea compared to those in bacteria and eukaryotes (Figure 1). On the other hand, the resulting peptide patterns have passed a harder filtering criterion and might therefore be considered even more specific than those of bacteria and eukaryotes. Three of the six patterns in the genome set were overlapping (EMCCH/MCCHY/CCHYD) and found 18 times each, all in the same protein and in the same species, Methanospirillum hungatei JF-1.
Among archaeal Swiss-Prot URPs, no peptide patterns are biologically significant (p < 0.05) and all may result from the much smaller size of the archaeal section of Swiss-Prot. In the genome counterpart, seven peptide patterns are significant. There are numerous examples of peptide patterns found more than 10 000 times in eukaryotes or bacteria. All these have the zinc finger feature but only one is significant, the THTGE pattern with 13 209 occurrences. Other significant peptide patterns come from collagen-associated proteins and cadherin-associated proteins.
Two of the most abundant bacterial ORP patterns, FRCGF (268 occurrences) and FGFRC (245) in Swiss-Prot, have no feature association but occur in the GTP-binding protein lepA family, the function of which is unknown. The peptide patterns DWMEQ (265 occurrences) and YHDVD (235), together with the overlapping patterns GSYHD (200) and YHDVD (235) in the eukaryotic Swiss-Prot URP set, come from the conserved elongation factor G family, responsible for the accuracy of translation in the ribosome and preserved in all kingdoms. This protein has been suggested as a target for antibiotics, and the bacteria-specific patterns found here are therefore interesting sites for further investigation. Two other interesting peptides, MGAQM (234 occurrences) and MNPMD (210 occurrences), are parts of the 60 kDa chaperonin, a protein also found in the bacterial POP set. Like other bacterial stress response proteins, this protein family harbors human immune response activating antigens, which explains why these peptide patterns are not found in eukaryotes. The bacterial ORPs in the genome set are rich in tryptophan and have more feature associations than those of the Swiss-Prot set (29% versus 7%), the most common of which is "transmembrane" (Table 2). As in the Swiss-Prot data set, the translational machinery is also represented here, although in this case it is FCDWY (140 occurrences, 138 species), which is found in the bacterial form of valyl-tRNA synthetase. This pattern is also the most widespread of the eukaryotic URPs in the genome set.
About half the eukaryotic Swiss-Prot URPs are significant. Two examples of the most abundant peptide patterns in other kingdoms are YAEGY (270) and VMPQT (223), which are parts of serine hydroxymethyltransferase and translation initiation factor IF-2, respectively. Very few of the eukaryotic Swiss-Prot URPs are feature associated. Among the eukaryotic URPs in the genome set, 80 peptide patterns are found in significantly low numbers and are therefore expected to be missing for biological reasons. One of these patterns is GWMHD (110 occurrences), which is part of the 1,4-alpha-glucan branching enzyme responsible for the branched structure of glycogen. The enzyme is also found in animal cells, but this peptide pattern seems unique to the bacterial form, known to be different from the eukaryotic version. The GWMHD peptide pattern is widespread in the bacterial kingdom and is found in 100 of the 303 bacterial species in the genome data set. The peptide pattern QWAYA (133 occurrences, 37 bacterial species) is part of the UDP-N-acetylmuramate-L-alanine ligase, a protein involved in the biosynthesis of the peptidoglycan murein, which is an essential part of the bacterial cell wall. The enzymes involved in this process are interesting antibacterial drug targets as they are not found in eukaryotes.
The most common features are "metal" and "zinc finger". Zinc fingers are found in many forms but the peptide patterns in this class are primarily of human origin. Further ORPs originate from homeobox-associated proteins, hemoglobins, and the RuBisCO protein family (cf. POP above). Similarly, for the Swiss-Prot bacterial URPs, we notice cytochrome b and the RuBisCO family. The other features associated with eukaryotic ORPs in the genome set, e.g. "disulfide" and "coiled", generally originate from various protein families, indicating that they are independently occurring common patterns.
In the genome bacterial URP set, about half the URPs are significant. The most widespread peptide pattern among eukaryotic genomes and not found in bacteria is HHCPW (535 occurrences in eukaryotes, 48 species), which is part of the DHHC tetrapeptide sequence motif in a putative zinc finger of the palmitoyltransferase family.
The most extreme of the eukaryotic ORPs in the genome data set is ECKQC, which is found more than 10 000 times; however, these sequence hits are found in only 34 of the 52 eukaryotic species. WGCFD (379 occurrences, 41 species) is unfeatured and occurs in the dynein protein family, which transports cellular cargo along the microtubules in eukaryotic cells. As these patterns are all biologically significant and not the results of amino acid residue bias, one may suspect that they encode common folds or favorable motifs for eukaryotes.
The 24 classes of peptide patterns and their respective overlaps are outlined in Figure 4. The largest overlaps are found between ORPs in bacteria and URPs in eukaryotes and vice versa. These evidently have dual properties. ORPs from bacteria have a high (61–77%) overlap with URPs from eukaryotes. In the reciprocal case, URPs from bacteria and ORPs from eukaryotes are even more similar (78–89%).
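An overlap between two peptide classes can be expressed as the share of patterns in one class that also occur in the other. The Python sketch below uses the first class as the denominator; that choice, the function name and the toy class memberships are assumptions, since the text does not state how the percentages were normalized.

```python
def overlap_percent(class_a, class_b):
    """Percentage of the patterns in class_a that are also present in class_b."""
    a, b = set(class_a), set(class_b)
    if not a:
        return 0.0
    return 100.0 * len(a & b) / len(a)

# Hypothetical toy memberships using pattern names mentioned in the text.
orp_bacteria = ["HYNWH", "IMTWM", "FCDWY", "GWMHD"]
urp_eukaryotes = ["FCDWY", "GWMHD", "QWAYA"]
print(overlap_percent(orp_bacteria, urp_eukaryotes))
```

Since both classes are capped at 100 patterns, the measure is nearly symmetric in practice, but it differs when one class contains fewer than 100 patterns, as for the archaeal ORPs.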
Common patterns between ORPs and URPs in bacteria and eukaryotes
Most patterns are not associated with any known feature, but are probably part of an important biological entity unique to the respective kingdom, which has not been elucidated so far. Among the unfeatured ORP-E/URP-B, the four patterns LRLSC, RLSCA, GHPIS and RNLSH are described as maturase K-associated in the ORP-E section. The remaining peptide patterns, not described earlier, are all cytochrome b associated. In the genome data set many of the peptide patterns are observed in only a few organisms, and the common theme of these seems to be the retrotransposase family discussed earlier. The four peptides ECVWQ, CKQDV, PKYCI and SKFWY are found in 20 or more of the 52 eukaryotic genomes, but have no common descriptions; however, the many occurrences of the last two are due to hits in multiple proteins in only one organism, Trichomonas vaginalis and rice, respectively.
All numbers in this study are dependent on how many protein sequences have been discovered to date (2007). Some insight into the effect of time can be offered, as we have a similar data set (unpublished) from 2003. Swiss-Prot has increased by 33% in length and 40% in number of proteins, but the fractions of shared pentapeptide patterns (Figure 2) are still similar. Hence it seems that the growth of Swiss-Prot is fairly homogeneous. The genome set, however, has increased by 200% in length and 188% in number of proteins. It now includes 386 species compared to 137 in 2003. Interestingly, though, the total sequence fraction of bacteria is larger in 2007 than 2003. This stands even though we included only one strain per species in the current study, and completely-sequenced organisms with multiple strains are mostly bacterial. Another notable difference in the genome set is that the fraction of patterns unique to eukaryotes has decreased from 5% to 1% and that patterns that are found in all kingdoms have increased from 62% to 75%. Furthermore, the number of unobserved pentapeptide patterns has decreased by less than one percentage point, while the non-existent hexapeptide patterns have decreased by approximately 7 percentage points for Swiss-Prot and about 15 percentage points for the genome set. Hence, databank growth does affect oligopeptide patterns of length six and longer, but it seems that we have already reached saturation of the available patterns for oligopeptides of length five.
Methodology and ideas from this study may be important in further studies. An interesting application would be to construct a predictor for protein-coding sequences that is different from ab initio algorithms such as Genscan. One such effort has already provided complementary information on this subject. However, the training data in that study were limited to structural motifs of 471 proteins and Pfam alignments, the latter only accounting for 38% of the Swiss-Prot sequences. The informational content of short oligopeptides such as those in our study might possibly be used to distinguish features in truly expressed exons from those in translated introns and open reading frames that have been frame-shifted.
Although there are no obvious differences in amino acid residue preferences between the genome and Swiss-Prot sets, we see marked differences in pentapeptide characteristics. Almost all pentapeptide patterns exist, but there are sets of over- and under-represented patterns that are extreme in frequencies, even if compositional bias is considered. The abundances of many of the highly represented peptide patterns in this study can be explained on the basis of the protein families from which they originate. Notably, only a few protein families give rise to most of the over- and under-represented peptide patterns between kingdoms. These are mainly in three categories: (i) proteins widespread in a kingdom, such as respiratory chain-associated cytochromes and proteins associated with the translation machinery; (ii) patterns with unassigned functions, of special interest for understanding structural and functional mechanisms of proteins; and (iii) multicopy proteins such as retrotransposons, which usually carry a species-unique peptide pattern. Categories (i) and (ii) are found in both Swiss-Prot and the genome set while category (iii) is found only in the genome set. In our study we used only one set for each species, but for many of the completely-sequenced species there are multiple releases for several strains, suggesting that if included, category (iii) protein families will give rise to even more extreme numbers of occurrences. As sequence patterns are fundamental in many bioinformatics algorithms, this raises questions about the need to correct for over-represented peptide patterns such as those found in this study.
The UniProtKB/Swiss-Prot database (release 51.5, January 2007), which consists of 255 000 sequences, was downloaded from EBI [7, 8]. All proteins of viral origin (8000 proteins) were removed from the original release to make the Swiss-Prot data set comparable to the genome data set. The genome set was assembled from complete genomes downloaded from GenBank, TIGR and EnsEMBL [34, 35] (January 2007). For genomes with multiple strains, only the strain with the largest number of proteins was included, resulting in a set of 386 completely sequenced organisms representing 31 archaeal, 303 bacterial and 52 eukaryotic species, with a total of 2 million protein sequences. Details of the genomes included in this data set are given in Additional file 5.
For statistical comparisons, reference sets were generated from the genome and Swiss-Prot data sets by randomizing the original sequence data on a per-protein basis. Let Ω = {o_1, o_2, ..., o_j, ..., o_N} be a set of original protein sequences, where the protein sequence o_j consists of the letters o_j1 o_j2 ... o_jk ... taken from the amino acid residue alphabet A. From Ω we create a set of permutated protein sequences Π = {p_1, p_2, ..., p_j, ..., p_N}, where p_j contains all the letters of o_j but in arbitrary order, p_j = randomize(o_j). The function randomize is defined as:
import random

def randomize(o_j):
    """Return the residues of protein sequence o_j in random order."""
    rand_obj = random.Random()  # new randomization object with a new seed
    residues = list(o_j)  # working copy of the residues not yet used
    permutated_sequence = ""
    while residues:
        k = rand_obj.randrange(len(residues))  # random 0-based position
        permutated_sequence += residues.pop(k)  # append o_jk and remove it from o_j
    return permutated_sequence
That is, each randomized sequence was created by adding residues one at a time from a random position in the original sequence. A new residue was taken at each iteration until the original sequence was consumed.
This resulted in a randomized sequence of the same length and with the same amino acid residue composition as the original.
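As an illustration only, the construction of a reference set Π from Ω can be sketched as follows. This is a minimal sketch, not the script used in the study: it substitutes the standard-library `random.shuffle` for the draw-and-remove loop (both yield a uniformly random permutation), it uses one generator for the whole set rather than a new one per protein, and the names `shuffle_sequence` and `build_reference_set` as well as the toy sequences are assumptions introduced here.

```python
import random
from collections import Counter

def shuffle_sequence(seq, rng):
    """Return the residues of seq in random order (uniform permutation)."""
    residues = list(seq)
    rng.shuffle(residues)  # in-place Fisher-Yates shuffle
    return "".join(residues)

def build_reference_set(originals, seed=None):
    """Randomize every protein separately, preserving its composition."""
    rng = random.Random(seed)  # assumption: one generator for the whole set
    return [shuffle_sequence(seq, rng) for seq in originals]

# Toy sequences, not taken from either data set.
originals = ["MKTAYIAKQR", "GGHHEELLKK"]
reference = build_reference_set(originals, seed=1)

for original, permutated in zip(originals, reference):
    # Same length and same residue composition, as stated in the text.
    assert len(original) == len(permutated)
    assert Counter(original) == Counter(permutated)
```

Because each protein is shuffled on its own, the reference set keeps the per-protein residue composition while destroying any positional pattern.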
For further analysis only the pentapeptide sets were used. The choice of length five was a compromise between complexity of the sequence patterns and the informational content of the possible set of words of length d (e.g. setting d = 6, we observe only 39–91% of the possible words, Table 1).
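The trade-off in word length can be made concrete with a small counting sketch. With 20 residues there are 20^5 = 3 200 000 possible pentapeptides but 20^6 = 64 000 000 possible hexapeptides, so a data set of fixed size covers a much smaller fraction of the longer words. The function names `count_words` and `coverage`, and the decision to skip words containing non-standard residue codes, are assumptions made for this illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def count_words(sequences, d=5):
    """Count all overlapping words of length d in a list of sequences."""
    counts = Counter()
    allowed = set(AMINO_ACIDS)
    for seq in sequences:
        for i in range(len(seq) - d + 1):
            word = seq[i:i + d]
            # Assumption: words with ambiguous codes (X, B, Z, ...) are skipped.
            if set(word) <= allowed:
                counts[word] += 1
    return counts

def coverage(counts, d=5):
    """Fraction of the 20**d possible words that were observed at least once."""
    return len(counts) / len(AMINO_ACIDS) ** d

# Toy usage on made-up sequences.
counts = count_words(["MKTAYIAKQRMKTAY", "AAAAAAA"], d=5)
print(counts["MKTAY"], counts["AAAAA"], coverage(counts, d=5))
```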
The abundances of pentapeptides in the original and randomized sets were retrieved using a Linux cluster of Beowulf design with 32 nodes (1800+ AMD CPU, 512 MB RAM per node) and a 64-bit Linux system with 8 GB of RAM. The sets of POPs, ORPs, URPs and NEPs were generated by filtering the data according to the rules in Table 3 and then selecting the 100 top-ranked peptides. Within a set, the peptide sequences were compared by scoring ungapped pairwise alignments using an identity matrix. Multiple sequence alignments were then built by grouping the peptides with single-linkage hierarchical clustering, using a cut-off score of three or more.
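The clustering step can be sketched as below. This is an assumption-laden illustration rather than the original implementation: the ungapped alignment score is taken to be the largest number of identical residues over all relative offsets of the two peptides, and single linkage at a fixed cut-off is implemented as connected components of the graph that links every pair scoring at or above the cut-off. The function names are hypothetical.

```python
def ungapped_identity_score(a, b):
    """Best identity count over all ungapped offsets of b relative to a."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        score = sum(
            1
            for i in range(len(a))
            if 0 <= i - offset < len(b) and a[i] == b[i - offset]
        )
        best = max(best, score)
    return best

def single_linkage_clusters(peptides, cutoff=3):
    """Group peptides linked, directly or via a chain, by a score >= cutoff."""
    parent = list(range(len(peptides)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(peptides)):
        for j in range(i + 1, len(peptides)):
            if ungapped_identity_score(peptides[i], peptides[j]) >= cutoff:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, peptide in enumerate(peptides):
        clusters.setdefault(find(i), []).append(peptide)
    return list(clusters.values())

# Toy peptides: the first two overlap in four residues when shifted by one.
print(single_linkage_clusters(["HTGEK", "TGEKP", "WWWWW"]))
```

Single linkage means that two peptides can end up in the same group without scoring three or more against each other, as long as a chain of sufficiently similar peptides connects them.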
A null hypothesis was stated that occurrences(ω_i, Ω) is . A p-value for x observations of ω_i was calculated, and the null hypothesis was rejected for all ω_i with a p-value less than or equal to 0.05. Patterns for which the null hypothesis was rejected were considered to be biologically significant. That is, all POPs and ORPs that satisfy
P(x ≥ occurrences(ω_i, Ω) | ) ≤ 0.05
and all NEPs and URPs that satisfy
P(x ≤ occurrences(ω_i, Ω) | ) ≤ 0.05
Note that the estimation of the null distribution of a pentapeptide is based on at most 120 samples (the 5! = 120 orderings of five residues), and for patterns with fewer than five different residues this number is even lower. For homopeptides (which are found only in a few cases in our peptide categories), no permutations are possible, hence the null distribution is based on only one sample. Therefore, the p-value should be regarded as a guide for excluding statistically expected patterns rather than as an accurate calculation of a probability.
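The effect of the small sample size can be illustrated with a sketch. The exact form of the null distribution is not reproduced here, so the code below assumes a purely empirical one: the null sample is taken to be the occurrence counts of all distinct permutations of the pentapeptide (the pattern itself included), and the p-values are the fractions of that sample at least as large, or at least as small, as the observed count. The function names and the `counts` dictionary are hypothetical.

```python
from itertools import permutations

def distinct_permutations(word):
    """All distinct orderings of the residues in word."""
    return sorted({"".join(p) for p in permutations(word)})

def empirical_p_values(word, counts):
    """Return (P(x >= observed), P(x <= observed)) under the assumed null.

    counts maps each peptide to its number of occurrences in the data set.
    """
    null_sample = [counts.get(w, 0) for w in distinct_permutations(word)]
    observed = counts.get(word, 0)
    n = len(null_sample)
    p_upper = sum(c >= observed for c in null_sample) / n  # over-representation
    p_lower = sum(c <= observed for c in null_sample) / n  # under-representation
    return p_upper, p_lower

# Number of samples available for the null distribution.
print(len(distinct_permutations("ACDEF")))  # five different residues
print(len(distinct_permutations("AACDE")))  # one repeated residue
print(len(distinct_permutations("AAAAA")))  # homopeptide
```

Under this assumption the smallest attainable p-value for a pattern with five different residues is 1/120, and for a homopeptide both p-values are always 1, which is why the p-value serves only as a rough guide.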
To determine whether a peptide is part of any novel or known feature, a scan against all Swiss-Prot entries (release 51.5) was performed. Every hit in the FT field was recorded, and those features that covered at least one fifth of the sequence hits are listed in Additional files 3 and 4. Ambiguous features such as "chain", "domain", "topological domain" and "region" were discarded from further analysis. A Python script that extracts FASTA headers was used to retrieve information about the proteins in which a certain peptide pattern was observed.
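The one-fifth filter can be sketched as follows, assuming that "covered" means the feature is annotated at the peptide position in at least 20% of the hits. The input format (one collection of feature names per sequence hit), the lowercase feature names and the function name `frequent_features` are assumptions for this illustration; parsing of the FT lines themselves is not shown.

```python
from collections import Counter

# Feature types named in the text as too ambiguous to be informative.
AMBIGUOUS = {"chain", "domain", "topological domain", "region"}

def frequent_features(hits, min_fraction=0.2):
    """Return {feature: fraction of hits} for features in >= min_fraction of hits.

    hits is a list with one entry per sequence hit; each entry holds the
    feature names that overlap the peptide in that protein.
    """
    if not hits:
        return {}
    tally = Counter()
    for features in hits:
        tally.update(set(features) - AMBIGUOUS)  # count each feature once per hit
    return {
        feature: n / len(hits)
        for feature, n in tally.items()
        if n / len(hits) >= min_fraction
    }

# Toy hits for one peptide; the annotations are invented.
hits = [
    {"chain", "zinc finger"},
    {"zinc finger"},
    {"chain"},
    {"helix"},
    {"chain"},
    {"chain"},
]
print(frequent_features(hits))
```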
NEP: negatively selected peptides
ORP: over-represented peptides in a kingdom
POP: positively selected peptides
RuBisCO: ribulose bisphosphate carboxylase
URP: under-represented peptides in a kingdom
Carl Trygger Foundation and Linköping University are gratefully acknowledged for financial support. We thank Jan-Ove Järrhed for computer support and Roland Nilsson for valuable discussions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.