Materials and Methods
• Proteomes: SWISS-PROT/TrEMBL non-redundant complete proteome sets of A. thaliana, C. elegans, D. melanogaster, S. cerevisiae, and S. pombe were downloaded from EBI on April 25, 2003; human IPI 2.18, mouse IPI 1.11, and rat IPI 1.1 were also obtained from EBI. Zebrafish sequences were from NCBI RefSeq; Fugu and Ciona intestinalis sequences were from JGI.
• Domain or Motif Databases/Algorithms: InterPro database version 6.2 was downloaded from EBI; CDD database version 1.62 and BLAST package 2.2.6 from NCBI; SUPERFAMILY 1.61 PSSMs from MRC-LMB; FingerPRINTScan 3.595 and PRINTS 35.0 from UMBER; PROSITE database 17.26 and the search tool pfscan 1.5 from ExPASy; HMMER 2.3 from WUSTL; SignalP V2.0.b2 from CBS. TMHMM 2.0 was run through the CBS web service.
• Software and Programming Languages: Perl 5.8.0 from http://Perl.com; PHP 4.3.2 from http://php.net; MySQL 4.0.12 from http://mysql.com; Apache 1.3.26 from http://Apache.org; GD graphics library 1.8.0 from http://Boutell.com.
All of the algorithms were run with default parameters. The raw results were then parsed with Perl scripts, and the data were stored in a MySQL database.
Apache and PHP were used to set up the web server and write the web pages. The graphical display in PCAS is generated by calling the GD graphics library.
Annotation Pipeline
PCAS (ProteinCentric Annotation System) includes most of the motif or domain databases and analysis methods mentioned above. In addition, we included the SCOP [21] based SUPERFAMILY [22] classification and the COG [23] classification. SUPERFAMILY analysis offers structure-based protein annotation; COG, short for Clusters of Orthologous Groups of proteins, is a system delineated by comparing protein sequences encoded in over forty complete genomes that represent 30 major phylogenetic lineages. Assigning proteins to COGs can provide not only valuable evolutionary inference but also functional information derived from evolutionary analysis. We also included SignalP [24] and TMHMM [25] to predict leader (signal) peptides and transmembrane (TM) regions, which are indicative of a protein's function.
The following algorithms, motif or domain databases, and application software are included in the current PCAS:
hmmpfam V2.2g search of Pfam 8.0, SMART 3.4, and TIGRFAMs 2.1 HMMs.
RPS-BLAST V2.2.6 of CDD 1.62 PSSMs: COG, Pfam, SMART, LOAD, and NCBI-curated CDs.
pfscan V1.5 of PROSITE 17.26 patterns and profiles.
PSI-BLAST V2.2.6 of SCOP-based SUPERFAMILY 1.61 PSSMs.
FingerPRINTScan V3.595 of PRINTS 35.0 fingerprints.
SignalP V2.0.b2 for signal peptide prediction.
TMHMM 2.0 for TM prediction.
Cross-reference Motif or Domain Database
We used the InterPro system, which integrates most of the available motif or domain databases, as the basis for data integration. Certain subsets of CDD and InterPro overlap; therefore, through CDD, we can also map COG and SUPERFAMILY to InterPro.
Mapping COGs to InterPro
Using the embedded relations between different CDD subsets, we can map COGs to Pfam and SMART and then to InterPro. For example, in the CDD database, COG0004 is related to Pfam00909, which is a signature of IPR001905; COG0004 is therefore mapped to IPR001905. Among the 4873 COG PSSMs in the current CDD release 1.62, 2285 COGs were related to Pfam or SMART and were mapped to the InterPro database. Some of the unmapped COGs may be associated with more than one InterPro entry, and the rest are simply not yet included in InterPro 6.1.
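The transitive lookup described above can be sketched as follows. This is a minimal illustration, not the PCAS implementation: the function name, the dictionary layout, and every identifier other than the COG0004, Pfam00909, IPR001905 example are hypothetical, and leaving a COG unmapped when it resolves to several InterPro entries is an assumption drawn from the remark about unmapped COGs.

```python
from typing import Dict, List


def map_cogs_to_interpro(
    cog_to_models: Dict[str, List[str]],
    model_to_interpro: Dict[str, str],
) -> Dict[str, str]:
    """Map each COG to an InterPro entry via its related Pfam/SMART models.

    A COG is mapped only when its related models resolve to exactly one
    InterPro entry; COGs with zero or several candidates stay unmapped
    (an assumption made for this sketch).
    """
    mapped = {}
    for cog, models in cog_to_models.items():
        entries = {model_to_interpro[m] for m in models if m in model_to_interpro}
        if len(entries) == 1:
            mapped[cog] = entries.pop()
    return mapped


# COG0004 -> Pfam00909 -> IPR001905 is the example from the text;
# every other identifier below is a hypothetical placeholder.
cog_to_models = {
    "COG0004": ["pfam00909"],
    "COG_X": ["pfam_a", "smart_b"],  # hypothetical: resolves to two entries
    "COG_Y": [],                     # hypothetical: no Pfam/SMART relation
}
model_to_interpro = {
    "pfam00909": "IPR001905",
    "pfam_a": "IPR_A",
    "smart_b": "IPR_B",
}

cog_mapping = map_cogs_to_interpro(cog_to_models, model_to_interpro)
print(cog_mapping)
```

Only COG0004 ends up in cog_mapping here; the two placeholder COGs remain unmapped for the two reasons given in the text.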
Mapping SUPERFAMILY to InterPro
There are 7550 HMMs representing 1109 superfamilies. We first mapped these HMMs to CDD entries by running hmmpfam searches against the CDD representative sequences. Then, using the embedded cross-references between CDD and InterPro, we mapped the SUPERFAMILY HMMs to InterPro entries and, from these, mapped the superfamilies themselves to InterPro. The detailed mapping strategy is as follows:
Step 1: apply an E-value cutoff of 0.01.
The 202 SUPERFAMILY HMMs that had no hits and the 464 HMMs whose hits all had E-values greater than 0.01 were discarded.
Step 2: filter out HMMs whose CDD-to-InterPro link is broken.
90 HMMs were filtered out because their CDD hits have no corresponding InterPro entries.
The remaining 6994 (7550-202-464-90) HMMs were divided into 3548 1:1 relations (one SUPERFAMILY model corresponds to one InterPro entry) and 3446 1:N relations (one SUPERFAMILY model corresponds to more than one InterPro entry).
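Steps 1 and 2 and the split into 1:1 and 1:N relations can be sketched as a single classification pass. This is an illustrative sketch under stated assumptions: hits are taken to be (model, CDD id, E-value) tuples already parsed from the search output, and the function name, category labels, and all identifiers in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

E_VALUE_CUTOFF = 0.01  # cutoff stated in Step 1


def classify_models(
    all_models: Iterable[str],
    hits: List[Tuple[str, str, float]],
    cdd_to_interpro: Dict[str, str],
) -> Dict[str, str]:
    """Label each SUPERFAMILY model by how far it gets through Steps 1 and 2."""
    models_with_hits = {model for model, _, _ in hits}
    significant = defaultdict(set)
    for model, cdd_id, e_value in hits:
        if e_value <= E_VALUE_CUTOFF:
            significant[model].add(cdd_id)

    labels = {}
    for model in all_models:
        if model not in models_with_hits:
            labels[model] = "no_hit"            # discarded in Step 1
        elif model not in significant:
            labels[model] = "weak_hits_only"    # discarded in Step 1
        else:
            entries = {
                cdd_to_interpro[c] for c in significant[model] if c in cdd_to_interpro
            }
            if not entries:
                labels[model] = "no_interpro_link"  # discarded in Step 2
            elif len(entries) == 1:
                labels[model] = "1:1"
            else:
                labels[model] = "1:N"
    return labels


# Hypothetical toy data: five models, one per outcome.
all_models = ["sf1", "sf2", "sf3", "sf4", "sf5"]
hits = [
    ("sf2", "cddA", 0.5),
    ("sf3", "cddB", 1e-5),
    ("sf4", "cddC", 1e-10),
    ("sf5", "cddC", 1e-8),
    ("sf5", "cddD", 0.001),
]
cdd_to_interpro = {"cddC": "IPR_1", "cddD": "IPR_2"}

labels = classify_models(all_models, hits, cdd_to_interpro)
print(labels)
```

Counting the labels over the real model set would give the per-step tallies reported in the text.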
Step 3: apply coverage filters.
For 1:1 relations, we applied the following coverage filters:
0.5 <= sf_length / cdd_length <= 2;
(The length ratio of SUPERFAMILY HMM and CDD representative sequence)
sf_coverage >= 0.75;
(The length coverage of the aligned query SUPERFAMILY HMM over the full length of the query SUPERFAMILY HMM.)
cdd_coverage >= 0.75.
(The length coverage of the aligned CDD representative sequence over the full length of the CDD representative sequence)
In total, 561 of the 3548 1:1 relations were initially filtered out. By matching descriptions in SUPERFAMILY and InterPro, we retained 4 of the filtered mappings.
For 1:N relations, the coverage filters are somewhat more stringent than those above:
0.8 <= sf_length / cdd_length <= 1.25;
sf_coverage >= 0.75;
cdd_coverage >= 0.75.
1957 1:N relations were filtered out. By matching descriptions in SUPERFAMILY and InterPro, we further retained 22 mappings. At this point, all the mappings are 1:1 relations.
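The two sets of coverage filters differ only in the allowed length ratio, so they can be expressed as one parameterized check. The sketch below is illustrative only: the Alignment record, its field names, and the example lengths are hypothetical, and coverage is assumed to be aligned length divided by full length, as the parenthetical definitions suggest.

```python
from dataclasses import dataclass
from typing import Tuple

ONE_TO_ONE_RATIO = (0.5, 2.0)    # length-ratio bounds for 1:1 relations
ONE_TO_MANY_RATIO = (0.8, 1.25)  # stricter bounds for 1:N relations
MIN_COVERAGE = 0.75              # applied to both sf_coverage and cdd_coverage


@dataclass
class Alignment:
    sf_length: int    # full length of the SUPERFAMILY HMM
    cdd_length: int   # full length of the CDD representative sequence
    sf_aligned: int   # aligned length on the SUPERFAMILY HMM
    cdd_aligned: int  # aligned length on the CDD representative sequence


def passes_coverage(aln: Alignment, ratio_bounds: Tuple[float, float]) -> bool:
    """Return True if the alignment satisfies the length-ratio and coverage filters."""
    low, high = ratio_bounds
    length_ratio = aln.sf_length / aln.cdd_length
    sf_coverage = aln.sf_aligned / aln.sf_length
    cdd_coverage = aln.cdd_aligned / aln.cdd_length
    return (
        low <= length_ratio <= high
        and sf_coverage >= MIN_COVERAGE
        and cdd_coverage >= MIN_COVERAGE
    )


# Hypothetical alignment: ratio 100/150 (about 0.67), coverages 0.9 and 0.8.
example = Alignment(sf_length=100, cdd_length=150, sf_aligned=90, cdd_aligned=120)
ok_one_to_one = passes_coverage(example, ONE_TO_ONE_RATIO)
ok_one_to_many = passes_coverage(example, ONE_TO_MANY_RATIO)
print(ok_one_to_one, ok_one_to_many)
```

The example alignment passes the 1:1 filter but fails the 1:N filter, because its length ratio falls below 0.8 while both coverages exceed 0.75.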
In summary, 4502 (26+1489+2987) of the 7550 SUPERFAMILY HMMs were mapped to InterPro entries, all as 1:1 mappings. Of the 1109 superfamilies, 551 have only one InterPro entry; the rest are mapped to zero or multiple InterPro entries, with a maximum of 66 (the P-loop containing nucleoside triphosphate hydrolases superfamily in SUPERFAMILY). We treated superfamilies with multiple InterPro entries as unmapped, since those InterPro entries most likely represent particular subfamilies, and it would be inaccurate to describe a superfamily using a subfamily's definition.
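The three terms in the total can be traced back to the filtering steps. The short check below is a worked restatement of that arithmetic, taking 3548 and 3446 as the 1:1 and 1:N totals from the classification step; the variable names are chosen for this illustration only.

```python
# Counts reported in the mapping steps.
one_to_one_total = 3548      # 1:1 relations before coverage filtering
one_to_one_filtered = 561    # 1:1 relations removed by the coverage filters
one_to_many_total = 3446     # 1:N relations before coverage filtering
one_to_many_filtered = 1957  # 1:N relations removed by the coverage filters
retained_by_description = 4 + 22  # filtered mappings kept after matching descriptions

kept_one_to_one = one_to_one_total - one_to_one_filtered     # 2987
kept_one_to_many = one_to_many_total - one_to_many_filtered  # 1489
total_mapped = retained_by_description + kept_one_to_many + kept_one_to_one

print(retained_by_description, kept_one_to_many, kept_one_to_one, total_mapped)
```

The printed values correspond to the 26, 1489, and 2987 in the parenthetical sum and to the 4502 mapped HMMs.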
We would like to point out that the purpose of mapping SUPERFAMILY to InterPro here is to find the best possible InterPro description for a particular SUPERFAMILY member. We therefore applied the relatively stringent mapping conditions described above. This allows PCAS to steer clear of the complications caused by multi-domain protein families or by superfamily-subfamily relationships.