Table 2 A text mining approach using an entropy-based scoring function rediscovers the molecular function of proteins sharing PROSITE motifs

From: The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications

| Motif | # of proteins | # of documents | Terms |
|---|---|---|---|
| EF_HAND | 36 | 183 | ef-hand, calcium-bind, calcium, ca 2+, calcium-bind protein, ca, 2+ bind, 2+, ef-hand motif, calmodulin |
| TRYPSIN_SER | 11 | 108 | serin proteinas, proteinas, chymotrypsin, serin, serin proteas, elastase, ser-195, his-57, proteinas especially, proteolyt |
| PROTEIN_KINASE_ST | 15 | 107 | protein kinas, catalyt domain, phosphoryl, substrat, autophosphoryl, phosphoryl site, kinas, threonin, catalyt, constitutively active |
1. The method extracts text from the abstracts of the references annotated in each protein's Swiss-Prot record, pre-processes it (tokenization into terms, removal of non-content words, and basic stemming to normalize word forms), and scores each term by its distribution across proteins and its relative significance in the entire corpus of Swiss-Prot referenced documents. Terms are therefore listed in stemmed form (for example, "calcium-bind" for "calcium binding"). Because no further normalization is applied, redundant words and concepts appear (for example, "calcium", "ca", and "ca 2+"). Although still very preliminary, the method captures the molecular function of each cluster of proteins shown: "ef-hand" and "calcium binding" for EF_HAND; "serine proteinase", "proteolysis", and the active site residues "ser-195" and "his-57" for TRYPSIN_SER; and "protein kinase", "phosphorylation", "catalytic domain", and the substrate residue "threonine" for PROTEIN_KINASE_ST.
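The pipeline described in the footnote (tokenize, drop non-content words, stem, then score terms) can be illustrated with a minimal Python sketch. The footnote does not give the scoring formula, so everything specific here is an assumption made for illustration: the stop list, the suffix-stripping `crude_stem`, and a score defined as the normalized Shannon entropy of a term's counts across the proteins in a cluster, multiplied by a smoothed log ratio of its frequency in the cluster to its frequency in the whole corpus. The toy abstracts are invented and are not Swiss-Prot data.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; the source does not list one).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "by", "with", "for"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for the 'basic stemming' step."""
    for suffix in ("ation", "ing", "es", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def terms(text):
    """Tokenize, remove non-content words, and stem."""
    tokens = re.findall(r"[a-z0-9][a-z0-9+\-]*", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def score_terms(cluster, corpus):
    """Rank terms for one cluster of proteins (hypothetical scoring function).

    cluster: {protein_id: [abstract, ...]}; corpus: [abstract, ...].
    """
    per_protein = {p: Counter(t for doc in docs for t in terms(doc))
                   for p, docs in cluster.items()}
    cluster_counts = Counter()
    for counts in per_protein.values():
        cluster_counts.update(counts)
    corpus_counts = Counter(t for doc in corpus for t in terms(doc))

    n_cluster = sum(cluster_counts.values())
    n_corpus = sum(corpus_counts.values())
    vocab = len(set(cluster_counts) | set(corpus_counts))
    # Entropy is maximal when a term is spread evenly over all proteins.
    max_entropy = math.log2(len(per_protein)) if len(per_protein) > 1 else 1.0

    scores = {}
    for term, total in cluster_counts.items():
        # Distribution of the term's occurrences across the cluster's proteins.
        probs = [c[term] / total for c in per_protein.values() if c[term]]
        entropy = -sum(p * math.log2(p) for p in probs) / max_entropy
        # Add-one smoothed frequency in the cluster versus the whole corpus.
        p_in = (total + 1) / (n_cluster + vocab)
        p_bg = (corpus_counts[term] + 1) / (n_corpus + vocab)
        scores[term] = entropy * math.log2(p_in / p_bg)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy abstracts, for illustration only.
cluster = {
    "P1": ["Calcium binding by the EF-hand protein calmodulin."],
    "P2": ["The EF-hand motif binds calcium."],
}
corpus = [doc for docs in cluster.values() for doc in docs] + [
    "Serine proteinase cleaves the substrate protein.",
    "Protein kinase phosphorylation of threonine.",
]
ranked = score_terms(cluster, corpus)
for term, score in ranked[:3]:
    print(f"{term}\t{score:.3f}")
```

Under this assumed score, a term ranks highly only if it is shared across the proteins of the cluster and is also over-represented relative to the corpus; a term confined to a single protein scores zero. Without further normalization, near-synonymous stems would be ranked separately, which is the redundancy the footnote mentions.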