The EST2Prot system builds upon the infrastructure of Biozon. It uses multiple data sets, all integrated into a single, tightly connected schema that enables great flexibility in querying for complex relations between entities. Specifically, we utilize the many paths that exist between entities in the Biozon data graph to map ESTs to protein products.
Biozon
The Biozon database [21] is a system that unifies multiple biological databases consisting of a variety of heterogeneous data types (such as DNA sequences, proteins, interactions, cellular pathways and more) into a single schema. Logically, the database is viewed as a large graph where biological entities correspond to nodes and edges correspond to relations, as is depicted schematically in Figure 1. The underlying assumption of Biozon is that any biological entity or process can be associated with a physical object or a set of physical objects. Therefore, physical objects form the backbone of the database and their physical properties serve as the actual identifiers. For example, a protein is uniquely identified by its amino acid sequence and a DNA by its sequence of nucleotides. An interaction between two proteins or between a protein and a DNA is represented as a set of physical objects (the interacting partners), a protein family is a set of protein sequences, a metabolic pathway is a set of reactions (each one associated with a protein (enzyme) family) and so on. Each type of object is also associated with an identity operator that is used to compare entities and determine whether they are identical (for example, for sequences the string match operator is used, for sets we use the set-identity operator and for arbitrary subgraphs graph isomorphism is used).
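The identity operators described above can be illustrated with a minimal Python sketch. The class names `Protein` and `Interaction` are hypothetical and are not part of the Biozon schema; the sketch only shows that string match identifies sequences and that set identity identifies an interaction, regardless of the order in which its partners are listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    # Hypothetical class: the amino acid sequence itself is the identifier,
    # so equality and hashing reduce to a string match.
    sequence: str


@dataclass(frozen=True)
class Interaction:
    # Hypothetical class: an interaction is a set of physical objects,
    # so equality reduces to set identity.
    partners: frozenset


a = Protein("MKTAYIAK")
b = Protein("MKTAYIAK")   # same sequence reported by another source
c = Protein("MSLQW")

same_protein = (a == b)               # string match on the sequence
distinct_nodes = len({a, b, c})       # a and b collapse into one node
same_interaction = (
    Interaction(frozenset({a, c})) == Interaction(frozenset({c, b}))
)
```

Graph isomorphism for arbitrary subgraphs is not covered here, since it would need a dedicated algorithm rather than a built-in equality test.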
The reliance on physical entities and sets of physical entities as our backbone is especially useful for data integration since it allows unambiguous unification of many entities from different databases based on their physical properties. For example, a protein sequence that exists in Swiss-Prot [22], PIR [23] and RefSeq [24] will be mapped to the same sequence object (node) in the data graph and the information that is available in these sources about this protein will be accessible from a single entry point in Biozon. Unlike identifiers such as accession numbers and cross-references that are potentially unstable or inconsistent (as each database uses its own set of identifiers), relationships that are established based on physical non-redundant Biozon objects are highly reliable and are materialized explicitly in the data graph. This has a great benefit in linking entities from disparate sources. For example, paths are formed between protein domains from InterPro [25] and interactions from BIND [26] or between protein structures from PDB [27] and metabolic pathways from KEGG [28]. Relations between objects in Biozon can have different meanings, depending on the entities they connect. For example, 'member of' is a relation that connects a protein to a protein family or an EST to an EST cluster. The relation 'manifests' relates a protein sequence to its structure, 'encodes' relates a DNA sequence to protein sequence(s), 'similar' relates two similar protein sequences and so on. The large-scale data integration results in a highly connected graph structure that allows one to see each entity in its broader context with all its related entities; a context that cannot be determined from any one source. Utilizing its graph structure, Biozon allows complex and fuzzy searches on the data graph that span multiple data types and specify desired interrelationships between them. For more details on the Biozon schema and its various components see [20].
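As a rough sketch of this unification step, the Python fragment below groups source records by sequence rather than by accession. The function name `unify`, the record layout and the accession strings are made up for illustration, and the whitespace and case normalization is an assumption rather than a documented Biozon rule.

```python
from collections import defaultdict


def unify(records):
    """Group (source, accession, sequence) records by their sequence.

    Each distinct sequence becomes one node that lists every source entry
    describing it. Normalizing case and whitespace is an assumption.
    """
    nodes = defaultdict(list)
    for source, accession, sequence in records:
        key = sequence.strip().upper()
        nodes[key].append((source, accession))
    return dict(nodes)


# Illustrative records; the accessions are placeholders, not real entries.
records = [
    ("Swiss-Prot", "ACC_A1", "MKTAYIAK"),
    ("PIR", "ACC_B7", "MKTAYIAK"),
    ("RefSeq", "ACC_C3", "mktayiak"),
    ("Swiss-Prot", "ACC_A2", "MSLQW"),
]
nodes = unify(records)
```

Here the three entries that share one sequence end up on a single node, so annotations from all three sources are reachable from one entry point.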
Mapping ESTs to proteins
The EST2Prot system exploits a subset of the Biozon schema, including DNA sequences, proteins and EST clusters and the 'encodes', 'substring' and 'similar' relations. We explore five different direct paths in the Biozon data graph, and say that EST s is directly mapped to protein p if:
1. s encodes p
2. s is a substring of DNA s' near an encoding region of s' which encodes for p (see section 'Relations').
3. s is a member of a UniGene cluster to which NCBI assigns p
4. s is a member of a UniGene cluster containing s' and s' encodes p
5. s is a member of a UniGene cluster containing s' and s' is a substring of s" near an encoding region of s" which encodes for p
We say an EST s maps to protein p if s directly maps to p or if s directly maps to p' and p' is similar to p as described in section 'Relations'. An overview of our system is given in Figure 2.
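The five direct paths and the similarity extension can be sketched in Python as follows. This is an illustration under stated assumptions, not the EST2Prot implementation: the dictionary layout of `g`, the key names and the function names are hypothetical, the 'linked' table is assumed to hold the precomputed result of the substring rule, and excluding s itself when scanning its cluster (so that rules 4 and 5 do not duplicate rules 1 and 2) is a choice made for this sketch.

```python
def direct_map(s, g):
    """Return {protein: set of rule numbers (1-5)} for EST s."""
    hits = {}

    def add(proteins, rule):
        for p in proteins:
            hits.setdefault(p, set()).add(rule)

    add(g["encodes"].get(s, ()), 1)            # rule 1: s encodes p
    add(g["linked"].get(s, ()), 2)             # rule 2: s near a coding region for p
    cluster = g["cluster_of"].get(s)
    if cluster is not None:
        add(g["unigene"].get(cluster, ()), 3)  # rule 3: protein assigned to the cluster
        for other in g["members"].get(cluster, ()):
            if other == s:
                continue                       # sketch choice: skip s itself
            add(g["encodes"].get(other, ()), 4)  # rule 4
            add(g["linked"].get(other, ()), 5)   # rule 5
    return hits


def all_map(s, g):
    """Direct mappings plus proteins similar to a directly mapped protein."""
    direct = set(direct_map(s, g))
    extended = set(direct)
    for p in direct:
        extended |= set(g["similar"].get(p, ()))
    return extended


# Toy graph with made-up identifiers.
g = {
    "encodes": {"est1": {"P1"}, "est2": {"P2"}},
    "linked": {"est3": {"P3"}},
    "cluster_of": {"est1": "C1", "est2": "C1", "est3": "C1", "est4": "C1"},
    "members": {"C1": {"est1", "est2", "est3", "est4"}},
    "unigene": {"C1": {"P4"}},
    "similar": {"P4": {"P5"}},
}
direct = direct_map("est4", g)
mapped = all_map("est4", g)
```

Keeping the rule numbers alongside each protein preserves the type of path that produced the mapping, which is the information a ranking by path type would need.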
It should be noted that while UniGene relies just on BLAST searches with respect to eight model organisms, Biozon uses all these paths at once to create a more comprehensive mapping between ESTs and proteins. It is the tightly connected schema of Biozon that enables immediate information flow and deduction of paths between entities, without having to resort to external resources outside the database or expensive computations. Most notably, the materialization of similarity data brings forward instantly an unprecedented amount of information that otherwise would require millions of BLAST searches. This is especially important since often proteins with unknown properties can be characterized based on their similarity with better studied homologous proteins.
Data sets
DNA sequences are gleaned from GenBank records. As of September 2005 (release 2.2), Biozon contains 42,686,711 unique DNA sequences. Proteins are extracted from several databases (including Swiss-Prot/TrEMBL, Genpept, PDB, PIR, BIND and other sources) and unified into a non-redundant set based on their physical sequence of amino acids (rather than based on cross-links). All together, Biozon contains 2,062,061 unique protein sequences in release 2.2.
EST clusters
In response to the growing chaos of EST data, NCBI developed UniGene [6], a gene-oriented clustering of transcribed nucleic acid sequences. UniGene includes only protein-coding genes which have at least 100 high quality non-repetitive base pairs. It also requires that its clusters be 3' anchored. Clusters not showing evidence of reaching the 3' terminus are eliminated (these are usually singleton clusters). Each UniGene cluster represents a gene and its alternative splice forms. Associated with each cluster are the gene's possible protein products. These proteins are chosen by comparing the cluster sequences with the available proteomes of eight model organisms [Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (baker's yeast), Escherichia coli, and Arabidopsis thaliana (mouse-ear cress)]. For each model organism, the cluster is assigned the protein most similar to a representative sequence with respect to some similarity threshold (BLAST evalue less than 1e-6). If no sequence in a cluster has a significant BLAST match, then that cluster is left unassigned. In fact, UniGene does not assign proteins to 42% of its clusters. UniGene clusters for 54 organisms were integrated into the Biozon schema and in release 2.2 this dataset contains 807,175 clusters with a total of 19,471,927 EST sequences.
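The per-organism assignment described above can be sketched as a best-hit selection under an evalue cutoff. The function name and the tuple layout of `hits` are hypothetical, and reading "most similar" as "lowest evalue" is an assumption; this is not UniGene's actual code.

```python
def assign_proteins(hits, max_evalue=1e-6):
    """Pick, per organism, the best-scoring protein for one cluster.

    hits: iterable of (organism, protein_id, evalue) tuples for the
    cluster's representative sequence. Returns {organism: protein_id};
    an empty dict means the cluster is left unassigned.
    """
    best = {}
    for organism, protein, evalue in hits:
        if evalue >= max_evalue:
            continue  # not significant under the stated threshold
        if organism not in best or evalue < best[organism][1]:
            best[organism] = (protein, evalue)
    return {organism: protein for organism, (protein, _) in best.items()}


# Made-up hits for illustration.
example_hits = [
    ("human", "P1", 1e-20),
    ("human", "P2", 1e-30),
    ("mouse", "P3", 1e-3),   # above the cutoff, ignored
]
assignment = assign_proteins(example_hits)
```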
Relations
To determine possible links between EST sequences and proteins we explore several paths, as is depicted in Figure 2. These paths are based on the following relations.
The 'encodes' relation
This relation ties nucleic acid sequences and proteins. The relations are not established based on cross links, but rather based on physical properties. Each encodes relation (d, p) indicates that the DNA sequence d contains a coding region that can be translated completely to the protein sequence p.
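A simplified check in the spirit of the 'encodes' relation is sketched below. It assumes the standard genetic code, considers only the three forward reading frames, and ignores introns and the reverse strand. How Biozon actually locates coding regions is not described in this section, so the function names and the approach are illustrative only.

```python
from itertools import product

# Standard genetic code in the usual compact TCAG ordering ('*' marks stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def encodes(dna, protein):
    """True if some forward frame of dna translates to protein in full.

    Since protein contains no stop symbol, a substring match on the
    translation implies an uninterrupted stretch of codons.
    """
    dna = dna.upper()
    protein = protein.upper()
    if not protein:
        return False
    return any(protein in translate(dna[frame:]) for frame in range(3))


example_dna = "CCATGAAATAGGG"   # frame 2 reads ATG AAA TAG -> M K stop
```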
The 'UniGene encodes' relation
This relation is established between UniGene clusters and proteins. The relations are established by the UniGene team as described above.
The 'substring' relation
This relation exists between strings of the same data type (e.g. nucleic acid sequences). A substring relation (d, d') indicates that the DNA sequence d is a fragment of the longer DNA sequence d'. Of special interest are substring relations that place a fragment d near a coding region of d'. If d is no more than 50 base pairs away from overlapping a coding region of d' that encodes for protein p, then we say that d is linked to p. The strict threshold of 50 base pairs was chosen to ensure high quality. However, as Figure 3 shows, more permissive thresholds can be used to extend the set of links formed between DNA and protein sequences.
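The proximity test can be sketched as an interval computation. The sketch assumes 0-based half-open coordinates, a hypothetical `(start, end, protein_id)` layout for coding regions, and a gap measured as the number of bases lying strictly between the fragment and the coding region; the exact distance convention used in Biozon is not specified here.

```python
def occurrences(fragment, parent):
    """Yield every start position of fragment within parent."""
    start = parent.find(fragment)
    while start != -1:
        yield start
        start = parent.find(fragment, start + 1)


def gap(frag_start, frag_end, cds_start, cds_end):
    """Bases between two half-open intervals; 0 if they overlap or touch."""
    return max(0, cds_start - frag_end, frag_start - cds_end)


def linked_proteins(fragment, parent, coding_regions, max_gap=50):
    """Proteins whose coding region lies within max_gap bases of fragment.

    coding_regions: iterable of (start, end, protein_id) on parent.
    """
    linked = set()
    for start in occurrences(fragment, parent):
        end = start + len(fragment)
        for cds_start, cds_end, protein in coding_regions:
            if gap(start, end, cds_start, cds_end) <= max_gap:
                linked.add(protein)
    return linked


# Toy parent sequence: the fragment occupies positions 40-50.
parent = "A" * 40 + "ACGTACGTAC" + "T" * 30 + "G" * 90
regions = [(80, 170, "P1"), (120, 170, "P2")]
strict = linked_proteins("ACGTACGTAC", parent, regions)
permissive = linked_proteins("ACGTACGTAC", parent, regions, max_gap=100)
```

Raising `max_gap` mirrors the more permissive thresholds mentioned above: in this toy case the second coding region, 70 bases away, is linked only under the looser setting.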
The 'similarity' relation
The similarity relation is one of the most fundamental relations in biology, frequently used for functional inference. Biozon computes and stores similarity relationships between proteins based on sequence, structure or expression profiles. The integration of similarity data enables the propagation of information from well-studied entries to uncharacterized ones.
Biozon contains pairwise similarities for about 2,000,000 sequences, which were computed using BLAST [29], resulting in a total of about 6.5 billion significant pairwise similarities (with evalue < 0.1). These similarity relations are used to extend the mappings from ESTs to proteins, thus increasing the set of functional descriptors that can be associated with an EST. The main advantages of Biozon's similarity relations are scalability and accessibility. Since EST analysis requires expensive database searches for possible protein products, it is difficult to scale existing methods for EST analysis to large libraries. By materializing similarity data, knowledge propagation in Biozon becomes immediate, thus facilitating the task of function assignment.
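The idea of materializing similarity data can be illustrated with a small lookup table built once from precomputed hits. The function name, the tuple layout and the decision to store each pair symmetrically with the smaller evalue are assumptions of this sketch, not a description of Biozon's storage.

```python
from collections import defaultdict


def materialize(blast_hits, max_evalue=0.1):
    """Build {protein: {neighbor: evalue}} from (a, b, evalue) tuples.

    Only hits below max_evalue are kept. Storing both directions with the
    smaller evalue is a simplification made for this sketch.
    """
    table = defaultdict(dict)
    for a, b, evalue in blast_hits:
        if a == b or evalue >= max_evalue:
            continue
        table[a][b] = min(evalue, table[a].get(b, evalue))
        table[b][a] = min(evalue, table[b].get(a, evalue))
    return dict(table)


# Made-up hits for illustration.
similarity = materialize([
    ("P1", "P2", 1e-10),
    ("P2", "P1", 1e-8),
    ("P1", "P3", 0.5),     # not significant, dropped
    ("P4", "P4", 0.0),     # self hit, dropped
])
neighbors_of_p1 = set(similarity.get("P1", {}))
```

Once the table exists, finding the proteins similar to a mapped protein is a dictionary lookup rather than a new database search.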
Target proteins
A biologist might be interested only in ESTs that are linked to a specific biological system. To address this need, the EST2Prot system can be queried with respect to specific biological descriptors. The system collects a set of target proteins with relevant functions and reports the ESTs which map to at least one target protein. We define our target proteins by target descriptors, which are based on GO terms [30] and Swiss-Prot keywords [22]. Swiss-Prot keywords are descriptors that are associated with proteins based on manual curation. These keywords have been used in many studies to automatically annotate proteins or assess the biological function of protein clusters (e.g. [31, 32]). The Gene Ontology (GO) functional descriptors are obtained from the GO database [30]. GO terms are organized in an acyclic tree-like graph where a node's parent represents a property that is more general than the node's property. However, unlike in a tree, in the GO graph it is possible to have more than one path leading from the root to a node. Also, a protein may be assigned more than one GO term, each one on a different branch of the graph (the different branches represent different groups of properties). GO terms in Biozon were collected from multiple sources, downloaded from the GO consortium website and extracted from databases such as UniProt. A total of 1,111,272 proteins in Biozon can be associated with GO terms in release 2.2. (Since protein databases contain many similar and almost identical proteins, the number of functionally different proteins with GO terms is obviously smaller).
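A target-protein query over such a graph might be sketched as below. The term identifiers are placeholders rather than real GO accessions, the function names are hypothetical, and expanding a target descriptor to all of its more specific descendant terms is an assumption of this sketch rather than something stated above.

```python
def descendants(term, children):
    """Return term plus every term reachable below it in the DAG.

    The visited set keeps nodes with several parents from being
    expanded more than once.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def target_proteins(term, children, annotations):
    """Proteins annotated with term or any of its descendants."""
    terms = descendants(term, children)
    return {p for p, assigned in annotations.items() if assigned & terms}


def ests_for_targets(est_to_proteins, targets):
    """ESTs that map to at least one target protein."""
    return {e for e, proteins in est_to_proteins.items() if proteins & targets}


# Toy DAG: T3 is reachable from the root T0 through both T1 and T2.
children = {"T0": ["T1", "T2"], "T1": ["T3"], "T2": ["T3"]}
annotations = {"P1": {"T3"}, "P2": {"T2"}, "P3": {"T9"}}
est_to_proteins = {"est1": {"P1"}, "est2": {"P3"}}

targets = target_proteins("T1", children, annotations)
reported = ests_for_targets(est_to_proteins, targets)
```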
User interface
Given an EST (a GenBank or RefSeq accession number), EST2Prot explores all possible paths leading from that sequence to protein products in the Biozon data graph. The user is presented with multiple pages that summarize the information and rank the proteins based on our confidence in the association (depending on the type of the path). The first page provides the entry point to the Biozon data graph for the query EST and each page is linked to other pages with increasingly detailed information on the mapped proteins. For more information on the webserver see the Appendix (Additional File 1).