A. Backbone knowledge base selection
Given the cost and the limitations of expert-based curation, we decided to use the Medline database as the backbone for cross-domain and cross-scale data integration. Medline is arguably the foremost biomedical knowledge base: it has the most complete coverage of all areas of biomedical research among existing databases, and it is essential to most, if not all, biomedical researchers for exploring relevant topics and understanding the biological implications of their own data. Inherent conceptual relationships in Medline abstracts, titles, and MeSH terms can be used directly to link and understand concepts from different biological scales and biomedical research domains, a critical advantage that is very hard to match with an expert-curated system.
B. Identification of biomedical concepts in free text
Although the Medline database contains comprehensive information, all useful biomedical concepts in Medline titles and abstracts must be identified for the purpose of cross-domain data integration and computer-based analysis. While data from closely related biomedical research domains can often be linked through well-defined molecular IDs (e.g., gene IDs), structure IDs (e.g., anatomical structures) or spatial coordinates (genomic locations or 3D anatomical locations), cross-domain data integration frequently has to rely on matching different free text strings that represent the same concept, or on identifying relationships between different concepts. It is therefore essential that the free text strings in the Medline database be converted into computable forms for computer-based mining and more effective data presentation.
In collaboration with the National Center for Biomedical Ontology (NCBO), we developed a highly efficient and flexible free-text-to-biomedical-concept mapping solution called mgrep. It has three major advantages over the publicly available MetaMap Transfer (MMTx) program developed by the National Library of Medicine [http://mmtx.nlm.nih.gov]: 1) mgrep records the location of each matched concept in the original text, which is critical for in-depth text mining; 2) mgrep is about two orders of magnitude faster than MMTx, which enables us to reprocess the Medline dataset frequently on a single dual-Opteron server to keep up with updates from various controlled vocabulary sources as well as from Medline itself; 3) MMTx is designed to work only with concepts in the Unified Medical Language System (UMLS), whereas it is very straightforward for our solution to include and manage additional vocabularies, such as chemical compound names and ontologies in the Open Biomedical Ontologies (OBO) system [17].
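At its core, this kind of dictionary-based concept mapping can be sketched as follows. The vocabulary entries and concept IDs below are invented for illustration; mgrep itself compiles a far larger dictionary from UMLS and OBO sources and uses a more efficient matching strategy, but the key output, namely concept IDs paired with character offsets in the original text, is the same:

```python
import re

# Hypothetical miniature vocabulary: surface string -> concept ID.
# mgrep's real dictionary is compiled from UMLS and OBO ontologies.
VOCAB = {
    "hippocampus": "UMLS:C0019564",
    "dentate gyrus": "UMLS:C0152314",
    "bdnf": "GENE:627",
}

def map_concepts(text):
    """Return (concept_id, start, end) for every vocabulary term in text.

    Recording the character offsets of each match (one of mgrep's
    advantages over MMTx) enables position-aware downstream mining.
    """
    hits = []
    lowered = text.lower()
    for term, cid in VOCAB.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((cid, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[1])

hits = map_concepts("BDNF expression in the dentate gyrus of the hippocampus.")
```

In practice the matcher must also handle overlapping terms and millions of dictionary entries, which is where mgrep's speed advantage over MMTx matters.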
In a comparative analysis of concept mapping results using all of the concepts in UMLS and 10,000 Medline sentences as input, mgrep identified about 95% of the best-match concepts that MMTx can find. We believe this level of sensitivity is sufficient for data integration and exploration purposes. Although the details of our work in this area have yet to be published, a fully functional implementation of mgrep for on-the-fly free text to OBO ontology mapping is available at [http://bioontology.org/tools/oba.html]. Using mgrep, we are able to map free text strings in the Medline database to unique concepts in the full UMLS and in roughly ten OBO ontologies related to anatomy, disease, environment, and chemicals. We also identify concepts related to genes, genetic markers and cytobands using entity recognition engines developed in earlier work [4, 18, 19]. The transformation of highly variable free text strings into computable unique biomedical concepts, together with the ability to identify conceptual relationships based on UMLS and the ontologies, provides the foundation for cross-domain data integration in our database.
C. Integrating data from external sources
The mgrep program's ability to map free text strings to unique biomedical concepts opens up broad data integration possibilities. For example, even in the absence of common molecular and structure IDs, mgrep can be applied to the free text fields of a database to link together information from different domains. While such direct concept mapping will not always be correct, we believe the benefits outweigh the shortcomings: our main goal is to facilitate novel hypothesis development by presenting researchers with a more comprehensive view of related issues, rather than to provide 100% accurate conceptual relationships.
Naturally, our system also supports integration of data from outside the Medline database. This integration is based on widely used IDs and coordinates such as gene IDs and genomic locations. Since our prototype focuses on neurobiological problems, we also enable integration of data based on Allen Brain Atlas structure coordinates.
For example, in order to link Medline records to individual brain structure names in the Allen Brain Atlas, we downloaded canonical mouse and rat brain structure nomenclatures for four brain atlases from the BrainInfo website [http://braininfo.rprc.washington.edu/Nnont.aspx]. The same page provides a mapping of each term to the NeuroName2002 ontology for human and macaque neuroanatomy. We combined all distinct brain structure name strings from the different atlases and NeuroName into a list of 13,233 text strings representing various brain structures. Since one of the atlases annotated by the BrainInfo website is the Dong atlas used by the Allen Brain Project, we are able to map all text strings from the other atlases and NeuroName to the brain structure terms used by the Allen Brain Atlas via the NeuroName annotation provided by the BrainInfo project.
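The cross-atlas mapping amounts to a join on the shared NeuroName annotation. A minimal sketch, with all terms and IDs made up for illustration:

```python
# Illustrative lookup tables only; the real mapping comes from
# BrainInfo's NeuroName annotation of each atlas nomenclature.
neuronames_of = {                    # atlas term -> NeuroName ID
    "CA1 field": "NN:2346",
    "field CA1 of hippocampus": "NN:2346",
    "olfactory bulb": "NN:279",
}
aba_of = {                           # NeuroName ID -> Allen Brain Atlas term
    "NN:2346": "Field CA1",
    "NN:279": "Main olfactory bulb",
}

def to_aba(term):
    """Map a term from any annotated atlas to its Allen Brain Atlas name."""
    nn = neuronames_of.get(term)
    return aba_of.get(nn)
```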
To reduce false positives in the brain anatomical structure mapping, we exclude abbreviations and require each structure term to be at least 5 characters long. Then we use the lvg program to generate common lexical variations of each text string; we eliminate high frequency words without distinctive meaning, e.g. "the" and "of", that we identified in an earlier full Medline text analysis; and we generate word order permutations. Next, using mgrep, all string variations are mapped to the full Medline abstracts and titles (again discarding meaningless high frequency words). To further increase the sensitivity of identifying Medline records related to brain structures, we also included records with MeSH terms that can be directly mapped to structure terms in the Allen Brain Atlas. The combination of text-string and MeSH based mapping leads to close to 1 million Medline abstracts that can be linked to structure names in the Allen Brain Atlas for the December, 2008 download of the Medline database.
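The string preprocessing steps above can be sketched roughly as follows. The stopword list here is a stand-in for the empirically derived high-frequency word list, and the lvg lexical-variant step is omitted for brevity:

```python
from itertools import permutations

STOPWORDS = {"the", "of", "and"}  # stand-in for the empirically derived list

def variants(term, min_len=5):
    """Generate normalized search variants of a brain structure name.

    Mirrors the preprocessing described above: skip short terms
    (likely abbreviations), drop uninformative high-frequency words,
    and emit word-order permutations. The real pipeline additionally
    applies lvg-generated lexical variants before permutation.
    """
    if len(term) < min_len:          # exclude abbreviations / short terms
        return set()
    words = [w for w in term.lower().split() if w not in STOPWORDS]
    return {" ".join(p) for p in permutations(words)}
```

For example, "nucleus of the solitary tract" yields variants such as "solitary tract nucleus", which increases recall against the many phrasings found in Medline text.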
D. Identify potentially important conceptual relationships
Our extensive concept mappings of the full Medline database allow us to associate pairs of over 1 million unique biomedical concepts from different biomedical research domains based on concept co-occurrence at either the abstract or the sentence level. However, such co-occurrence-based association may produce a large number of false relationships that reduce the efficiency of data exploration and mining, and short of expert curation there is no satisfactory way to extract accurate conceptual relationships with decent sensitivity. While significant progress has been made in the area of natural language processing [20], the best solutions for extracting conceptual relationships are still not ideal.
Since our goal is to help researchers mine data more effectively, we decided to compare the level/frequency of a concept in a given context (e.g., Medline records returned by a keyword search) to the concept's level/frequency in the full database (e.g., all of Medline) in order to rank the concept's importance in that context. The underlying assumption is that the concepts ranked most significant by this approach are those most likely to have meaningful relationships to the query term(s). This approach has two main advantages: it is computationally efficient, and it can be applied to different data types.
For example, we use this approach to identify the most significant disease terms from a list of Medline records in the current solution, as in our earlier PubOnto web application [7]. Briefly, based on disease concept mapping and MeSH term annotation of each Medline abstract, we are able to pre-calculate the overall frequency of each disease-related concept in the Medline database. The frequency of disease terms in each set of returned Medline records, regardless of what search terms and filters are used, can then be ranked against these background values on the fly in a number of different ways. Consequently, users can easily identify, out of the very large number of concepts with abstract- or sentence-level co-occurrence, the most significant concepts associated with the search results for further exploration. While there is no guarantee that such a simple statistical approach will always identify the most meaningful conceptual relationships, it provides a valuable starting point for data mining.
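A toy version of this frequency-versus-background ranking, using fold enrichment as the score (the counts below are invented, and the production system offers several ranking methods):

```python
def rank_concepts(result_counts, n_results, background_counts, n_total):
    """Rank concepts by fold enrichment in a result set vs. the full corpus.

    result_counts:     {concept: count in the returned records}
    background_counts: {concept: pre-calculated count in all of Medline}
    """
    scores = {}
    for concept, k in result_counts.items():
        observed = k / n_results
        expected = background_counts.get(concept, 0) / n_total
        scores[concept] = observed / expected if expected else float("inf")
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up example: "epilepsy" is rare overall but common in these results.
ranked = rank_concepts(
    {"epilepsy": 40, "cancer": 5}, 100,
    {"epilepsy": 2000, "cancer": 50000}, 1_000_000)
```

Here "epilepsy" is ranked far above "cancer" despite a lower raw count in the corpus, which is exactly the behavior wanted: frequency relative to background, not absolute frequency, drives the ranking.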
We also applied the same approach to rank genes that are expressed in individual brain structures based on the voxel-level gene expression data from the Allen Brain Atlas [21]. Since the Allen Brain Atlas provides expression data for around 20,000 genes at a 200-micron voxel resolution, methods to select the most relevant genes are necessary for more effective exploration of functional relationships between genes and brain structures. One of the ranking methods included in the current solution is the ratio between the average gene expression level over all voxels belonging to a brain structure and the average gene expression level over all voxels in the whole brain. This method turns out to be quite effective in identifying genes that are highly expressed in distinct brain regions. For example, the top 20 genes in each brain region identified by this simple method show, on average, a 21-fold higher expression level in the specific brain region than in the whole brain. Consequently, researchers can easily identify the genes most significantly expressed in a brain region and further explore their functional relevance.
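The structure-versus-whole-brain ratio can be sketched as follows, using a handful of made-up voxels in place of the real 200-micron Allen Brain Atlas grid:

```python
def structure_enrichment(expr, structure_voxels):
    """Rank genes by mean expression in a structure vs. the whole brain.

    expr: {gene: {voxel_id: expression level}} -- toy stand-in for the
    Allen Brain Atlas voxel data (~20,000 genes, 200-micron voxels).
    """
    ratios = {}
    for gene, voxels in expr.items():
        whole = sum(voxels.values()) / len(voxels)
        inside = [v for vid, v in voxels.items() if vid in structure_voxels]
        if inside and whole:
            ratios[gene] = (sum(inside) / len(inside)) / whole
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Made-up data: "Drd1" is concentrated in voxels 1-2, "Actb" is uniform.
expr = {
    "Drd1": {1: 9.0, 2: 9.0, 3: 0.0, 4: 0.0},
    "Actb": {1: 5.0, 2: 5.0, 3: 5.0, 4: 5.0},
}
top = structure_enrichment(expr, {1, 2})
```

The regionally concentrated gene comes out on top (ratio 2.0) while the uniformly expressed housekeeping-style gene scores 1.0, matching the intent of the ranking.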
E. Use anatomical structure as a scaffold
Identification of brain anatomical structure concepts in Medline records not only facilitates the integration of data with brain structures, but also enables the use of the Allen Brain Atlas as both an overview of the data and a starting point for data exploration in a relevant biological context. There are several reasons for selecting brain anatomical structure as the anchor for data presentation and exploration: anatomical structure is a biologically meaningful way of integrating cross-domain data and literature, since the majority of pathophysiological processes in the brain can be linked to specific brain structures; hierarchical and brain-circuit-level relationships among different brain structures (important for detailed data exploration but not obvious to most molecular biologists) are easily presented in brain anatomy; the presentation of anatomical structure at the level of major brain nuclei is not as overwhelming as a complex network graph; and the fixed location of each brain structure in a specific brain section plane allows quick identification of relevant content.
Without a doubt, alternative perspectives on the same data are needed. In PubAnatomy, we included a gene network view and various list views in our user interface; these views are described in the RESULTS section. We also incorporated a new solution that takes advantage of compatible external applications for more effective data exploration, described in the next section.
F. Interoperability with other applications
In order to share data among multiple applications, we designed a simple and effective schema that accommodates the following requirements: data generation, individual or group access permissions, diversified data fields, a central dataset registry, and incremental updates of data sets.
Our schema involves a set of five relational database tables: 1) a user account table, which maintains the groups that each user belongs to; 2) a share permission table, through which the original creator of a data set can set its access permissions for individual accounts, group accounts, or the public; 3) a dataset definition table, which keeps the title and field names of each data set; 4) a data storage table, the ultimate table for storing the data of each set; and 5) an update history table, which tracks the history of changes to each set, including the application, the type of change (e.g., create/update/delete), and the parameters used to determine the change.
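The five tables might look roughly like the following SQLite sketch; the table and column names are illustrative, not the production definitions:

```python
import sqlite3

# A minimal sketch of the five-table sharing schema; column names
# are assumptions for illustration, not the production schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_account (
    user_id  INTEGER,
    group_id INTEGER,                        -- a user may belong to many groups
    PRIMARY KEY (user_id, group_id));
CREATE TABLE share_permission (
    dataset_id INTEGER,
    grantee    TEXT,                         -- user ID, group ID, or 'public'
    PRIMARY KEY (dataset_id, grantee));
CREATE TABLE dataset_definition (
    dataset_id  INTEGER PRIMARY KEY,
    title       TEXT,
    field_names TEXT);
CREATE TABLE data_storage (                  -- the ultimate data table
    dataset_id INTEGER,
    row_id     INTEGER,
    field      TEXT,
    value      TEXT);
CREATE TABLE update_history (
    dataset_id  INTEGER,
    application TEXT,
    change_type TEXT,                        -- create / update / delete
    parameters  TEXT,
    changed_at  TEXT DEFAULT CURRENT_TIMESTAMP);
""")
```

Keeping the registry (dataset definitions) separate from the storage table is what allows incremental updates: an application appends rows to data_storage and logs the change in update_history without touching other applications' data.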
There are alternative approaches to implementing data sharing and interoperability, typically session-based data sharing and web service frameworks. The initial simplicity of session-based data sharing is outweighed by difficulties in cross-domain data exchange arising from varying firewall settings, organizational security policies and browser requirements. A pure web service approach is unsatisfactory because function changes need to be coordinated among different groups. The data sharing schema described above is more stable, since it can accommodate future function changes once the same schema is adopted across groups.