Semantically linking and browsing PubMed abstracts with gene ontology

Background The technological advances in the past decade have lead to massive progress in the field of biotechnology. The documentation of the progress made exists in the form of research articles. The PubMed is the current most used repository for bio-literature. PubMed consists of about 17 million abstracts as of 2007 that require methods to efficiently retrieve and browse large volume of relevant information. The State-of-the-art technologies such as GOPubmed use simple keyword-based techniques for retrieving abstracts from the PubMed and linking them to the Gene Ontology (GO). This paper changes the paradigm by introducing semantics enabled technique to link the PubMed to the Gene Ontology, called, SEGOPubmed for ontology-based browsing. Latent Semantic Analysis (LSA) framework is used to semantically interface PubMed abstracts to the Gene Ontology. Results The Empirical analysis is performed to compare the performance of the SEGOPubmed with the GOPubmed. The analysis is initially performed using a few well-referenced query words. Further, statistical analysis is performed using GO curated dataset as ground truth. The analysis suggests that the SEGOPubmed performs better than the classic GOPubmed as it incorporates semantics. Conclusions The LSA technique is applied on the PubMed abstracts obtained based on the user query and the semantic similarity between the query and the abstracts. The analyses using well-referenced keywords show that the proposed semantic-sensitive technique outperformed the string comparison based techniques in associating the relevant abstracts to the GO terms. The SEGOPubmed also extracted the abstracts in which the keywords do not appear in isolation (i.e. they appear in combination with other terms) that could not be retrieved by simple term matching techniques.


Background
The development of new technologies in the fields of bioinformatics, bio-engineering and functional genomics has lead to the vast amount of research. The advent of these new research fields has lead to an exponential growth of the literature. The PubMed is one of the leading repositories for such growing literature. There are as many as 16 million (and counting) abstracts referenced by the PubMed as of 2006 [1,2]. Finding meaningful abstracts or papers from such a huge database is a great challenge. More than often the classical key-word based search engines yield results that are not meaningful to the query. There is a need of a semantic-sensitive search engine to browse the relevant information from the PubMed.
The search engine used by PubMed is the 'Entrez' system [3]. The Entrez system performs the search operation in two steps. In the first step, the Entrez performs the query translation in which it identifies the existence of Medical Subject Headings (MeSH) terms in the query. In the second step, the translated query is compared with words from all the abstracts in the repository based on 'String Matching' (term matching) to find the relevant abstracts. The extraction of the relevant abstracts based on string matching can not capture the underlying semantics. If the abstracts were to be retrieved only on the basis of keywords, their synonyms are not used in the search process. For example, Leukemia, blood cancer and bone marrow cancer are synonymous terms. The search in PubMed with the keyword 'Leukemia' retrieves only the abstracts containing the word leukemia but not blood cancer/bone marrow cancer. The relevant abstracts are further presented to the user in the order of decreasing PubMed Ids. PubMed Id is the index number for each abstract in the repository. PubMed therefore presents the results in the order of latest to the oldest. This approach causes severe inconvenience to the user to find the relevant abstracts. The bottleneck of this approach is that there is no option to refine the search results from the retrieved abstracts. The user has to manually skim through all the possible abstracts to find the relevant ones since they might be deeply buried inside the retrieved abstracts. There is also a possibility of extracting irrelevant abstracts. For example, the keyword 'blood cancer' retrieves abstracts that do not have any relevance with that keyword. The closer inspection reveals that all the abstracts in PubMed repository published in the journal 'blood cancer research' are retrieved.
The aforementioned problems are addressed to some extent by the advent of GOPUBMED [4]. The GOP-UBMED addresses the problem of refining the search results by introducing the concept of ontology based browsing. The ontology based browsing uses the domain knowledge and taxonomies to hierarchically organize the terms in the given corpus. User query is processed to retrieve relevant abstracts and structure based on the relevance provided by the ontology. For example, the keyword 'Alzheimer' is linked to the words 'brain development', 'cell', 'memory' etc. in the Gene Ontology (GO) [5,6]. When the GOPUBMED is searched with the keyword Alzheimer', the results are displayed categorically based on the relevant keywords 'brain development'/ 'cell' etc. from the GO. It does not however take into account the two main problems viz i) Semantics and ii) Relevance ranking.
The ontology based searching of the large text corpus (for example, PubMed) is an evolving area of research. The research issues addressed in GOPubmed [4] and GO-KDS [7] is closest to the work presented in this paper . The GOPubmed used the concept of ontology-based search into the PubMed [4]. This system organizes the results obtained using PubMed in the order of hierarchical ontology based arrangement. Subsequently, term matching is used to link the abstracts categorically into the GO terms. The process begins with the user submitting the query. The GOPUBMED links to the PubMed via the e-utilities provided by the Entrez System to retrieve the relevant abstracts. The retrieved abstracts are categorized based on the ontology terms using a basic term matching algorithm. The abstracts thus may be browsed categorically based on the ontology terms. This process yields information similar to the information obtained from the GO.
The GO already provides the information of linking abstracts to the GO terms [5]. The curators of the Gene Ontology Consortium manually annotate this information. In this regard, GOPUBMED provides redundant information already available from GO.
The other method GO-KDS [7] uses a machine learning approach to address the mentioned objectives in section I. The well-annotated abstracts linked to the GO terms are obtained from various sources such as SwissProt, Gen-Bank, and FlyBase etc. This annotated set of 26500 abstracts published prior to 2001 are linked to 3700 GO terms is used to train the support vector machine (SVM) system. The trained SVM system is validated using the abstracts obtained in 2001. The system performed with an accuracy of 70.5 %. The linking of 26500 abstracts is further generalized to 12 million abstracts (in 2001). It was claimed that the 70.5% accuracy obtained on the training set is acceptable and generalizes this result to 12 million abstracts. The 26500 abstracts considered for training may only address a small proportion of diversification posed by 12 million abstracts. Hence, this procedure suffers severe scaling problems.
The other related works include (but not limited to), ALI-BABA [8] which represents the relations among cells, dis-eases, drugs, proteins, species and tissues as a interconnected graph extracted from PubMed. Another work, called, PubFinder [9], requests the abstracts of interest from the user. The abstracts are next scanned to find the list of words, which are indicative of discrimination between the abstracts. These words are used to find the relevant abstracts from the PubMed. The MedMiner [10] is another related work in which the user is asked for list of gene names or processes. These words are used as the query terms for GeneCards, which is similar to PubMed. The underlying processes and techniques have not been clearly understood from this paper [10]. A natural language processing (NLP) based approach to find the relations among the genes, proteins and drugs is incorporated into an online application called Chilibot [11]. Another application which is specific towards finding the relations between two proteins is [12]. This application inspects the frequency of the terms in the abstracts for extract the relevant abstracts related to the proteins [12]. This application is also based on direct term matching concept of query and keywords from abstracts. The semantics of the query and the abstracts are not addressed by any of these systems. This will be the main emphasis of this paper. This paper presents the concept of Semantics Enabled linking of GO with PubMed, called, SEGOPubmed to address the aforementioned problems. The SEGOPubmed adapts the concept of latent semantic analysis (LSA) [13] for linking the PubMed abstracts to the Gene Ontology. The basic idea behind LSA is to map both the documents and the query vector into semantic space before comparison. This process addresses the problem of synonymy by projecting the vectors into low-dimensional space in retrieving abstracts. The comparison between the query and database entries is performed using similarity measure. The cosine similarity measure is found to be well suited for this application. The scores obtained using the similarity measure may be utilized for relevance ranking of the abstracts.

Analysis using well referenced keywords
This section presents the performance analysis of SEGOPubmed. To assess the performance of the proposed method, the comparison of the SEGOPubmed is made with the earlier proposed methods for ontology based literature search, namely GOPubmed. The analysis is performed using a few well-referenced query words such as 'Levimisole Inhibitor' and 'rab 5'. The PubMed is enquired using these keywords for extracting the abstracts. The retrieved abstracts are organized semantically using GO terms as tags.
The word 'Levimisole Inhibitor' retrieves abstracts related to the enzymes that inhibit the affect of the drug Levimi-sole. The search using this keyword retrieves 136 abstracts that are further organized semantically using the GO terms. In this paper, three GO terms viz. 'cell growth', 'collagen', and 'pathogenesis' are used to evaluate the performance of SEGOPubmed. The keyword 'cell growth' is present in 2 out of 136 abstracts, which is evident from GOPUBMED. There is a possibility of other abstracts that might be related to cell growth but do not contain that keyword. For example, the abstract, PMID: 8267680 deals with the affect of alkaline phosphatase in drug resistant tumor cells. This study analyzes the affect of alkaline phosphatase on cell growth that semantically may be extracted using SEGOPubmed. The analysis using the SEGOPubmed extracted 5 abstracts that were rendered highly ranked to be semantically related to 'cell growth'. The next GO term used is 'collagen', which is tensile rich protein of connective tissue. There are 6 abstracts retrieved by the SGP, 5 of which are also retrieved by the GOP-UBMED. The abstract PMID: 3936345 relevance ranked #5 by the SEGOPubmed is not retrieved by term matching techniques. This abstract talks about the inflammatory responses in the collagen-induced arthritis models. Although it has the word collagen, it does not exist as a separate word and hence not retrieved by GOPUBMED. The relevance ranking for the word collagen is in the order PMIDs: 9284952, 10647622, 15601852, 2725422, 3936345 and 10983877.
The other query term used in this study is 'Rab 5', which yielded 623 abstracts. Rab 5 is a protein that controls the fusion between early endosomes and endocytic vesicles. The GO term 'pathogenesis' is used to find abstracts that are semantically related to the keyword. The analysis using SGP resulted in 5 abstracts whose PMIDs are 16113213, 15367862, 15304337, 1554866, and 11785977 in the decreasing order of their relevance. The direct term matching techniques for the above scenario would result in three abstracts which do not include the abstracts 1 and 5 provided by SGP. The close examination of these abstracts reveals that 'pathogenesis' does not occur as a single word in one of the abstracts (PMID: 16113213) and the other abstract (PMID: 11785977) semantically addresses the issues of 'pathogenesis'.
The above empirical analyses reveal that SEGOPubmed incorporates semantics into the ontology based searching of the PubMed. The organization of abstracts according to the relevance would greatly enhance the search experience of the user. Further, the thresholding technique would provide only relevant abstracts to the user.

Statistical evaluation of SEGOPubmed using GO curated term associations
This section outlines the empirical analysis of the proposed SEGOPubmed from GO curated term associations. The construction of ground truth using GO is shown first.

Construction of ground truth using GO
The GO is a consortium that aims to describe the genes and gene products of any organism by providing a controlled vocabulary. The GO extracts the genes/gene products by manually reading the PubMed abstracts and associates them with the vocabulary. This process results in the association of the GO terms with the PubMed abstracts. These series of associations may be downloaded and used as ground truth to evaluate the performance of SEGOPubmed.

Empirical evaluation
The statistically evaluation of the performance of SEGOPubmed is performed using the ground truth constructed as described in the previous section. The ground truth consists of 491 PubMed abstracts associated with the 60 GO terms with Ids GO:0000001 to GO:0000070 (some of the terms with Ids such as GO:0000069, GO:0000065 are missing making the count to 60). The SEGOPubmed is queried with the GO terms and the significant abstracts are retrieved by applying R-test and are compared with the ground truth. The number of true positives and false positives among the retrieved abstracts are recorded to build the receiver operating characteristic (ROC) curves. In signal detection theory, a ROC curve is a plot of true positive fraction Vs. false positive fraction. The ROC curves are one of the ways to analyse the cost benefit ratio. The problem at hand is a binary classifier where the abstract is either associated to GO term or not. There are four possible alternatives that may be obtained from the classifier viz. true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The TP is the number of truly associated abstracts among the retrieved abstracts. The FP is the number of un-associated abstracts among the retrieved abstracts. On the contrary, TN is the number of truly un-associated abstracts among the rejected abstracts by the SEGOPubmed. FN is the number of truly associated abstracts rejected by the SEGOPubmed. The plot of TPF Vs. FPF hence, enables the comparison of performance of various classifiers employed in the study.
The detailed steps of the performed evaluation may be listed as: 1. Construct the Ground truth by downloading GO terms and their associated abstracts.
2. Query the SEGOPubmed using the GO terms as the query words.
3. Find the significant abstracts using the R-test (see methods).
4. Compare the retrieved abstracts with the ground truth.
5. Calculate the true positive fraction and false positive fraction.

Construct the ROC curve
The validation for both testing and training is done using three GO terms i) ribosomal chaperone activity, ii) transition metal ion transport and iii) autophagic vacuole fusion in step 5. The red curve indicates the average of the 3 roc curves. The Fig. 1 (a) shows the cost curve of the performance of SEGOPubmed for the training dataset. As shown in Fig.1 (a), the SEGOPubmed recorded very small FPF and very large TPF. This indicates that the model performed well for the training data. Fig. 1(b) shows the cost curve for the test data. Fig. 1(b) shows the similar performance as seen for the training data and hence can be used to classify the test documents to the GO terms.
The cost curves plotted show the performance of SEGOPubmed for only 3 GO terms. For a complete investigation of the performance for all the 60 GO terms considered, TPF and FPF values at the threshold given by the R-test are mentioned in tables 1 and 2.

Conclusions
This paper opens a new paradigm to semantic-sensitive ontology based browsing and linking of large corpus (i.e., PubMed) to ontologies (for example, GO). The LSA technique is applied on the PubMed abstracts obtained based on the user query and the semantic similarity between the query and the abstracts. The analysis using well-referenced keywords show that the proposed semantic-sensitive technique outperformed the string comparison based techniques in associating the relevant abstracts to the GO terms. The SEGOPubmed also extracted the abstracts in which the keywords do not appear in isolation (i.e. they appear in combination with other terms) that could not be retrieved by simple term matching techniques. The present study is limited to only a few well-referenced keywords. A comprehensive and evaluation based on semantic-space similarity of the SEGOPubmed is currently under investigation. The present technique also does not incorporate the concept of polysemy in linking the abstracts to the GO terms. This feature may be introduced into ontol-

Methods
The PubMed is one of the highly used repository for biomedical literature [1]. The results from the PubMed query are arranged in the order of entry of the abstracts into the repository. The order of arrangement intended by most of the users is the relevance of the abstracts to the query. The non-availability of such a feature forces the user to skim through the abstracts to obtain the relevant abstracts. The ontology-based search is hence a most relevant alternative. The LSA is incorporated into this framework to find the semantically meaningful and relevant abstracts. The abstracts are ordered based on relevance and ontology based terms. The main building blocks of the SEGOPubmed are shown in the Fig. 2. As shown in Fig. 2, the user first inputs the query. The relevant abstracts are obtained from the PubMed. The text processing is performed on these abstracts and term frequency (TF) and inverse document frequency (IDF) are obtained. The LSA is performed using the TF and IDF. This process incorporates the semantics into the retrieved document space. Next the GO terms are mapped into the semantic space and semantically related abstracts are retrieved and displayed based on relevance tagged to each of the GO terms.

Creation of corpus
The process begins by collecting text (for example, PubMed abstract) into a corpus. First, text pre-processing is performed to extract the meaningful words from the abstracts. This is performed by i) eliminating the stop words, ii) stemming the words to their root words and iii) forming the dictionary from all the stemmed words. The irrelevant words are eliminated following the list of words provided by [14]. These are the words that do not offer any significant improvement in the semantics or the search retrieval and also introduce noise in the corpus. A porter-stemming algorithm is used to stem the words in the document to their root words [15].
A matrix is created from the corpus, having one row for each unique word (for example, GO terms) in the corpus and one column for each document (PubMed abstract). Weightings and normalizations are often applied to the data matrix that take into account the frequency of key word i (k i ) in the document j (d j ) and the frequency of k i across all documents, such that distinctive words that appear infrequently are given the most weight. The cells of the matrix consist of weighted term-frequency (T-F) and inverse document frequency (IDF) matrix described in the previous section. Since many words do not appear in any given document, the matrix is often sparse.
The term-frequency (TF) matrix is constructed as proposed Landauer et al [16]. The Inverse Document Frequency (IDF) matrix is constructed using the Eq. 1.
Here, is the total number of documents in the corpus and is the total number of documents where the term appears [17].
The TF and IDF are next multiplied to form a TF-IDF matrix. The next step as shown in Fig. 2 is to apply LSA idf D Schematic diagram of the proposed SEGOPubmed Figure 2 Schematic diagram of the proposed SEGOPubmed using the TF and IDF matrices generated in the text preprocessing step.

Latent semantic analysis (LSA)
The TF-IDF data matrix (A) is first normalized across the rows by dividing frequency of the word in each document by the highest frequency of that word in all the documents. The LSA transforms the high dimensional TF-IDF data matrix (A) into a low-dimensional latent space through a mathematical procedure known as singular value decomposition (SVD) [18]. SVD is a technique that creates an approximation of the original word by document matrix. After SVD, the original matrix is equal to the product of three matrices, word by latent concept, latent concept by latent concept and latent concept by document. The size of each latent concept (singular value) corresponds to the amount of variance captured by corresponding Eigen vector. Because the singular values are ordered in decreasing size, it is possible to remove the smaller dimensions and still account for most of the variance. The approximation to the original matrix is optimal, in the least squares sense, for any number of dimensions one would choose. In addition, the removal of smaller dimensions introduces linear dependencies between words that are distinct only in dimensions that account for the least variance. Consequently, two words that were distant in the original vector space can be near in the compressed space, causing the inductive semantic space and knowledge acquisition effects reported in the [19].
The SVD is a matrix factorization technique that decomposes the TF-IDF matrix into three different matrices as shown in Eq. 2.
The first s Eigen vectors are considered for mapping the high-dimensional TF-IDF data matrix (A) to the lowdimensional space as shown in the Eq. 3.
Besides facilitating the dimensionality reduction, the semantic relations are also incorporated in the reduced A USV T = . (2) Box plots of the possible ranks for one query Figure 3 Box plots of the possible ranks for one query dimensional space. The GO terms are used as the query vector (q 0 ) to inquire the Eigen mapped documents. Since the comparison needs to be performed in the same space, the query expansion is performed by mapping the query vector to the Eigen space as shown in Eq. 4.
The similarity of each expanded query is found with respect to the document vectors, which are also mapped to Eigen space. This enables the comparisons of the query and the documents based on semantics. A cosine similarity is used as a measure of similarity as proposed in [18] The breakthrough provided by LSA is a solution to the synonymy problem, i.e., the problem that multiple words can express the same meaning. In the basic vector space model, distinct words with the same meaning are kept distinct, but LSA gives them equivalent or near equivalent meanings. LSA's solution to the synonymy problem made it attractive to a variety of researchers outside of information retrieval. The geometrical interpretation provided in [20] gives an insight into the underlying principles of LSA.
In general, there are many parameters (for example selecting threshold) that need to be determined for any semantic space. Most often performance of semantic is evaluated by human experts.

Automated thresholding to extract semantically relevant documents
An R-test is employed to extract semantically relevant documents for a query [21]. The following procedure describes the R-test i) Randomly select the terms from the term document matrix.
q q T U s s = − ∑ 0 1 . (4) cos . θ j d j T q d j q = 2 2 (5) Box plots of the documents under null hypothesis for one query Figure 4 Box plots of the documents under null hypothesis for one query ii) Find the relevance (ranking using similarity measure or other ways) of the query with the documents based on the GO and PubMed term document space.
iii) Repeat the steps 1 and 2 for 25 iterations as proposed in [21]. iv) Arrange the document ranks based on the median rank (r) as shown in the Fig. 3.
v) Consider the consistently high ranked documents under null hypothesis. These ranks will follow a uniform distribution as shown in the Fig. 4.
vi) For each document, find the p-value (p) as given by the Eq. 6 The documents that are considered significant using pvalue are considered relevant and displayed in the order of relevance. This process is repeated for all the GO terms used as query vectors. The user can now skim through the results using ontology terms based on relevance. Please note that only relevant abstracts are displayed because of the thresholding process.