GLAD4U: deriving and prioritizing gene lists from PubMed literature
© Jourquin et al.; licensee BioMed Central Ltd. 2012
Published: 17 December 2012
Answering questions such as "Which genes are related to breast cancer?" usually requires retrieving relevant publications through the PubMed search engine, reading these publications, and creating gene lists. This process is not only time-consuming, but also prone to errors.
We report GLAD4U (Gene List Automatically Derived For You), a new, free web-based gene retrieval and prioritization tool. GLAD4U takes advantage of existing resources of the NCBI to ensure computational efficiency. The quality of gene lists created by GLAD4U for three Gene Ontology (GO) terms and three disease terms was assessed using corresponding "gold standard" lists curated in public databases. For all queries, GLAD4U gene lists showed very high recall but low precision, leading to a low F-measure. In comparison, EBIMed's recall was consistently lower than GLAD4U's, but its precision was higher. To present the most relevant genes at the top of a list, we studied two prioritization methods based on publication count and the hypergeometric test, and compared the ranked lists and those generated by EBIMed to the gold standards. Both GLAD4U methods outperformed EBIMed for all queries based on a variety of quality metrics. Moreover, the hypergeometric method allowed for better performance by thresholding genes with low scores. In addition, manual examination suggests that many false positives could be explained by the incompleteness of the gold standards. The GLAD4U user interface accepts any valid PubMed query, and its output page displays the ranked gene list, information associated with each gene, and chronologically ordered supporting publications, along with a summary of the run and links for file export, functional enrichment analysis and protein interaction network analysis.
GLAD4U has a high overall recall. Although precision is generally low, the prioritization methods successfully rank truly relevant genes at the top of the lists to facilitate efficient browsing. GLAD4U is simple to use, and its interface can be found at: http://bioinfo.vanderbilt.edu/glad4u.
The physical development and phenotype of organisms can be thought of as a product of genes interacting with each other and with the environment. Therefore, it is common for a scientist to ask questions like "Which genes are related to breast cancer?", "Which genes are involved in embryonic development?", and "Which genes are functionally related to TP53?"
The current answers to these questions are primarily contained in the articles indexed in the MEDLINE database. Traditionally, answering these questions requires individuals to retrieve relevant publications through the PubMed search engine and then to create gene lists by manually extracting gene-centered information from retrieved literature. This process is not only time-consuming, but also prone to errors. First, it is difficult to ascertain that all relevant literature is processed. Second, it is unlikely that all relationships in a publication will be detected. Third, individual researchers tend to extrapolate based on domain knowledge.
Over the past decade, bioinformatics approaches have been developed to address this issue. One of the most successful projects in this area is the Gene Ontology (GO) project. GO produces a structured, precisely defined, and controlled vocabulary (i.e., GO terms) for describing the roles of genes and gene products in different species. Genes are associated with GO terms through manual curation as well as computational inference. A researcher can now go to the GO website to get a list of genes related to a GO term of interest. However, because the GO vocabulary only describes gene products in terms of their associated biological processes, cellular components and molecular functions, users are restricted to questions that can be expressed in this vocabulary. Moreover, processes, functions or components that are unique to diseases, such as oncogenesis, are not included in GO because causing cancer is not the normal function of any gene.
A useful resource specifically designed for disease studies is the Online Mendelian Inheritance in Man (OMIM) project. OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes. It contains information on all known Mendelian disorders. However, information on complex diseases such as cancer and diabetes is lacking in OMIM.
In addition to manual curation, text mining tools have been developed to assist gene list creation. As an example, EBIMed [5, 6] combines text mining with co-occurrence-based analysis to generate a prioritized list of genes for a user-provided query. Specifically, EBIMed collects MEDLINE records and available full text documents for a user-provided query, identifies protein names, drugs, species, or GO terms in the documents, and prioritizes genes/proteins based on the number of co-occurrences of the different pairs (protein/protein, protein/drug, protein/species, protein/GO term) in the sentences of the documents in which they appear. EBIMed and similar tools, such as FACTA and SciMiner, provide more flexible ways to create gene lists that are not limited to certain aspects of biology. Nevertheless, they usually require heavy computation, and the relevance of the resulting gene lists to the input queries has not been systematically evaluated.
Here, we report GLAD4U (Gene List Automatically Derived For You), a new web-based gene retrieval and prioritization tool. GLAD4U takes advantage of existing resources at the National Center for Biotechnology Information (NCBI) to ensure computational efficiency. It provides a simple user interface that facilitates intuitive usage and interpretation of results. The quality of gene lists created by GLAD4U is assessed using corresponding "gold standard" lists curated in GO, GAD (Genetic Association Database), and OMIM. The performance of GLAD4U is also compared with that of EBIMed.
Overall quality of the retrieved gene lists (table columns: GO/MIM gene count; GLAD4U gene count; EBIMed gene count)
The low precision of GLAD4U may be partially attributed to the incompleteness of the annotation in GO and GAD/OMIM. However, it is likely that the original gene lists include many irrelevant genes. In this case, a prioritization step that ranks truly relevant genes at the top of a list would certainly facilitate efficient browsing.
We studied the performance of two methods to prioritize the gene lists. The first, "GLAD4U Counts", is based solely on the number of supporting publications, as commonly implemented in other software [10, 11]. The second, "GLAD4U Hypergeometric", is proposed in this study and is based on the hypergeometric test (see the Methods section for details). We used the three GO terms and three disease terms mentioned above as queries to evaluate the performance of our prioritization methods. We also included the prioritized gene lists returned by EBIMed for comparison.
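The count-based ranking can be illustrated with a minimal Python sketch. It assumes a mapping from PMIDs to gene identifiers is already in memory; the function name, the tie-breaking rule and the toy data are hypothetical and are not taken from the GLAD4U implementation.

```python
from collections import Counter

def rank_by_publication_count(query_pmids, pmid_to_genes):
    """Rank genes by the number of query-relevant publications linked to them."""
    counts = Counter()
    for pmid in set(query_pmids):
        # Publications absent from the link table contribute nothing.
        for gene in pmid_to_genes.get(pmid, ()):
            counts[gene] += 1
    # Sort by descending count; break ties alphabetically (an assumption)
    # so that the output is reproducible.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy data: the PMIDs are made up for illustration.
toy_links = {101: {"TP53", "MDM2"}, 102: {"TP53"}, 103: {"BRCA1"}}
ranking = rank_by_publication_count([101, 102, 103, 999], toy_links)
print(ranking)
```

A gene supported by many publications rises to the top regardless of how heavily it is studied overall, which is the bias the hypergeometric variant is meant to correct.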
Comparison of different prioritization methods (table columns: precision at k = 50 and precision at k = 100 for each of the three methods)
The precision-recall curve and the AP score factor in precision at all recall levels. For ranked gene lists, particularly in web-based applications, this may not be of interest to users. In most scenarios, what matters may be the number of relevant genes on the first page or the first several pages. "Precision at k" is usually used to measure precision at a fixed low level of retrieved results, e.g., the top k results. To this end, we calculated the precisions for the top 50 (k = 50) and top 100 (k = 100) genes for all three methods, for each query (Table 2). GLAD4U Counts and GLAD4U Hypergeometric methods maintained higher precisions for the top 50 genes compared to EBIMed (0.74±0.15, 0.77±0.20 and 0.54±0.18, respectively), as well as for the top 100 genes (0.64±0.20, 0.69±0.25 and 0.42±0.20, respectively). Although the AP-based comparison may be biased against EBIMed owing to its low overall recall, precision at 50 and 100 only focus on the top ranking genes and are not affected by the overall recall. These results suggest that GLAD4U can produce lists where relevant genes are ranked at the top.
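As a minimal Python sketch of "precision at k", assuming the ranked list is a sequence of gene identifiers and the gold standard is a set (the function name is hypothetical, and dividing by k rather than by the list length is a design assumption):

```python
def precision_at_k(ranked_genes, relevant_genes, k):
    """Fraction of the top-k ranked genes that belong to the gold standard."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_genes[:k]
    hits = sum(1 for gene in top_k if gene in relevant_genes)
    # Divide by k, not len(top_k), so that lists shorter than k are penalised.
    return hits / k

# Toy example with placeholder gene names.
ranked = ["A", "B", "C", "D"]
gold = {"A", "C", "E"}
print(precision_at_k(ranked, gold, 2))
```

Because only the first k entries are inspected, the value does not depend on how many relevant genes the tool retrieves overall.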
First 10 genes retrieved by GLAD4U and not listed in the gold standard lists
21051655, 21051533, 20849854, 20849851, 20832750, 20822933, 20708156, 20659896, 20657550, 20644561
20736797, 20573801, 20558744, 20473571, 20463961, 20430109, 20393480, 20345980, 20307495, 20299663
20714214, 20598117, 20596624, 20573831, 20564213, 20515470, 20232342, 20071475, 19996270, 19966300
20562100, 20514402, 20507639, 20490331, 20459702, 20447714, 20213502, 20197401, 20164027, 20154216
20548952, 20547768, 20471435, 20093486, 19932628, 19917613, 19875824, 19833733, 19808702, 19747914
20636820, 20617899, 20587542, 20506224, 20445553, 20363965, 19916867, 19826049, 19811426, 19794071
20940411, 20665026, 20644561, 20629644, 20564216, 20453000, 20388712, 20181890, 20177052, 20072652
20813833, 20515755, 20514462, 20447717, 20404348, 20372781, 20371612, 20346171, 20153722, 20148895
20619274, 20430109, 20298786, 20103619, 19671194, 19566940, 19328186, 19120277, 18983687, 18848838
19085961, 18836436, 18801192, 18691180, 18655775, 18533182, 18388957, 17585055, 17379210, 16607280
20925572, 20662730, 20577119, 20537141, 20429690, 20223792, 20160196, 19891555, 19673942, 19536175
20597806, 19811365, 19150652, 18837962, 18573267, 18178212, 17551100, 16872738, 16778331, 16109323
20713912, 20368210, 20350538, 20346360, 20234137, 20142024, 20113292, 20102554, 20087954, 20083731
21072525, 21060006, 20960113, 20852445, 20812180, 20717043, 20669348, 20637366, 20592457, 20479155
21044781, 20805569, 20733302, 20683147, 20676960, 20346360, 20339115, 20184533, 20074254, 20068351
20577119, 20543198, 20368210, 20346360, 20137368, 19635983, 19479237, 19430483, 19346663, 19330901
20831043, 20144152, 20044737, 19842096, 19779464, 19479237, 19131662, 18724972, 18510051, 18088254
20708777, 20339375, 19820005, 19567537, 19082699, 18663314, 18294861, 17980006, 17296872, 17121536
20831027, 20813695, 20679547, 20349406, 20160196, 20117991, 19926873, 19684612, 19289653, 19286756
21044781, 20593932, 20552610, 20528971, 20516205, 20443850, 20385503, 20376890, 20166815, 20150538
Regarding hypertension, renin (REN) is part of the renin-angiotensin system (RAS). Proteins in this system are thought to be important regulators of blood pressure and are involved in the onset of hypertension [29–32]. Overexpression of REN leads to hypertension via chronic overproduction of AngII [33, 34], and inhibiting the regulators of the RAS, such as REN, is a common treatment for hypertension. Adiponectin (ADIPOQ) is an adipocytokine synthesized by the adipose tissue. It has been proposed as a biomarker for hypertension, as low plasma levels correlate with a higher risk of hypertension [35–38], and possibly with coronary artery disease, kidney disease, left ventricular hypertrophy, and even myocardial infarction [36, 39–41]. Interestingly, REN and ADIPOQ also present polymorphisms that seem to be linked to the therapeutic response to hypertension [31, 40, 42–46].
From these publications, we believe that MDM2 and ING1 should be part of the apoptosis list, and that REN and ADIPOQ should be part of the hypertension list. These results accentuate the incompleteness of the gold standards and suggest that GLAD4U can help in the completion of the gold standard lists.
GLAD4U uses a simple query interface for users to submit their queries. Any queries that are valid in a PubMed search can be used in GLAD4U. In the query interface, users can also modify the default parameters of the application, including: search space (all species or restricted to human genes), the number of genes to present per result page, the maximum number of publications supporting each gene returned in the result page and the number of pages to build for each of the algorithm runs.
At the top of the output page, a summary of the run is also given: query term and options chosen, number of genes and publications processed, as well as a hyperlink to download the complete results in the comma-separated values (CSV) format. Although this file may be difficult for humans to interpret, it can be used as input for other computational analysis tools. For example, we have implemented a "send data to Functional Enrichment Analysis" link in the result page (Figure 3) of GLAD4U for submitting a gene list to the functional enrichment analysis tool WebGestalt [49, 50]. This function is particularly handy for the functional interpretation of a gene list, e.g., a list returned by a disease term query. It could help reveal biological processes associated with the disease. As an example, enrichment analysis on the first 100 genes returned by the "Obesity" query linked this disease to biological processes such as "fat cell differentiation" (20 genes, multiple-test adjusted enrichment p-value (adjp) = 5.27e-28), "lipid metabolic process" (39 genes, adjp = 5.05e-20) and "response to insulin stimulus" (17 genes, adjp = 4.99e-18). In addition, we have also implemented a "visualize genes in a protein-protein interaction network" link, which allows the visualization of interactions among the protein products of the genes based on the Cytoscape Web utility (http://cytoscapeweb.cytoscape.org/).
Reading through all relevant literature to generate a gene list is time consuming [10, 51–53], a common concern that came up in all interviews of experimentalists that we performed (results not shown). GLAD4U addresses this problem by automatically creating a ranked list of genes following a user's input query.
One important feature of GLAD4U is its information processing. Based on our survey among experimentalists, GLAD4U follows the exact same steps that an experimentalist would follow: gather literature, extract gene information and create an expert list. Whether a user queries a disease, a non-disease phenotype, a biological process or a gene, GLAD4U will fetch corresponding biomedical publications using NCBI's eUtilities API, retrieve relevant gene information, rank the genes and send them back to the user. GLAD4U ensures computational efficiency through effective use of existing NCBI resources, which also made it one of the winning applications in the National Library of Medicine (NLM)'s 2011 Software Development Challenge on the Innovative Uses of NLM Information.
Another important feature of GLAD4U is its simplicity. Researchers will be at ease using GLAD4U because its search engine is powered by PubMed's API [48, 52] and behaves similarly to Entrez-PubMed. GLAD4U outputs a clean result page where the user can easily find genes relevant to the concept queried and supporting publications. Additionally, the use of PubMed's API makes GLAD4U almost maintenance-free. GLAD4U will update itself along with the MEDLINE library update. This will ensure that GLAD4U's results will always be up-to-date with the current literature.
Several tools rely on PubMed to build disease candidate gene lists [5, 8, 52, 56, 57]. EBIMed and FACTA are concept-oriented applications for mining existing biomedical literature. They attempt to automatically establish the publication-concept (including genes) relationship through in-house text mining tools, whereas GLAD4U relies on the manually curated publication-gene mapping provided by NCBI. According to our results, manual mapping seems to have a notable impact on performance. Nevertheless, automated mapping would allow flexibility in extending the services to concepts other than genes.
Although using the biomedical literature as a knowledge source seems intuitive [51, 58, 59], certain limitations exist: the literature is indexed based on titles, abstracts and keywords, not on full text [60, 61]. Thus, a set of publications retrieved may be incomplete (i.e., some publications relevant to the concept queried will not be retrieved because they do not contain the necessary keywords in their titles or abstracts). There is also a possible bias in using the biomedical literature and ontology, as the most studied genes (those with the most publications) will have more weight [51, 63] at the expense of more relevant genes that might only be featured in a few papers. For this reason, we use the hypergeometric test to rank genes based on how likely it would be to retrieve them by chance alone, given the number of publications retrieved for a gene among the total number of publications linked to this gene. The less likely it is (that is, the smaller the p-value), the higher the score will be for the gene. Thus, even though GLAD4U retrieves its data solely from the biomedical literature, it prioritizes genes through a statistical analysis of the retrieved data.
The most obvious usage of GLAD4U is to generate a gene list for an input concept, which has been demonstrated in this paper. This can be extremely useful for the design of targeted high-throughput experiments. If one needs to create a custom array or select proteins for targeted quantitative proteomic analysis using the selected reaction monitoring (SRM) assay, one can use GLAD4U and review the ranked list of genes that likely should be included in the experimental design. Besides generating gene lists for individual concepts, GLAD4U is very flexible and allows production of gene lists related to multiple concepts, which cannot be done by searching GO or OMIM databases. For example, a query of "smoking AND cancer" can generate a gene list that could potentially help explore gene-environment interactions in cancer. GLAD4U also holds the potential to help improve the functional annotation of genes. Although GO contains more than 17,000 terms [4, 65] and is regularly used in the bioinformatics field as a standard [4, 66], it is not complete [51, 67]. Through manual checking of the top genes returned by GLAD4U that were not part of the gold standard lists, we easily found evidence that these genes were indeed linked to the query, and probably should have been included in the gold standard.
Finally, because the GLAD4U prioritization algorithm assigns scores to genes, removing the genes with a low score consistently improves the quality of the results. This result justifies thresholding GLAD4U results by default.
GLAD4U is a freely available web-application for creating expert candidate gene lists tailored to a user's query. It follows the same steps that the experimentalist would follow: gather literature, extract gene information and create an expert list. The simple interface of GLAD4U ensures easy usage and interpretation. Because GLAD4U relies on existing biomedical literature, it has an immediate credibility with experimentalists, who use this resource as a primary means for enhancing their knowledge and expertise. Although the gene list directly returned from a PubMed query is usually lengthy and noisy, the prioritization method implemented in GLAD4U successfully ranks truly relevant genes at the top of the list and facilitates efficient browsing of the list.
GLAD4U relies on the eSearch application programming interface (API) developed by the NCBI for retrieving publications from the MEDLINE database. For a user query, eSearch returns an XML file containing the number of publications returned by the query and all publication identification IDs (PMIDs). The XML file is parsed to get the list of PMIDs associated with a user query.
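The retrieval step can be sketched in Python as follows. This is an illustration only, not the GLAD4U code: the function names and the `retmax` value are assumptions, and the sample document is a hand-written fragment in the shape of an eSearch response so that the parser can be run without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(query, retmax=1000):
    """Build an eSearch request URL for a PubMed query (retmax is illustrative)."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)

def parse_esearch_xml(xml_text):
    """Return (total hit count, list of PMIDs) from an eSearch XML document."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    pmids = [node.text for node in root.findall("./IdList/Id")]
    return count, pmids

# Hand-written sample in the shape of an eSearch response.
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>21051655</Id>
    <Id>21051533</Id>
  </IdList>
</eSearchResult>"""

count, pmids = parse_esearch_xml(SAMPLE_XML)
print(build_esearch_url("breast cancer", retmax=5))
print(count, pmids)
```

In a live setting the URL would be fetched (for example with `urllib.request.urlopen`) and the response body passed to `parse_esearch_xml`.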
Genes associated with PMIDs are retrieved based on the gene-to-publication link table provided by Entrez-Gene. Links between Entrez-Gene IDs and PMIDs are created based on both manual curation within the NCBI and integration of information from other public databases. Publications linked to more than 500 genes are removed from the link table because they lack specificity. After this process, the link table included 3,509,732 genes and 647,523 publications for all organisms, among which 30,343 genes and 306,487 publications were related to human (as of 05/14/2011).
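The specificity filter described above could look like the following Python sketch, assuming the link table is available as (gene ID, PMID) pairs. The function name and input format are hypothetical; only the threshold of 500 genes comes from the text.

```python
from collections import defaultdict

MAX_GENES_PER_PUBLICATION = 500  # threshold stated in the text

def build_link_table(gene_pmid_pairs, max_genes=MAX_GENES_PER_PUBLICATION):
    """Map each PMID to its linked genes, dropping unspecific publications."""
    pmid_to_genes = defaultdict(set)
    for gene_id, pmid in gene_pmid_pairs:
        pmid_to_genes[pmid].add(gene_id)
    # Keep only publications linked to at most max_genes genes.
    return {
        pmid: genes
        for pmid, genes in pmid_to_genes.items()
        if len(genes) <= max_genes
    }

# Toy data with a lowered threshold so the filter is visible.
pairs = [(1, "p1"), (2, "p1"), (3, "p1"), (1, "p2")]
table = build_link_table(pairs, max_genes=2)
print(table)
```

Publications such as genome-wide sequencing reports, which are linked to very many genes, would be removed by this step.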
We studied two methods to prioritize the retrieved genes, based on publication counts or the hypergeometric test. To prioritize using counts ("GLAD4U Counts"), each gene receives a score equal to the number of publications describing it in the link table. The other method ("GLAD4U Hypergeometric") uses the hypergeometric test to prioritize all retrieved genes. Specifically, for a given query Q and a gene G, let n be the number of publications retrieved for the query and present in the gene-to-publication link table (query-relevant publications) and k be the number of query-relevant publications that involve the gene G. Let us further assume that there are m publications in the gene-to-publication link table, j of which involve the gene G (gene-relevant publications). This method calculates the probability p of observing k or more query-relevant publications for the gene by chance, based on the hypergeometric test:
\[ p = \sum_{i=k}^{\min(n,j)} \frac{\binom{j}{i}\binom{m-j}{n-i}}{\binom{m}{n}} \]
The gene is then scored from p so that a smaller p-value yields a higher score.
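A minimal Python sketch of the hypergeometric scoring follows, using the symbols m, j, n and k as defined above. The tail probability P(X >= k) follows directly from those definitions; taking the negative base-10 logarithm as the score is an assumption made here for illustration, since the text only states that smaller p-values give higher scores. Function names are hypothetical.

```python
import math

def hypergeometric_tail(m, j, n, k):
    """Return P(X >= k) as an exact (numerator, denominator) pair of integers.

    m: publications in the link table; j: publications involving the gene;
    n: query-relevant publications; k: query-relevant publications
    involving the gene.
    """
    upper = min(n, j)
    if not 0 <= k <= upper or j > m or n > m:
        raise ValueError("expected 0 <= k <= min(n, j), j <= m and n <= m")
    numerator = sum(
        math.comb(j, i) * math.comb(m - j, n - i) for i in range(k, upper + 1)
    )
    return numerator, math.comb(m, n)

def hypergeometric_score(m, j, n, k):
    """Score a gene as -log10(p); the log transform is an assumption."""
    numerator, denominator = hypergeometric_tail(m, j, n, k)
    # Work on the logs of exact integers to avoid floating-point underflow.
    return math.log10(denominator) - math.log10(numerator)

# Toy numbers: 10 publications, 3 involve the gene, 4 match the query,
# 2 of those involve the gene, so p = 70/210 = 1/3.
print(hypergeometric_tail(10, 3, 4, 2))
print(hypergeometric_score(10, 3, 4, 2))
```

Exact integer arithmetic keeps the sketch simple; for inputs as large as the full link table, a log-space implementation would likely be preferable for speed.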
We used GO and disease terms as queries to evaluate the performance of the GLAD4U algorithms. Gene lists curated in GO, OMIM and GAD were used as a gold standard (i.e. relevant genes). We developed a Perl script to parse the files "gene2go.gz" and "gene_ontology.1_2.obo" in order to generate gene lists for GO terms (as of 12/20/2009). Because of the parent-child relationship among the GO terms as described in the GO Directed Acyclic Graph, genes with granular annotations were associated with their parent terms using the Perl script. Using GAD, we identified all genes associated with a disease term. Using OMIM, we retrieved all IDs prefixed with "%" and "#" with the query in the title. Corresponding gene IDs were mapped by parsing the file "mim2gene" (as of 12/22/2009). For each disease term, the lists obtained with GAD and OMIM were merged to serve as a gold standard. Retrieval performance was evaluated using precision, recall and F-measure. The F-measure is calculated as 2pr/(p+r), where p is the precision, defined as the number of retrieved genes that are relevant divided by the total number of retrieved genes, and r is the recall, defined as the number of retrieved genes that are relevant divided by the total number of relevant genes. We used the precision/recall curve, average precision (AP) and precision at the top k retrieved genes (k = 50 and k = 100) to evaluate the performance of our gene prioritization methods, and compared it to the performance of the ranked lists generated by EBIMed. All performance values are expressed in the text as mean ± standard deviation.
The GLAD4U user interface was developed in HTML and PHP. The scripts to deploy and update the algorithm on web servers were written in Perl, while the hypergeometric test scores are generated in C. jQuery was used to implement user features such as the ability to hide/show options and functions. An email notification module was implemented to allow users to retrieve their results at a later time. GLAD4U (http://bioinfo.vanderbilt.edu/glad4u) is platform-independent and released under a GNU GPL license. It was tested on Internet Explorer 5.0, Firefox 3.0, Safari 3.0, Chrome and Netscape 7, as well as higher versions of these browsers.
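The evaluation metrics can be sketched in Python as below. Precision, recall and F-measure follow the usual set-based definitions and the 2pr/(p+r) formula given above. The average precision shown here uses the common definition (the mean, over all relevant genes, of the precision at each rank where a relevant gene appears); this is an assumption, as the text does not spell out its exact formula. Function names are hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for a retrieved gene list."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    p = true_positives / len(retrieved) if retrieved else 0.0
    r = true_positives / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f

def average_precision(ranked, relevant):
    """Average precision of a ranked list (common definition, assumed here)."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, gene in enumerate(ranked, start=1):
        if gene in relevant:
            hits += 1
            # Precision measured at the rank of each relevant gene.
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Toy example with placeholder gene names.
ranked_list = ["A", "B", "C", "D"]
gold_standard = {"A", "C", "E"}
print(precision_recall_f(ranked_list, gold_standard))
print(average_precision(ranked_list, gold_standard))
```

Relevant genes that are never retrieved (gene "E" in the toy example) lower both recall and average precision, which is why a tool with low overall recall is disadvantaged in an AP-based comparison.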
API: application programming interface
GAD: Genetic Association Database
GLAD4U Counts: GLAD4U prioritization algorithm using counts
GLAD4U Hypergeometric: GLAD4U prioritization algorithm using the hypergeometric test
GLAD4U: Gene List Automatically Derived For You
ING1: inhibitor of growth family, member 1
NCBI: National Center for Biotechnology Information
OMIM: Online Mendelian Inheritance in Man
PMIDs: publication identification IDs
SRM: selected reaction monitoring.
We thank Dr. Hua Xu for useful comments on the analysis and Ms. Brandy Weidow for proofreading the manuscript. We appreciate the users who provided useful information through our interviews and online survey. This work was supported by the National Institutes of Health (NIH)/National Institute of General Medical Sciences (NIGMS) through grant R01GM088822, the NIH/National Cancer Institute (NCI) through grant U54CA113007, and the NIH/National Institute of Mental Health (NIMH) through grant P50MH078028.
This article has been published as part of BMC Genomics Volume 13 Supplement 8, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.