
Table 1 Dataset description

From: Disclosing ambiguous gene aliases by automatic literature profiling

 

| | Initial dataset | Dataset with PubMed abstracts | Dataset fulfilling the algorithm’s requirements* | Final dataset (ambiguous aliases excluded) |
| --- | --- | --- | --- | --- |
| EntrezGene official symbols | 100 | 73 | 68** | 68 |
| Aliases | 425 | 256 | 223 | 165 |
| Abstracts in text corpus | - | 13355 | 12088 | 9005 |
| Unique PubMed IDs in text corpus | - | 11022 | 10312 | 7523 |
| Redundancy in text corpus (%) | - | 21 | 16.6 | 19.7 |

  1. * The algorithm requires the official gene symbol, at least one alias, and one internal control to produce text corpora of PubMed abstracts. It also requires an informative group-specific vocabulary that passes the filters for ubiquitous terms.
  2. ** Five official gene symbols (DERL3, KCNA7, KCNJ14, MED18, and TBRV4-2) did not fulfil the algorithm’s requirements because their aliases retrieved no PubMed abstracts.