
HLA-SPREAD: a natural language processing based resource for curating HLA association from PubMed abstracts


The extreme complexity of the Human Leukocyte Antigen (HLA) system and its nomenclature makes it difficult to interpret and integrate information on HLA associations with diseases, Adverse Drug Reactions (ADRs) and transplantation. A PubMed search returns ~ 146,000 studies on HLA reported from diverse locations. Currently, the IPD-IMGT/HLA database (Robinson et al., Nucleic Acids Research 48:D948–D955, 2019) houses data on 28,320 HLA alleles. We developed an automated pipeline with a unified graphical user interface, HLA-SPREAD, that provides structured information on SNPs, Populations, REsources, ADRs and Diseases. Information on HLA was extracted from ~ 28 million PubMed abstracts using Natural Language Processing (NLP). Python scripts were used to mine and curate information on diseases, filter false positives and categorize diseases into 24 tree-hierarchical groups, and Named Entity Recognition (NER) algorithms followed by semantic analysis were used to infer HLA association(s). This resource, covering 109 countries and 40 ethnic groups, provides insights on markers associated with allelic/haplotypic association in autoimmune, cancer, viral and skin diseases, transplantation outcome and ADRs for hypersensitivity. Summary information on clinically relevant biomarkers related to HLA disease associations, with mapped susceptible/risk alleles, is readily retrievable from HLA-SPREAD. The resource is available at URL. This resource is the first of its kind and can help uncover novel patterns in HLA gene-disease associations.


Human Leukocyte Antigen (HLA) locus consists of six classical genes (HLA-A, −B, −C, −DP, −DQ and -DR) that play an important role in eliciting immune response against pathogens [24] and three non-classical genes (HLA-E, −F and -G) that interact with Natural Killer cells to regulate virus-infected and malignant cells [25]. HLA genes harbour a large number of mutations. As of September 2020, there are 28,320 HLA alleles reported in IPD-IMGT/HLA database. These variations mostly arise to generate defensive mechanisms against pathogens. However, some variations also confer risk to autoimmune diseases like rheumatoid arthritis, multiple sclerosis, Type 1 diabetes and Graves’ disease etc. More than 100 different autoimmune diseases, infectious diseases and adverse drug reactions have been reported to be associated with HLA genes [4, 10, 32]. These alleles have clinical utility as diagnostic markers for example in rheumatoid arthritis, ankylosing spondylitis [17,18,19]. They are also used in genetic screening e.g. HLA-B*57:01 in Caucasian population for abacavir hypersensitivity, HLA-B*15:02 in Chinese and Asians for carbamazepine induced life-threatening conditions like Stevens-Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN) [9, 29]. In the context of transplantation, mismatch of HLA alleles between donor and recipient impacts the solid organ and hematopoietic stem cell transplantation outcomes [3, 26]. Each of the reported studies is unique in itself as they describe the molecular basis of disease associations, HLA matching and anti-HLA antibody formation that are relevant for transplantation. Besides, studies also report some relevant and associated clinical information, e.g. different HLA-B27 subtypes are reported to be associated with clinical categories under spondyloarthropathies [16]. There are other studies that implicate HLA allele association with the composition of gut microbiome and diseases [2, 12, 36]. 
The expanse of this information is immense, as there is wide genetic variability and heterogeneity among populations [6]. Although advancements in HLA typing technologies have been beneficial in identifying novel HLA sequences [30], they have also led to the same HLA allelic variant being reported under different HLA nomenclatures.

With the rapid increase in biomedical data on HLA alleles and their associations with multiple diseases, it becomes imperative to create a platform with structured information that can be queried to retrieve relevant information. Current knowledge about HLA is limited to individual papers that can be searched through PubMed, or to reviews in which a subset of studies has been summarised. To date, no database compiles the existing HLA-related information in an organised framework. In the absence of such a repository, resource sharing among researchers and clinicians becomes a big challenge.

The integration of computer sciences with biomedical research has accelerated the progress, both in terms of novel discoveries and data structuring. Natural Language Processing (NLP) is a method to extract relevant information from unstructured data [7, 14, 15, 31]. A simple NLP pipeline contains 4 components: data assembly, pre-processing and normalization, Named Entity Recognition (NER) and Relation Extraction (RE). The output of NLP algorithms, i.e. structured dataset can be used to generate insights via direct interpretation or through downstream analyses. In recent times, NLP methods have started gaining popularity in biological sciences. For instance, Rakhi et al. [27] reported a text mining pipeline to study spice-disease associations and link phytochemicals from different spices/herbs to diseases. Another report by Lee highlights BioBERT [22], a pre-trained biomedical language representation model that can be used for various text mining tasks like NER, RE and question answering, specifically on biomedical datasets. Similarly, PubTator Central [35] is an open access tool available via NCBI that uses text mining algorithms for assisted bio-curation of entities in literature. The tool uses NER to identify and thus highlight six bio-entities viz. Gene, Disease, Chemical, Mutation, Cell Line and Species from abstracts and open access articles available on PubMed. Another interesting report by Kuleshov et al. [21] presents a machine compiled database for studying genotype-phenotype associations generated using applications of text mining on genome-wide association studies (GWAS). All these resources work on similar text mining algorithms, but each has a different set of applications and tasks to perform. The use of these resources as such in addressing the HLA research often overlooks the extent of variability of HLA complex and involved parameters in this domain. 
For instance, PubTator Central can mine gene names from literature but does not pick up HLA allele information; e.g., it highlights only HLA-DRB1 when the user's search query is HLA-DRB1*01:01. Conventional approaches that individually mine the large amount of unstructured literature on HLA research require both manpower and resources. Understanding and integrating the observations from HLA studies requires knowledge of genomic datasets, i.e. diseases, SNPs, drugs, populations and ethnic groups, along with an understanding of the relationships between them. NLP-based text mining is well suited to handling this complexity and producing structured information.

We provide HLA-SPREAD (Fig. 1), a platform for integrated HLA resources developed using NLP to help understand the complexity of this locus. The resource summarizes HLA-related genomics knowledge and supports the design and development of new hypotheses. In this study, we used ~ 28 million publicly available peer-reviewed abstracts. We extracted biomedical entities including HLA alleles, diseases, SNPs, drugs and geographical locations. We also attempted to assign positive and negative relationships between diseases and alleles. This HLA connectivity was then used to address biologically and clinically relevant objectives such as HLA biomarkers and risk and protective alleles for various diseases.

Fig. 1

Workflow of HLA-SPREAD

Construction and content

Data retrieval

MEDLINE, which comprises more than 28 million peer-reviewed articles from over 5600 scholarly journals, was used as the source of biomedical literature. Bulk data were downloaded from the FTP server in XML format. HLA alleles with their nomenclature were downloaded from the IPD-IMGT/HLA database [11]. To maintain uniformity in disease names and their IDs, we used MeSH keywords from UMLS (Unified Medical Language System). Drugs associated with side effects were obtained from SIDER 4.1, the Allele Frequency Net Database (AFND) and PharmGKB [13, 20, 33]. Allele frequencies of HLA alleles were also taken from AFND. Extensive pre-processing was done on all the datasets before they were used in the pipeline.

Pre-processing and keywords dictionary

PubMed parsing

A modified version of PubMed parser was used to extract PMID, title, abstract, publication date, journal, article type and authors’ information from MEDLINE biomedical literature dataset [1]. Only records with the above information were considered for further analysis and stored in a tabular format. All the subheadings in the abstract viz. background, introduction, objective, method, experimental design, result, discussion, importance, setting, design, study objective, patients, participants and conclusion were removed.

Disease dictionary

Mentions of disease keywords were identified using a dictionary created from UMLS 2019MRCONSO.RRF [5]. UMLS is a set of biomedical vocabulary that includes data from OMIM, Gene Ontology, Clinical repositories, Medical Subject Headings (MeSH) and NCBI taxonomy. In this study, we used MeSH descriptors including Entry Term (ET), Main Heading (MH), Preferred Entry term (PEP), Descriptor Sort Version (DSV) and Machine Permutation (PM). Descriptor Entry Version (DEV) was excluded as keywords belonging to this category were incomplete, e.g. abdominal injury was reported as abdominal inj. These descriptors are assigned a unique MeSH ID which is stored in a hierarchical format with 24 head categories along with a unique Descriptor ID. We termed the root form of the disease as level-zero and top-level diseases as level-one for our analysis. Multiple forms of a disease like diabetes insipidus, diabetes mellitus, type 1 diabetes, juvenile-onset diabetes and others are assigned the same MeSH ID. This dataset was also supplemented with keyword variants such as plural and lemmatised forms to increase the search space.
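The descriptor filtering described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the standard pipe-delimited MRCONSO.RRF column layout (SAB, TTY, CODE and STR at positions 11 to 14, counting from zero), and the function name `build_disease_dictionary` is hypothetical.

```python
# Zero-based column positions in a pipe-delimited MRCONSO.RRF row
# (assumed from the standard UMLS layout; verify against your release).
SAB, TTY, CODE, STR = 11, 12, 13, 14

# Term types kept in the text above; DEV (Descriptor Entry Version) is excluded.
KEPT_TERM_TYPES = {"ET", "MH", "PEP", "DSV", "PM"}


def build_disease_dictionary(rows):
    """Map each lower-cased MeSH keyword to its MeSH ID.

    `rows` is any iterable of raw MRCONSO.RRF lines.
    """
    dictionary = {}
    for row in rows:
        fields = row.rstrip("\n").split("|")
        if len(fields) <= STR:
            continue  # skip malformed rows
        if fields[SAB] != "MSH" or fields[TTY] not in KEPT_TERM_TYPES:
            continue  # keep only the selected MeSH term types
        # First occurrence wins if a keyword appears more than once.
        dictionary.setdefault(fields[STR].lower(), fields[CODE])
    return dictionary
```

In practice the rows would be streamed from the MRCONSO.RRF file, and plural or lemmatised variants would be added to the resulting dictionary in a separate step.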

HLA dictionary

Keywords for HLA alleles and their nomenclature were fetched from the centralized repository of the international ImMunoGeneTics project (IMGT) database. IMGT is updated quarterly with the submission or deletion of alleles and their nomenclature, and currently houses 28,320 alleles. Many reports do not follow the conventional HLA allele nomenclature, which makes mapping a strenuous task. To capture as many HLA allele mentions as possible, we created a dataset comprising all possible keyword forms, including forms with special characters removed where required. We also attempted to map all old nomenclature to the current allele names. This dictionary also includes a few generic HLA keywords such as HLA class I, HLA class II, HLA linked and HLA associated. A few alleles based on old nomenclature belong to more than one antigenic group; these were put under a “broad antigen” category. A few haplotypes that combine more than one HLA allele were grouped in a “haplotype” category.
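A minimal sketch of how such spelling variants could be generated and mapped back to a standard allele name is given below. The variant forms follow the HLA-B*13:01 example discussed later in the text; the function names and the exact set of variants are assumptions for illustration only, and concatenated forms can be ambiguous for alleles with three-digit fields.

```python
import re

# Matches colon-delimited names such as HLA-B*13:01 or HLA-DRB1*15:01:01.
ALLELE_PATTERN = re.compile(r"HLA-([A-Z0-9]+)\*(\d+(?::\d+)*)")


def allele_variants(allele):
    """Return common spelling variants of a colon-delimited allele name."""
    match = ALLELE_PATTERN.fullmatch(allele)
    if match is None:
        raise ValueError(f"unrecognised allele name: {allele!r}")
    gene, fields = match.groups()
    digits = fields.replace(":", "")
    return [
        allele,                  # HLA-B*13:01
        f"HLA-{gene}*{digits}",  # HLA-B*1301
        f"{gene}*{digits}",      # B*1301
        f"{gene}(*){digits}",    # B(*)1301
        f"{gene}{digits}",       # B1301
    ]


def build_allele_lookup(alleles):
    """Map every lower-cased variant to its standard allele name."""
    lookup = {}
    for allele in alleles:
        for variant in allele_variants(allele):
            lookup[variant.lower()] = allele
    return lookup
```

A lookup built this way lets a mention such as "B1301" resolve to "HLA-B*13:01" with a single dictionary access.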

Named entity recognition

Keyword matching across abstracts

A Python-based NER pipeline was implemented to filter abstracts using a dictionary matching approach with parallel multiprocessing. Disease and HLA allele keyword dictionaries were used for initial screening. Abstracts were converted to lower case with special characters removed, and if a match was found in either the title or the text, the abstract was split into sentences using the sentence tokenizer of the Python Natural Language Toolkit (NLTK). We encountered considerable variability in disease keywords. Most of it arose from special characters such as (−) and (‘) in the keyword, or from plural and singular forms. To deal with the former, we also kept versions of the sentences in which special characters were not removed; this increased the search space and enabled capture of keywords such as Stevens-Johnson syndrome (Stevens Johnson syndrome) and Graves’ disease (Graves disease). To tackle the latter, our disease dictionary was already enriched with plural and lemmatized forms of keywords. For HLA allele keywords, word boundary-based regex matching was used to search for alleles in the sentences. Sentences with at least one mention of both an HLA allele and a disease keyword were retained for further steps.
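The co-mention filter can be illustrated with a short, dependency-free sketch. It is an assumption-laden stand-in rather than the published pipeline: a naive regex splitter replaces the NLTK sentence tokenizer, lookarounds stand in for word-boundary matching so that keywords containing characters such as "*" are handled, and all function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (stand-in for the NLTK tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def compile_keywords(keywords):
    """Compile keywords into one case-insensitive, boundary-aware pattern."""
    # Longest keywords first, so that longer forms win over their prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    body = "|".join(re.escape(k) for k in ordered)
    return re.compile(rf"(?<!\w)(?:{body})(?!\w)", re.IGNORECASE)


def co_mention_sentences(abstract, allele_keywords, disease_keywords):
    """Return (sentence, alleles, diseases) for sentences mentioning both."""
    allele_re = compile_keywords(allele_keywords)
    disease_re = compile_keywords(disease_keywords)
    hits = []
    for sentence in split_sentences(abstract):
        alleles = allele_re.findall(sentence)
        diseases = disease_re.findall(sentence)
        if alleles and diseases:
            hits.append((sentence, alleles, diseases))
    return hits
```

Because each abstract is processed independently, a function of this shape could be mapped over abstracts with `multiprocessing.Pool`, in line with the parallel design mentioned above.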

Identification of tags: populations, drugs and SNPs


The filtered abstracts were processed using the spaCy NLP tagging algorithm (model: en_core_web_md) to search for mentions of populations in the text. Of the two relevant output tags, GPE (Geo-Political Entities) and NORP (Nationalities Or Religious Groups), we selected keywords carrying the latter, because on biomedical text the GPE tag often reported scientific names of organisms as populations, e.g. Chlamydia spp. and Chlamydomonas spp. were reported under GPE tags. The output was classified into countries and ethnic groups for further analysis with the help of an expert anthropologist. The resulting list was also manually curated to remove plural and inappropriate entries.


Information on drugs with side effects was taken from the SIDER database (SIDER 4.1). We also added 16 drugs from AFND and 26 drugs from PharmGKB whose information was missing in SIDER. The list of drugs was mapped across the dataset to check for occurrences in the selected HLA-related abstracts. In many instances drug names were a subpart of disease keywords, e.g. “insulin” was obtained as a false match wherever it was present as part of the disease name “insulin dependent diabetes mellitus”. A small Python snippet was written to remove such false positives.
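The original snippet is not shown in the text; one plausible way to implement such a filter is sketched below. A drug match is discarded when it lies entirely inside the character span of a disease keyword found in the same sentence. The function name `drug_mentions` and the span-based strategy are assumptions.

```python
import re


def drug_mentions(sentence, drugs, disease_keywords):
    """Return drugs mentioned in `sentence` outside any disease keyword."""
    lowered = sentence.lower()

    # Character spans covered by disease keywords in this sentence.
    disease_spans = []
    for disease in disease_keywords:
        for match in re.finditer(re.escape(disease.lower()), lowered):
            disease_spans.append(match.span())

    found = []
    for drug in drugs:
        pattern = rf"(?<!\w){re.escape(drug.lower())}(?!\w)"
        for match in re.finditer(pattern, lowered):
            inside_disease = any(
                start <= match.start() and match.end() <= end
                for start, end in disease_spans
            )
            if not inside_disease:
                found.append(drug)  # at least one genuine mention
                break
    return found
```

With this approach, "insulin" is ignored in "insulin dependent diabetes mellitus" but is still reported if the same sentence also mentions insulin as a treatment.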


SNP IDs were mapped across abstracts of the HLA dataset using Python's regular expression module. The algorithm iteratively searched for all instances of rsIDs using the regular expression “[rR][sS][0-9]{2,}”. All the tags captured in the sentences of each abstract were stored as lists of strings along with their respective PMIDs to facilitate future access.
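As a small illustration, the stated pattern can be applied with Python's `re` module as follows. The function names, the lower-casing of matches and the PMID-keyed dictionary are assumptions made for this sketch; the second rsID in the usage note is used purely as sample input.

```python
import re

# Pattern as given in the text: "rs" in any case followed by 2+ digits.
RSID_PATTERN = re.compile(r"[rR][sS][0-9]{2,}")


def extract_rsids(text):
    """Return all rsID mentions in `text`, normalised to lower case."""
    return [match.lower() for match in RSID_PATTERN.findall(text)]


def rsids_by_pmid(abstracts):
    """Map each PMID to the list of rsIDs found in its abstract text."""
    return {pmid: extract_rsids(text) for pmid, text in abstracts.items()}
```

For example, `extract_rsids("The SNP rs9277535 and RS2395029 were genotyped")` yields `["rs9277535", "rs2395029"]`. Since the pattern has no word boundary, it can also match inside longer tokens; prefixing it with `\b` is one possible refinement.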

Semantic assessment

N-GRAM evaluation and manual labelling

An N-gram is a contiguous sequence of n items (which can be syllables, letters or words) in a text, used to determine the context of those items in a sentence or paragraph. We used the NLTK functions WordNetLemmatizer, WordPunctTokenizer and CollocationFinder to create a corpus of N-grams (n = 1, 2 and 3) from the abstract dataset. After removal of stop words, which do not add significant meaning to the context, a subset consisting of all reported verb/adverb (n = 1) and adverb-verb (n = 2, 3) combinations above a frequency cut-off was filtered out using Part of Speech (POS) tags of the tokenised words. We observed that N-grams for negative labels often gave misleading information, e.g. “HLA-B27 negative” refers to the absence of the allele rather than a negative association between entities. Hence, we used very stringent criteria for choosing negative labels. Manual annotation of positive and negative labels was then carried out on this dataset, and a total of 1127 labels (Supplementary Table 1) were categorised (1107 positive and 20 negative) for labelling the sentences. We assign a positive label where the HLA allele is positively associated with a disease, so that its presence makes individuals susceptible to the disease, and a negative label where the HLA allele is negatively associated with a disease and hence protective. We also considered negation words such as “not”, “none” and “no” which, if present, can reverse the meaning of a sentence. Instances of these three keyword sets (positive, negative and negation) were iteratively searched in all the sentences. A coding scheme with a binary layout was then constructed to label sentences as positive, negative, negation, complex or ambiguous. Sentences with no match from any of the categories were labelled as others.
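The following sketch illustrates the two ideas in this subsection: counting N-grams after stop-word removal, and a binary coding scheme for sentence labels. It is a simplified stand-in that uses only the standard library instead of NLTK. The stop-word list, the bit values and the mapping from codes to labels are assumptions chosen to be consistent with the label names above.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption, not the NLTK list).
STOP_WORDS = {"the", "a", "an", "of", "in", "with", "is", "was", "and", "to"}


def tokenize(sentence):
    """Lower-case, tokenise and drop stop words."""
    tokens = re.findall(r"[a-z0-9@*:\-]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngram_counts(sentences, n):
    """Count contiguous n-token sequences across all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# One bit per keyword set (assumed layout of the binary coding scheme).
POSITIVE, NEGATIVE, NEGATION = 1, 2, 4
LABELS = {0: "others", 1: "positive", 2: "negative", 3: "complex",
          4: "negation", 5: "ambiguous", 6: "ambiguous", 7: "ambiguous"}


def code_sentence(sentence, positive_terms, negative_terms,
                  negation_terms=("not", "none", "no")):
    """Return the label implied by which keyword sets occur in `sentence`."""
    text = " " + " ".join(re.findall(r"[a-z0-9]+", sentence.lower())) + " "
    code = 0
    if any(f" {term} " in text for term in positive_terms):
        code |= POSITIVE
    if any(f" {term} " in text for term in negative_terms):
        code |= NEGATIVE
    if any(f" {term} " in text for term in negation_terms):
        code |= NEGATION
    return LABELS[code]
```

Frequent N-grams from `ngram_counts` would be reviewed manually to build the positive and negative term lists, which are then passed to `code_sentence`.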

Root-verb and associated adverbs using dependency parsing

Dependency parsing refers to the construction of a tree layout based on the structure of a sentence, where the root node is a verb that describes the relation between different entities of that sentence. Direct application of word tokenization, the first step in dependency parsing, generates multiple tokens for single allele and disease keywords, as shown in Fig. 2. Therefore, to ensure the accuracy of the algorithm, the allele and disease keywords present in each sentence were replaced with @GENE and @DISEASE tags, and a parse tree was then generated using the StanfordCoreNLP Python module (Stanford-corenlp-full-2018-10-05 package). The list of verbs obtained from the root nodes of all the sentences in the dataset was manually curated under positive and negative labels. We also added a category “Studied/Investigatory” for sentences that do not convey any positive or negative context but mention both entities together, e.g. “To investigate the association of HLA-A, B, and DRB1 alleles with leukaemia in the Han population in Hunan province”.
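The tag substitution step can be sketched as below, under the assumption that the matched allele and disease keywords for each sentence are already known from the NER stage. The function name `mask_entities` is hypothetical, and the parser call itself is omitted.

```python
import re


def mask_entities(sentence, allele_keywords, disease_keywords):
    """Replace allele and disease keywords with @GENE and @DISEASE tags."""

    def mask(text, keywords, tag):
        # Replace longer keywords first so that substrings do not pre-empt them.
        for keyword in sorted(keywords, key=len, reverse=True):
            pattern = rf"(?<!\w){re.escape(keyword)}(?!\w)"
            text = re.sub(pattern, tag, text, flags=re.IGNORECASE)
        return text

    masked = mask(sentence, allele_keywords, "@GENE")
    return mask(masked, disease_keywords, "@DISEASE")
```

After masking, "multiple sclerosis" reaches the tokenizer as the single token @DISEASE instead of as an adjective followed by a noun.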

Fig. 2

Tokenization and Dependency parsing: In this example, the keyword “Multiple sclerosis” is tokenized to “Multiple” and “sclerosis” separately with parts of speech Adjective and Noun respectively

Sentence annotation

We termed our sentence-labelling approach a “hybrid approach”, because annotation used both the N-gram labels and the type of root verb. If a sentence had a positive N-gram label and a positive root verb, implying that the entities are associated or linked, the sentence was labelled as positive. The same approach was used for negative labelling. Finally, sentences were grouped into the following categories: 1) Positive, 2) Negative, 3) Both positive and negative, referred to as Complex sentences, 4) Positive/negative + negation, referred to as the Ambiguous group, 5) Investigatory and 6) Others (−).
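A minimal sketch of how the two label sources might be combined is given below. Only the agreement rule (same polarity from the N-gram label and the root verb) is stated in the text; the order of the remaining checks and the fallback to "others" are assumptions made for illustration.

```python
def annotate_sentence(ngram_label, root_verb_label):
    """Combine an N-gram label with a root-verb label into a final category.

    Both arguments are lower-case strings such as "positive", "negative",
    "complex", "ambiguous", "investigatory" or "others".
    """
    # Complex and ambiguous sentences keep their N-gram category (assumption).
    if ngram_label in ("complex", "ambiguous"):
        return ngram_label
    # Stated rule: the N-gram label and the root verb agree in polarity.
    if ngram_label in ("positive", "negative") and ngram_label == root_verb_label:
        return ngram_label
    # Root verbs such as "investigate" or "study" (assumption on precedence).
    if root_verb_label == "investigatory":
        return "investigatory"
    # Everything else, including disagreeing polarities (assumption).
    return "others"
```

Requiring agreement between the two sources trades recall for precision: a sentence is only called positive or negative when both the keyword evidence and the main verb point the same way.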

Database and web server

HLA SPREAD database is built for quick and easy retrieval of information related to HLA genes. The web interface was designed in HTML5, CSS3 & ES6 (JavaScript) and the backend was developed in Laravel 8 (PHP Web Framework) & MySQL for the database. Laravel is a PHP web framework proposed for the development of a web system following the Model View Controller (MVC) architecture. We used D3.js for data visualization and SQL indexing for search table integration. The server was hosted using Apache HTTP (PHP) server. The database uses Relational Database Management System with data stored in the table. JavaScript handles the data visualizations and Laravel handles the search queries, indexing, and the data export section. This web interface is compatible with various devices and browsers except the feature “Show entries” in the search tab is visualised best in Mozilla Firefox. Figure 3 gives insights in using the HLA-SPREAD search.

Fig. 3

HLA-SPREAD search: A screenshot of HLA-SPREAD to assist user in understanding HLA-SPREAD interface and easy retrieval of data

Utility and discussion

Mining Medline literature for HLA association

NLP-based text mining of ~ 28 million publicly available biomedical abstracts yielded 47,049 abstracts with one or more sentences that describe the relationship between HLA alleles and diseases. To understand the distribution of article types among the filtered abstracts, we studied the article type per year from 1975 to 2021 (Fig. 4). Journal articles, comparative studies and review articles were the most numerous every year. In addition, there were papers corresponding to phase I, II, III and IV clinical trials and observational studies, highlighting the importance of this locus in translational studies.

Fig. 4

Nature and trends of HLA related publications in PubMed annually from 1975 onwards: Stacked Bar plot shows distribution of PubMed articles in different categories. a Diverse studies including clinical trials are reported, with maximum numbers represented in the “journal article” category. b A subplot of (a) after removing the most frequent “Journal article” type to visualise the trends in other categories

HLA genes, alleles and its distribution

There are 28,320 alleles in the IMGT database, many of them associated with a disease or pathological condition. There is also great variability in how allele names are written within articles. For example, HLA-B*13:01, a risk factor for dapsone hypersensitivity syndrome in multiple populations, was written as HLA-B*13:01, HLA-B*1301, B*1301, B(*)1301 and B1301 in different papers. To search for an allele and its related information, a user must therefore be aware of all possible formats of writing an allele, encompassing its current and previous nomenclature. For this reason, we converted all existing HLA keywords to a standard allele name. We identified only ~ 1% of the total alleles to be associated with conditions like diseases, graft survival or drug reactions. To represent these alleles in the form of a graph, we collapsed the nomenclature to the two-digit level (Fig. 5). The majority of the studies were on the HLA-B locus, followed by HLA-A and HLA-DRB1, while fewer studies were on the HLA-C locus. Each HLA allele, collapsed to its two-digit form, is linked to the AFND server in the database, highlighting its allele frequency. The focus of our present study was also to understand the semantics between alleles and diseases, wherein we noted that some alleles were reported as protective and some as risk alleles, e.g. reports indicated that HLA-DRB1*15 was protective for HIV and a risk allele for pulmonary tuberculosis [23, 28]. We were also interested in exploring the effects of multiple alleles individually on a single disease. To address this, we listed 54 articles (Supplementary Table 2) highlighting the fact that, for a single disease, different alleles can have contrasting effects, e.g. HLA-DQA1*02:01 and HLA-DQB1*06:02 can be protective in Artemisia pollen-induced allergic rhinitis while HLA-DQA1*03:02 can be a risk factor [34].
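Collapsing standardised allele names to the two-digit (first-field) level can be sketched as follows. The sketch assumes alleles have already been converted to the colon-delimited standard form; concatenated legacy forms such as HLA-B*1301 are deliberately rejected rather than guessed. Function names are hypothetical.

```python
import re
from collections import Counter

# Gene name, "*", then a 2-3 digit first field followed by ":" or end of string.
FIRST_FIELD = re.compile(r"^(HLA-[A-Z0-9]+)\*(\d{2,3})(?::|$)")


def collapse_to_first_field(allele):
    """Return e.g. 'HLA-B*13' for 'HLA-B*13:01', or None if not recognised."""
    match = FIRST_FIELD.match(allele)
    if match is None:
        return None
    return f"{match.group(1)}*{match.group(2)}"


def count_allele_groups(alleles):
    """Tally allele mentions at the first-field level."""
    groups = (collapse_to_first_field(a) for a in alleles)
    return Counter(g for g in groups if g is not None)
```

A tally of this kind is the sort of summary that could feed a chart such as the pie chart of most reported allele groups.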

Fig. 5

The topmost reported HLA alleles associated with diseases: All the HLA alleles indicated have been grouped to their second digit and represented in the pie chart. HLA-A, HLA-B and HLA-DRB1 are the most studied amongst the HLA genes

Exploring diseases, its associated categories and other relevant information

The HLA studies were divided into four broad categories, Diseases, Transplantation, Sign and Symptoms, and Therapeutics/ADRs, to study the information systematically. This grouping was based on the MeSH keywords identified in the abstracts. There are a total of 24 categories for diseases in MeSH, ranging from C01 and C04 through C26. We grouped C23 as “Sign and Symptoms”, C20.452 (GVHD) as part of “Transplantation”, and the rest as disease categories. Keywords falling under E04 (Transplantation procedures) were also grouped under “Transplantation”. For “Therapeutics/ADRs”, we selected only those sentences that mentioned drug keywords, allele names and disease names together. We retained them if they satisfied any of three conditions: 1) the disease belongs to the drug adverse reactions category, 2) the sentence mentions keywords such as reactions or -induced (e.g. carbamazepine-induced), or 3) the disease keyword contains -induced (e.g. drug-induced liver injury). The remaining were grouped as “Diseases”. There are 32,714 abstracts in the Disease category, 10,370 in Transplantation, 8574 in Signs and Symptoms and 429 in ADRs.
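The grouping rules above can be expressed as a small decision function. This is an illustrative sketch only: the precedence of the rules, the `is_adr_category` flag and the function name are assumptions, and the tree numbers in the usage note are hypothetical apart from the prefixes named in the text.

```python
import re

# Cue words for rule 2 ("reactions", "-induced"), matched case-insensitively.
ADR_CUES = re.compile(r"reaction|-induced", re.IGNORECASE)


def categorise(tree_number, sentence, disease_keyword, drugs_in_sentence,
               is_adr_category=False):
    """Assign one of the four broad study categories to a sentence.

    `tree_number` is the MeSH tree number of the disease keyword and
    `drugs_in_sentence` is the list of drug names found in the sentence.
    """
    # Therapeutics/ADRs: a drug is present and any of the three conditions holds.
    if drugs_in_sentence and (
        is_adr_category
        or ADR_CUES.search(sentence)
        or "-induced" in disease_keyword.lower()
    ):
        return "Therapeutics/ADRs"
    if tree_number.startswith(("C20.452", "E04")):
        return "Transplantation"
    if tree_number.startswith("C23"):
        return "Sign and Symptoms"
    return "Diseases"
```

For example, a sentence about carbamazepine-induced Stevens-Johnson syndrome with "carbamazepine" in `drugs_in_sentence` is assigned to "Therapeutics/ADRs" regardless of its tree number, whereas a keyword under C23 with no drug mention falls under "Sign and Symptoms".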

To study the association with diseases, we analysed data from the “Diseases” and “Transplantation” categories. Inconsistency in writing disease names increases the effort needed to search for a specific query. To reduce this variability, MeSH IDs were used to summarise the obtained information, e.g. terms like tumour, cancer, malignancy and neoplasm (malignant and benign) were mapped to a single entity, malignancy (D009369). Collapsing a large number of similar keywords to a single ID reduces the complexity of searching for articles related to particular diseases. We observed a total of 3661 different disease terms mapping to 1904 unique MeSH IDs. Figure 6 represents a snapshot of common HLA-associated diseases. To examine the disease associations, we mapped them to level-one (level-zero) terms. Diabetes Mellitus Type 1, Rheumatoid Arthritis, Multiple Sclerosis (Autoimmune Disease), Melanoma and Leukemia (Neoplasms by Histologic Type), Psoriasis (Skin disease) and Celiac Disease (Metabolic) were the topmost HLA-associated diseases. In the analysed abstracts, the list of HLA-associated diseases/conditions indicates that some diseases were very frequently reported, whereas others such as Down syndrome, Guillain-Barre Syndrome and Polymyalgia Rheumatica were infrequently or rarely reported. Supplementary Table 3 represents the distribution of both common and less explored HLA-associated diseases.

Fig. 6

Diseases/conditions associated with HLA genes: Graph represents a three-level hierarchy of diseases. Each colour represents a level. There are 24 major categories, represented in green, which are further divided into subcategories. Each disease name is matched to its MeSH ID and a normalised MeSH keyword. Autoimmune, Neoplasms and Joint disease are the topmost associated diseases. As anticipated, significant numbers of studies related to transplantation are also observed

To get an overall perspective of genes and diseases, we considered the diseases at level-one along with the HLA gene. We observed the majority of reported associations with HLA-DRB1, followed by HLA-B and HLA-A (Fig. 7). We also listed details of individual allele-disease pairs for more information (Supplementary Table 4). HLA-DRB1 was reported to be linked with disease conditions like malignancies, rheumatoid arthritis, type 1 diabetes, multiple sclerosis and 1052 other diseases. HLA-B association was reported with spondylitis, polyarthritis, uveitis, sacroiliitis, psoriasis and 779 other diseases, and HLA-A was reported to be associated with malignancies, melanoma, influenza, breast cancer and 654 other diseases. The analysis also takes into consideration the diseases that require transplantation and the complications associated with it, both pre- and post-transplantation. As anticipated, we observed that individuals suffering from beta thalassemia and sickle cell anaemia (genetic and congenital disorders), multiple myeloma (an immunoproliferative disorder) and liver injury underwent transplantations of bone marrow, hematopoietic stem cells and renal tissue. The transplantation data also included additional details such as the disease history of patients before transplantation, e.g. psoriasis, Graves’ disease and diabetic neuropathy, and post-transplantation complications, e.g. ischemia, necrosis, fibrosis and haemorrhage. Such information, collated on one platform, may be of interest to clinicians designing therapy modules. Supplementary Table 5 represents details of transplantation-related studies.

Fig. 7

Heatmap of HLA Disease associations: The gradient heat map representing the number of diseases associated with HLA genes. First column represents generic “HLA” studies where specific gene information is not mentioned. A large number of associations were also observed with Non-classical (HLA-E,F,G) genes

SNPs and HLA diseases

HLA loci have a repertoire of genetic variations, a large number of which have been linked to multiple diseases via genome-wide association studies (GWAS). Though GWAS list information about SNPs in or associated with HLA genes, a number of genetic variation studies go unnoticed, either because they are small cohort analyses or because they are not compiled in a single resource for systematic study. Thus, to include the overlooked studies and missing information, this analysis reports information from all kinds of studies and includes abstracts mainly from journal articles, reviews, meta-analyses, letters and clinical trials. To acquire robust data, we retained only those HLA variations that are present in sentences along with the disease and allele keywords. We identified 1543 unique SNP mentions, and their details are compiled in Supplementary Table 6. The majority of SNPs mapped to intronic variants, followed by missense and intergenic variants. Figure 8 represents the genomic distribution of the mapped SNPs. A substantial number of variations also mapped to genes other than HLA, indicating that they may be in Linkage Disequilibrium (LD) or may frequently occur in conditions like transplantation success or ADRs (for example, [8]). We observed top hits of SNPs mapping to infectious diseases like HIV and hepatitis, inflammatory conditions like psoriasis, complex diseases like asthma and diabetes, and hypersensitivity largely attributed to drug ADRs. SNP association studies are also often based on a proxy SNP, which can be in LD with the causal variant, and LD values vary from one population to another. To address this, we also added population information of the studies whenever available in the abstract. The most studied SNP, rs9277535, associated with hepatitis B virus, has been studied across a large number of populations from Asian and central Asian countries like China, Japan, Asia, Turkey, Korea and Indonesia.

Fig. 8

Genomic distribution of SNPs: Pie chart representing the number of variations in genic region with majority of them mapping to introns

Geographical spread of HLA literature across various ethnic groups and populations

Genetic differences in HLA genes across populations and their link with biological conditions make it imperative to consider geographical information while studying HLA association with a particular condition. We assumed that the population/ethnic groups name might not be present in the same sentences that mention HLA and disease, so we used a flexible approach here and fetched the names of geographical locations present anywhere in the abstracts. In total, we reported 149 unique NORP tags which were binned into 109 country-based populations and 40 ethnic groups. Figure 9 represents the frequency distribution of these matched populations belonging to the countries and ethnic groups. Japan, China, USA, India and Italy are the major countries where the HLA gene-disease association studies have been reported with disease groups as shown in Supplementary Table 7. Along with this, the European subcontinent has been extensively studied (1296 reports) as a major ethnic group. Apart from frequently studied areas, we also observed locations like New Zealand, Armenia and Sri Lanka that have a low number of reported studies. This type of analysis can help researchers understand not only the extent of allele-disease associations among populations in the context of these immune players but also the scope of research in their selected geographical location while planning their hypothesis.

Fig. 9

Geographical spread of HLA studies: Identified geographical locations are binned to the nearest (a) country and (b) ethnic group. The colour gradient represents the count of various HLA alleles with respect to disease or ADR studies. China, Japan and the USA report the maximum number of studies, and European, Asian and African are the most studied ethnic groups. This figure was generated from the data analysed in HLA-SPREAD. Figure (a) was created using the Maps options in the Tableau software and figure (b) was created using the “treemap” package in R

Response to therapeutics

HLA genes are known to be associated with various hypersensitivities and drug reactions, a few of which, like Stevens-Johnson syndrome, can be life-threatening. Because alleles differ at the individual and population level, these hypersensitivities vary, and studying these pharmacogenetic markers together with population information becomes important. For instance, we observed from our data that HLA-A*31:01 is associated with carbamazepine-induced Stevens-Johnson syndrome in the European population, while HLA-B*15:02 is associated with it in Chinese and Indian populations. A meta resource like HLA-SPREAD can help understand such population-wise differences, which obstruct the design of therapy modules for ADRs/hypersensitivities. To be more specific, this analysis focuses on drugs that are present in sentences along with the disease and allele keywords. We observed a total of 7017 unique abstracts mentioning 506 unique drugs, of which 163 mapped to the ADR category. Details of drugs and related information are listed in Supplementary Table 8. We also validated our results against AFND, a manually curated database that has information about ADRs, and PharmGKB (Fig. 10a and b). Out of the 163 drugs present, we were able to find 30 in common with AFND and 44 in common with PharmGKB. One of the drugs mentioned in AFND, “Valproic acid”, was not present in the actual cited article, and 11 drugs in AFND and 26 drugs in PharmGKB could not be captured because of the stringent criteria of drug mapping, i.e. the drug name should be present in the sentence along with the disease and allele keywords. Figure 10c lists the frequency-based distribution of the top 20 drugs fetched from our analysis. Interestingly, we also observed 133 and 119 drugs that are not mentioned in the AFND database and PharmGKB respectively, e.g. the HLA-B*38:02:01 allele was found to predict carbimazole/methimazole-induced agranulocytosis, and HLA-DRB1 was associated with azathioprine-induced pancreatitis in IBD patients.
This analysis highlights, how one can miss information apart from the time and manpower intensive nature in manual curation. The details of common and exclusive drugs in comparison with AFND and PharmGKB is listed in Supplementary Table 9.
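The sentence-level mapping criterion and the comparison against a curated drug list can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the keyword sets, the allele pattern and the function names `drugs_in_sentence` and `compare_with_reference` are all hypothetical, chosen only to show why a drug mentioned outside the allele/disease sentence is not captured.

```python
import re

# Hypothetical keyword lists; a real pipeline would use curated dictionaries.
DRUGS = {"carbamazepine", "abacavir", "allopurinol"}
DISEASES = {"stevens-johnson syndrome", "hypersensitivity"}
# Illustrative pattern for colon-delimited allele names such as HLA-B*15:02.
ALLELE_RE = re.compile(r"HLA-[A-Z]+[0-9]*\*\d{2}(?::\d{2,3})*")


def drugs_in_sentence(sentence):
    """Return drugs that share a sentence with both an allele and a disease term."""
    text = sentence.lower()
    if not ALLELE_RE.search(sentence):
        return set()
    if not any(disease in text for disease in DISEASES):
        return set()
    return {drug for drug in DRUGS if drug in text}


def compare_with_reference(mined, reference):
    """Split drug names into shared, mined-only and reference-only sets."""
    mined_l = {d.lower() for d in mined}
    ref_l = {d.lower() for d in reference}
    return {
        "common": mined_l & ref_l,
        "only_mined": mined_l - ref_l,
        "only_reference": ref_l - mined_l,
    }
```

Under this strict rule, a sentence that names a drug but no allele yields nothing, which mirrors how drugs listed in a curated resource can be missed when the abstract mentions them in a different sentence.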

Fig. 10

Statistics of drug-related HLA studies: a Comparison of ADRs identified using HLA-SPREAD with AFND. b Comparison of ADRs identified using HLA-SPREAD with PharmGKB. c Bar plot showing the top 20 drugs identified

Insights from HLA-SPREAD: biomarker analysis

We demonstrate the usability of the database for addressing clinically relevant queries. Multiple questions on the identification of HLA alleles and diseases linked with hypersensitivity, allergy, genetic markers, prognosis and diagnosis can be addressed using HLA-SPREAD. As an example, we present an analysis to identify biomarkers in HLA studies. To address this question, we used an n-gram based approach to identify the keywords most frequently occurring with “marker” in the sentences. The most frequent keywords identified were biomarker, genetic marker, HLA marker, predictive marker, prognostic marker, risk marker and susceptibility marker. We checked the details of such sentences and compiled the information (Supplementary Table 10). A few of them, like abacavir hypersensitivity and SJS, were present in multiple papers. HLA-G and HLA-E were also reported to be markers for conditions like tumour, transplantation and heart diseases.
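A minimal Python sketch of such an n-gram count is given below. It assumes a simple bigram scheme (the word immediately before “marker”) and a naive tokenizer; the function name `marker_bigrams` and the example sentences are hypothetical and do not come from the HLA-SPREAD code.

```python
import re
from collections import Counter


def marker_bigrams(sentences):
    """Count which word immediately precedes 'marker'/'markers' in each sentence."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenizer that keeps allele-like tokens (e.g. hla-b*57:01) intact.
        tokens = re.findall(r"[a-z0-9*:-]+", sentence.lower())
        for prev, word in zip(tokens, tokens[1:]):
            if word in ("marker", "markers"):
                counts[f"{prev} marker"] += 1
        # Single-token forms such as "biomarker" are counted as their own keyword.
        for token in tokens:
            if token in ("biomarker", "biomarkers"):
                counts["biomarker"] += 1
    return counts


example = [
    "HLA-B*57:01 is a genetic marker for abacavir hypersensitivity.",
    "HLA-G may serve as a prognostic marker in tumours.",
    "HLA-B27 is a well-known genetic marker.",
    "Soluble HLA-G is a promising biomarker.",
]
top_keywords = marker_bigrams(example).most_common()
```

Ranking the resulting counts surfaces the dominant marker types, after which the matching sentences can be inspected manually, as described above.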


In summary, we collated all the HLA associations from ~ 28 million publicly available abstracts and observed an increasing trend in review/association articles since 1975. We also observed articles from clinical trial phases I, II, III and IV. One of the key highlights of the analysis is that we were able to reduce the complexity of HLA nomenclature by converting the old nomenclature in the literature to the current format. This can facilitate the understanding of multiple studies across years and populations. The HLA-SPREAD database also provides access to the worldwide allele frequency distribution across populations. We were also able to consolidate all the HLA studies into four categories: 1) Disease associations, 2) ADRs, 3) Transplantation and 4) Signs and symptoms. We listed the ADRs across populations and identified HLA alleles used as biomarkers. Towards the end of the work, we also addressed the semantics of the associations, i.e. whether the HLA allele is protective or confers susceptibility in a disease/ADR association. This is a one-of-its-kind effort to integrate the diversity of HLA information into a structured format for ease of query and analysis. It could also provide an informative resource for non-HLA specialists initiating new studies in populations and diseases.
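As an illustration of the nomenclature conversion, the following Python sketch inserts colon delimiters into older, undelimited allele names (for example HLA-B*1502, as written in some earlier literature, becomes HLA-B*15:02). It is only a sketch under a simplifying assumption, namely that every field has exactly two digits; the regular expression and the function name `to_colon_format` are hypothetical. Some alleles were renamed rather than merely re-delimited when the colon format was introduced, so a full conversion would need the official conversion tables instead of this rule.

```python
import re

# Old-style names: optional "HLA-" prefix, gene, "*", 4/6/8 digits, optional suffix letter.
OLD_NAME_RE = re.compile(r"^(HLA-)?([A-Z]+\d*)\*(\d{4}|\d{6}|\d{8})([A-Z]?)$")


def to_colon_format(name):
    """Insert colons between two-digit fields, e.g. HLA-B*1502 -> HLA-B*15:02."""
    match = OLD_NAME_RE.match(name.strip())
    if match is None:
        # Already colon-delimited, serological (e.g. HLA-B27) or unrecognised: leave as-is.
        return name
    prefix, gene, digits, suffix = match.groups()
    fields = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return f"{prefix or ''}{gene}*{':'.join(fields)}{suffix}"
```

Names that do not match the old pattern are returned unchanged, so the function can be applied to mixed text without altering alleles that are already in the current format.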

Availability of data and materials

Database name: HLA-SPREAD.

Database homepage:

Browser requirement: JavaScript should be enabled; we recommend the Firefox web browser for an optimal experience.

Datasets used:

1) IPD-IMGT/HLA for HLA alleles:

2) Medline for PubMed abstracts:

3) SIDER for drug names:

4) PharmGKB for drug names:

5) AFND for HLA ADR drugs:




HLA: Human leukocyte antigen

NLP: Natural language processing

ADR: Adverse drug reaction

AFND: Allele Frequency Net Database

UMLS: Unified Medical Language System


  1. Achakulvisut T, Acuna D, Kording K. Pubmed parser: a Python parser for PubMed open-access XML subset and MEDLINE XML dataset XML dataset. J Open Source Softw. 2020;5(46):1979.

  2. Andeweg SP, Keşmir C, Dutilh BE. Quantifying the impact of human leukocyte antigen on the human gut microbiome [preprint]. Bioinformatics. 2020.

  3. Ayuk F, Beelen DW, Bornhäuser M, Stelljes M, Zabelina T, Finke J, et al. Relative impact of HLA matching and non-HLA donor characteristics on outcomes of allogeneic stem cell transplantation for acute myeloid leukemia and myelodysplastic syndrome. Biol Blood Marrow Transplant. 2018;24(12):2558–67.

  4. Blackwell JM, Jamieson SE, Burgner D. HLA and infectious diseases. Clin Microbiol Rev. 2009;22(2):370–85.

  5. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(90001):267D–270.

  6. Buhler S, Sanchez-Mazas A. HLA DNA sequence variation among human populations: molecular signatures of demographic and selective events. PLoS One. 2011;6(2):e14643.

  7. Choi W, Choi C-H, Kim YR, Kim S-J, Na C-S, Lee H. HerDing: herb recommendation system to treat diseases using genes and chemicals. Database. 2016;2016:baw011.

  8. de Bakker PIW, McVean G, Sabeti PC, Miretti MM, Green T, Marchini J, et al. A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat Genet. 2006;38(10):1166–72.

  9. Ferrell PB, McLeod HL. Carbamazepine, HLA-B*1502 and risk of Stevens–Johnson syndrome and toxic epidermal necrolysis: US FDA recommendations. Pharmacogenomics. 2008;9(10):1543–6.

  10. Fricke-Galindo I, LLerena A, López-López M. An update on HLA alleles associated with adverse drug reactions. Drug Metab Pers Ther. 2017;32(2).

  11. Robinson J, Barker DJ, Georgiou X, Cooper MA, Flicek P, Marsh SGE. IPD-IMGT/HLA Database. Nucleic Acids Res. 2019;gkz950.

  12. Gomez A, Luckey D, Yeoman CJ, Marietta EV, Berg Miller ME, Murray JA, et al. Loss of sex and age driven differences in the gut microbiome characterize arthritis-susceptible *0401 mice but not arthritis-resistant *0402 mice. PLoS One. 2012;7(4):e36095.

  13. Gonzalez-Galarza FF, McCabe A, dos Santos EJM, Jones J, Takeshita L, Ortega-Rivera ND, et al. Allele frequency net database (AFND) 2020 update: gold-standard data classification, open access genotype data and new query tools. Nucleic Acids Res. 2019:gkz1029.

  14. Jensen K, Panagiotou G, Kouskoumvekaki I. NutriChem: a systems chemical biology resource to explore the medicinal value of plant-based foods. Nucleic Acids Res. 2015;43(D1):D940–5.

  15. Jensen LJ, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet. 2006;7(2):119–29.

  16. Kanga U, Mehra NK, Larrea CL, Lardy NM, Kumar A, Feltkamp TEW. Seronegative Spondyloarthropathies and HLA-B27 subtypes: a study in Asian Indians. Clin Rheumatol. 1996;15(S1):13–8.

  17. Khan MA. HLA-B27 and its pathogenic role. J Clin Rheumatol. 2008;14(1):50–2.

  18. Khan MA, Mathieu A, Sorrentino R, Akkoc N. The pathogenetic role of HLA-B27 and its subtypes. Autoimmun Rev. 2007;6(3):183–9.

  19. Klimenta B, Nefic H, Prodanovic N, Jadric R, Hukic F. Association of biomarkers of inflammation and HLA-DRB1 gene locus with risk of developing rheumatoid arthritis in females. Rheumatol Int. 2019;39(12):2147–57.

  20. Kuhn M, Letunic I, Jensen LJ, Bork P. The SIDER database of drugs and side effects. Nucleic Acids Res. 2016;44(D1):D1075–9.

  21. Kuleshov V, Ding J, Vo C, Hancock B, Ratner A, Li Y, et al. A machine-compiled database of genome-wide association studies. Nat Commun. 2019;10(1):3341.

  22. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (Oxford, England). 2020;36(4):1234–40.

  23. Li C-P, Zhou Y, Xiang X, Zhou Y, He M. Relationship of HLA-DRB1 gene polymorphism with susceptibility to pulmonary tuberculosis: updated meta-analysis. Int J Tuberc Lung Dis. 2015;19(7):841–9.

  24. Mosaad YM. Clinical role of human leukocyte antigen in health and disease. Scand J Immunol. 2015;82(4):283–306.

  25. Niehrs A, Altfeld M. Regulation of NK-cell function by HLA class II. Front Cell Infect Microbiol. 2020;10:55.

  26. Petersdorf EW. Which factors influence the development of GVHD in HLA-matched or mismatched transplants? Best Pract Res Clin Haematol. 2017;30(4):333–5.

  27. Rakhi NK, Tuwani R, Mukherjee J, Bagler G. Data-driven analysis of biomedical literature suggests broad-spectrum benefits of culinary herbs and spices. PLoS One. 2018;13(5):e0198030.

  28. Ranasinghe S, Cutler S, Davis I, Lu R, Soghoian DZ, Qi Y, et al. Association of HLA-DRB1-restricted CD4+ T cell responses with HIV immune control. Nat Med. 2013;19(7):930–3.

  29. Sawal N, Kanga U, Shukla G, Goyal V, Srivastava AK. Stevens-Johnson syndrome triggered by Levetiracetam—caution for use with carbamazepine. Seizure. 2020;80:63–4.

  30. Saxena A, Suzuki S, Mourya M, Shiina T, Kanga U. Novel and extended HLA class I and II alleles encountered in Kashmiri Brahmin population from North India. HLA. 2020;96(4):487–9.

  31. Sfakianaki P, Koumakis L, Sfakianakis S, Iatraki G, Zacharioudakis G, Graf N, et al. Semantic biomedical resource discovery: a natural language processing framework. BMC Med Inform Decis Mak. 2015;15(1):77.

  32. Shiina T, Hosomichi K, Inoko H, Kulski JK. The HLA genomic loci map: expression, interaction, diversity and disease. J Hum Genet. 2009;54(1):15–39.

  33. Thorn CF, Klein TE, Altman RB. PharmGKB: the pharmacogenomics knowledge base. Methods Mol Biol (Clifton, N.J.). 2013;1015:311–20.

  34. Wang M, Xing Z-M, Yu D-L, Yan Z, Yu L-S. Association between HLA class II locus and the susceptibility to Artemisia pollen-induced allergic rhinitis in Chinese population. Otolaryngol Head Neck Surg. 2004;130(2):192–6.

  35. Wei C-H, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res. 2019;47(W1):W587–93.

  36. Xu H, Yin J. HLA risk alleles and gut microbiome in ankylosing spondylitis and rheumatoid arthritis. Best Pract Res Clin Rheumatol. 2019;33(6):101499.


The authors acknowledge Dr. Yatender Kumar (NSIT) for permitting AK to work on this project. We also acknowledge Mr. Praveen Sinha for designing the HLA-SPREAD webpage, Dr. Debasis Dash, CSIR-IGIB, for critical review of the work, Dr. Ganesh Bagler and Rudransh Tunwani from IIITD for discussions on NLP, Dr. Ganganath Jha from Hazaribagh University for QC of the population curation and Malika Seth for QC of the semantic annotations. The authors would also like to acknowledge Mr. Raghunandanan MV and Mr. Amit Khulve at CSIR-IGIB for IT support.


This work is funded by COE M/o AYUSH grant GAP0183. The funding body played no role in the design of the study, in the collection, analysis and interpretation of data, or in writing the manuscript. The fellowship of DD was awarded by the Department of Biotechnology (DBT).

Author information

Authors and Affiliations



MM and DD designed the study. DD and AK performed the analysis and interpretation of data. MM, DD and AK co-wrote the manuscript. BRM developed the database. UK helped in HLA interpretation and manuscript writing. The author(s) read and approved the final manuscript.

Corresponding authors

Correspondence to Dhwani Dholakia or Mitali Mukerji.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Table 1. List of positive and negative N-gram labels (adverbs/verbs) used for semantic analysis.

Table 2. Catalogue of susceptible and risk alleles influencing same diseases.

Table 3. Number of articles in broad categories.

Table 4. HLA gene disease associations.

Table 5. List of transplantation, its associated diseases along with pre and post complications information.

Table 6. Details of SNPs annotated based on ensemble.

Table 7. Geographical entities and their disease associations count.

Table 8. HLA alleles and Adverse drug reactions.

Table 9. Comparison of ADR drugs present in HLA-SPREAD with AFND and PharmGKB.

Table 10. HLA indication as biomarkers.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Dholakia, D., Kalra, A., Misir, B.R. et al. HLA-SPREAD: a natural language processing based resource for curating HLA association from PubMed abstracts. BMC Genomics 23, 10 (2022).
