HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data
© Nariai et al.; licensee BioMed Central Ltd. 2015
Published: 21 January 2015
Human leucocyte antigen (HLA) genes play an important role in determining the outcome of organ transplantation and are linked to many human diseases. Because of the diversity and polymorphisms of HLA loci, HLA typing at high resolution is challenging even with whole-genome sequencing data.
We have developed a computational tool, HLA-VBSeq, to estimate the most probable HLA alleles at full (8-digit) resolution from whole-genome sequence data. HLA-VBSeq simultaneously optimizes read alignments to HLA allele sequences and abundance of reads on HLA alleles by variational Bayesian inference. We show the effectiveness of the proposed method over other methods through the analysis of predicting HLA types for HLA class I (HLA-A, -B and -C) and class II (HLA-DQA1,-DQB1 and -DRB1) loci from the simulation data of various depth of coverage, and real sequencing data of human trio samples.
HLA-VBSeq is an efficient and accurate HLA typing method using high-throughput sequencing data without the need of primer design for HLA loci. Moreover, it does not assume any prior knowledge about HLA allele frequencies, and hence HLA-VBSeq is broadly applicable to human samples obtained from a genetically diverse population.
HLA loci on chromosome 6p21.3 are one of the most diverse and polymorphic region in the human genome, and the IMGT/HLA database release 3.15.0 currently containes 10,691 allele sequences . HLA class I molecules present endogenous antigens to CD8+ (cytotoxic) T cells, whereas HLA class II molecules present exogenous antigens to CD4+ (helper) T cells . HLA matching of classical class I loci (HLA-A, -B and -C) and three of class II loci (HLA-DQA1, -DQB1 and -DRB1) between a donor and patient lowers risks of acute graft-versus-host disease in unrelated haematopoietic stem cell transplantation  or organ transplantation . Specific alleles of class I loci have been found to be associated with the rate of progression from human immunodefinciency virus type 1 (HIV-1) infection to the acquired immunodefiniency syndrome (AIDS) . HLA-DR and -DQ loci have been found to be associated with autoimmune diseases, such as in type I diabetes , narcolepsy  and multiple sclerosis . Hence, a method to determine HLA types accurately and conveniently is needed for both clinical practices and basic research.
Conventionally, HLA types have been determined at 2-digit resolution (e.g., A*01), which approximates the serological antigen groupings. More recently, sequence specific oligonucleotide probes (SSOP) method has been used for HLA typing at 4-digit resolution (e.g., A*01:01), which can distinguish amino acid differences . Currently, targeted DNA sequencing for HLA typing  is the most popular approach for HLA typing over other conventional methods. Since the sequence-based approach directly determines both coding and non-coding regions, it can achieve HLA typing at 6-digit (e.g., A*01:01:01) and 8-digit (e.g., A*01:01:01:01) resolution, respectively. HLA typing at the highest resolution is desirable to distinguish existing HLA alleles from new alleles or null alleles from clinical perspective .
Because of recent improvement and cost reduction of the next generation sequencers (NGSs), several methods have been proposed to predict HLA types from high-throughput sequencing data. Seq2HLA  predicts 2-digit HLA types from RNA-Seq data, but is not designed for 4-digit HLA typing. HLAminer  predicts HLA types at 4-digit resolution based on the best alignment score of reads to the reference HLA allele sequences, whereas the most recently proposed PHLAT  predicts HLA types at 6-digit resolution based on the likelihood score considering SNP sites against reference sequences. Since there exists uncertainty in read alignments to highly homologous HLA allele sequences, accurate HLA typing at high-resolution is still challenging.
We have developed a computational method, HLA-VBSeq, to estimate HLA types effectively and accurately at 8-digit resolution from whole genome sequencing data. In the first step of the HLA-VBSeq pipeline, read sequences are aligned to the reference genomic sequences of the registered HLA allelles in the IMGT/HLA database, in which multiple hits are allowed. Then, HLA-VBSeq optimizes both read alignments to the HLA allele sequences and relative quantities of reads on HLA alleles simultaneously under a statistical framework by variational Bayesian inference. Our approach considers all the possible alignments of reads to HLA allele sequences, and calculates the marginal likelihood of data from gapped alignments of reads to the reference sequences, in which deletions and insertions as well as SNP sites are naturally considered. In our Bayesian approach, an optimal set of HLA allele sequences is estimated for accurate HLA typing. We apply HLA-VBSeq to the simulation data of 5x, 10x, 20x and 30x coverage and compare prediction performance of our method with those of PHLAT and HLAminer. We also apply HLA-VBSeq to the whole genome sequencing data of a CEU trio to show the effectiveness of the proposed method.
Optimization of read alignments to HLA allele sequences
where C is a constant, , T is the number of HLA alleles considered, and α t > 0 is the hyperparameter, which controls the complexity of model parameters. In our method, we use the uniform hyperparameter α0 and it is selected as a maximizer of the log marginal likelihood of the observed data.
Our goal is to estimate the posterior distributions of θgiven the data. However, this requires integrals over hidden variables and is intractable to compute in closed form. Hence, we use variational Bayesian (VB) inference to obtain the approximation of the full posterior distribution by assuming the factorization of latent variables and model parameters . In the variational Bayesian E (VBE) step, for read n, an expected read count generated from the HLA allele t is calculated based on the current estimate of the abundance parameter as . In the variational Bayesian M (VBM) step, the read abundance on each allele is calculated based on the current estimate of the expected read counts by , where Each step is iterated until a convergence criterion is satisfied (when the read quantities on HLA alleles are no longer updated). Update equations in each step can be calculated similarly as described in the previous work .
HLA typing from the optimized read alignment on HLA alleles
After the inference algorithm converges, HLA types are predicted based on the expected number of reads assigned to each allele. Because there exist sequencing errors (substitutions, deletions and insertions against reference sequences) and alignment errors, a threshold for the depth of coverage on HLA alleles is set. In our analysis, we set the threshold as 20% of the depth of coverage (i.e., if the data is 30x on average, then we use 6x for a threhold). For each HLA locus, a diplotype is decided as follows:
• If there is no allele that passes the threshold, then it is considered that there are not enough reads to identify a correct HLA type, and hence no allele is called.
• If there is only one allele that passes the threshold, and the depth of coverage is more than or equal to twice as that of the threshold, then the HLA locus is considered to be homozygous of that HLA allele. If the depth of coverage is less than twice as that of the threshold, then the allele is called as heterozygous.
• If there are two or more alleles that pass the threshold, then alleles are sorted according to the depth of coverage from high to low. The top two alleles are selected as candidates of HLA types. If the depth of coverage of the top one is more than twice as that of the second one, then the HLA locus is set as homozygous of the top one. Otherwise, the HLA locus is predicted as a diplotype of the top and second one.
Performance measure of HLA typing
Prediction performance is evaluated in terms of the prediction accuracy. In our analysis, the prediction accuracy is defined as the fraction of true positive predictions among the true HLA types. In this simulation experiment, two HLA alleles (either heterozygous or homozygous) for six HLA loci (HLA-A, -B, -C, -DQA1, -DQB1 and -DRB1), or 12 HLA alleles in total, are evaluated for each individual. The prediction performance is evaluated separately at the 2-digit, 4-digit, 6-digit and 8-digit resolution for each method.
Results and discussion
Simulation data analysis
First, we evaluate the performance of predicting HLA types with HLA-VBSeq compared to other methods in simulation data analysis. From the comparative study recently published , we chose PHLAT and HLAminer as comparable methods that can type HLA class I (HLA-A, -B and -C) and class II (HLA-DQA1, -DQB1 and-DRB1) loci at 4-digit resolution from whole-genome sequencing data. We prepared the simulation data of 1,000 human samples, whose HLA diplotypes for the six HLA loci were randomly chosen from the registered HLA alleles in the IMGT/HLA database release 3.15.0. Once HLA types are fixed for each individual, one SNP per 1,000 bp is incorporated in the HLA allele sequences for each individual, which is based on the average base diversity in the human genome . Then, 100 bp paired-end read data (5x, 10x, 20x and 30x), whose mean and standard deviation of the fragment length distribution were set as 300 bp and 40 bp, respectively, are generated with 0.1% substitution, deletion and insertion errors uniformly across HLA allele sequences.
Prediction accuracy of HLA-VBSeq and existing tools for HLA typing in the 30x simulation data analysis.
Real data analysis
Predicted HLA types of the CEU trio samples with HLA-VBSeq.
Predicted HLA types
Predicted HLA types of HLA-DQA1, -DQB1 and -DRB1 loci with HLA-VBSeq were coincided with those with PHLAT at 6-digit resolution except DQA1*01:01:02 (one allele in NA12878 and two alleles in NA12892). PHLAT instead predicted them as DQA1*01:01:01, whose genomic sequence was missing in the IMGT database release 3.15.0, and hence HLA-VBSeq could not predict the HLA type in our experimental condition.
HLA-VBSeq is an efficient and accurate HLA typing method using whole-genome sequencing data without the need of primer design for HLA loci or prior knowledge of HLA allele frequencies. Although we have evaluated the prediction performance with HLA-VBSeq using the simulation data of various depth of coverage and real data of whole-genome sequencing data, other high-throughput sequencing data, such as from target-sequencing can be utilized with minor modifications in the pipeline. However, we should bear in mind that because of the complexity of HLA loci and a fairly polymorphic nature of each locus, off-target sequences are often obtained by the target-sequencing approach, such as from pseudogenes . As population-wide sequencing data becomes available, HLA-VBSeq can be easily applied for HLA typing at any HLA loci, which will be useful for association studies to identify links to phenotypes, as well as for clinical works such as donor-recipient matching. Since the genomic sequences registered in the IMGT/HLA database are still not complete, there is room for improvement in predicting HLA types with HLA-VBSeq in the future.
Availability of supporting data
The implementation of HLA-VBSeq and the documentation is available in the website, http://nagasakilab.csml.org/hla
This work was supported (in part) by MEXT Tohoku Medical Megabank Project. All computational resources were provided by the Supercomputing services, Tohoku Medical Megabank Organization, Tohoku University.
The publication costs for this article were partly funded by MEXT Tohoku Medical Megabank Project.
This article has been published as part of BMC Genomics Volume 16 Supplement 2, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S2
- Robinson J, Halliwell JA, McWilliam H, Lopez R, Parham P, Marsh SG: The IMGT/HLA database. Nucleic acids research. 2013, D1222-1227. 41 DatabaseGoogle Scholar
- Horton R, Wilming L, Rand V, Lovering RC, Bruford EA, Khodiyar VK, Lush MJ, Povey S, Talbot CC, Wright MW, et al: Gene map of the extended human MHC. Nature reviews Genetics. 2004, 5 (12): 889-899. 10.1038/nrg1489.View ArticlePubMedGoogle Scholar
- Morishima Y, Sasazuki T, Inoko H, Juji T, Akaza T, Yamamoto K, Ishikawa Y, Kato S, Sao H, Sakamaki H, et al: The clinical significance of human leukocyte antigen (HLA) allele compatibility in patients receiving a marrow transplant from serologically HLA-A, HLA-B, and HLA-DR matched unrelated donors. Blood. 2002, 99 (11): 4200-4206. 10.1182/blood.V99.11.4200.View ArticlePubMedGoogle Scholar
- Marks C: Immunobiological determinants in organ transplantation. Annals of the Royal College of Surgeons of England. 1983, 65 (3): 139-144.PubMed CentralPubMedGoogle Scholar
- Carrington M, Nelson GW, Martin MP, Kissner T, Vlahov D, Goedert JJ, Kaslow R, Buchbinder S, Hoots K, O'Brien SJ: HLA and HIV-1: heterozygote advantage and B*35-Cw*04 disadvantage. Science (New York, NY). 1999, 283 (5408): 1748-1752. 10.1126/science.283.5408.1748.View ArticleGoogle Scholar
- Shiina T, Hosomichi K, Inoko H, Kulski JK: The HLA genomic loci map: expression, interaction, diversity and disease. Journal of human genetics. 2009, 54 (1): 15-39. 10.1038/jhg.2008.5.View ArticlePubMedGoogle Scholar
- Matsuki K, Juji T, Tokunaga K, Naohara T, Satake M, Honda Y: Human histocompatibility leukocyte antigen (HLA) haplotype frequencies estimated from the data on HLA class I, II, and III antigens in 111 Japanese narcoleptics. The Journal of clinical investigation. 1985, 76 (6): 2078-2083. 10.1172/JCI112211.PubMed CentralView ArticlePubMedGoogle Scholar
- Lincoln MR, Montpetit A, Cader MZ, Saarela J, Dyment DA, Tiislar M, Ferretti V, Tienari PJ, Sadovnick AD, Peltonen L, et al: A predominant role for the HLA class II region in the association of the MHC region with multiple sclerosis. Nature genetics. 2005, 37 (10): 1108-1112. 10.1038/ng1647.View ArticlePubMedGoogle Scholar
- Levine JE, Yang SY: SSOP typing of the Tenth International Histocompatibility Workshop reference cell lines for HLA-C alleles. Tissue antigens. 1994, 44 (3): 174-183. 10.1111/j.1399-0039.1994.tb02376.x.View ArticlePubMedGoogle Scholar
- Erlich RL, Jia X, Anderson S, Banks E, Gao X, Carrington M, Gupta N, DePristo MA, Henn MR, Lennon NJ, et al: Next-generation sequencing for HLA typing of class I loci. BMC genomics. 2011, 12: 42-10.1186/1471-2164-12-42.PubMed CentralView ArticlePubMedGoogle Scholar
- Elsner HA, Blasczyk R: Immunogenetics of HLA null alleles: implications for blood stem cell transplantation. Tissue antigens. 2004, 64 (6): 687-695. 10.1111/j.1399-0039.2004.00322.x.View ArticlePubMedGoogle Scholar
- Boegel S, Lower M, Schafer M, Bukur T, de Graaf J, Boisguerin V, Tureci O, Diken M, Castle JC, Sahin U: HLA typing from RNA-Seq sequence reads. Genome medicine. 2013, 4 (12): 102-View ArticleGoogle Scholar
- Warren RL, Choe G, Freeman DJ, Castellarin M, Munro S, Moore R, Holt RA: Derivation of HLA types from shotgun sequence datasets. Genome medicine. 2012, 4 (12): 95-10.1186/gm396.PubMed CentralView ArticlePubMedGoogle Scholar
- Bai Y, Ni M, Cooper B, Wei Y, Fury W: Inference of high resolution HLA types using genome-wide RNA or DNA sequencing reads. BMC genomics. 2014, 15: 325-10.1186/1471-2164-15-325.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England). 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.View ArticleGoogle Scholar
- Li H: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013, 1303: 3997-ArXiv e-prints.Google Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England). 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.View ArticleGoogle Scholar
- Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK: An Introduction to Variational Methods for Graphical Models. Mach Learn. 1999, 37 (2): 183-233. 10.1023/A:1007665907178.View ArticleGoogle Scholar
- Nariai N, Hirose O, Kojima K, Nagasaki M: TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference. Bioinformatics (Oxford, England). 2013, 29 (18): 2292-2299. 10.1093/bioinformatics/btt381.View ArticleGoogle Scholar
- Dawson E, Chen Y, Hunt S, Smink LJ, Hunt A, Rice K, Livingston S, Bumpstead S, Bruskiewich R, Sham P, et al: A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence. Genome research. 2001, 11 (1): 170-178. 10.1101/gr.156901.PubMed CentralView ArticlePubMedGoogle Scholar
- Comas D, Mateu E, Calafell F, Perez-Lezaun A, Bosch E, Martinez-Arias R, Bertranpetit J: HLA class I and class II DNA typing and the origin of Basques. Tissue antigens. 1998, 51 (1): 30-40. 10.1111/j.1399-0039.1998.tb02944.x.View ArticlePubMedGoogle Scholar
- Major E, Rigo K, Hague T, Berces A, Juhos S: HLA typing from 1000 genomes whole genome and whole exome illumina data. PloS one. 2013, 8 (11): e78410-10.1371/journal.pone.0078410.PubMed CentralView ArticlePubMedGoogle Scholar
- Hosomichi K, Jinam TA, Mitsunaga S, Nakaoka H, Inoue I: Phase-defined complete sequencing of the HLA genes by next-generation sequencing. BMC genomics. 2013, 14: 355-10.1186/1471-2164-14-355.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.