Overview of our pipeline
Our pipeline consists of two steps, as shown in Fig. 1. First, for each read and each HLA type, an HLA read score (HR score) is calculated that quantifies the likelihood that the read comes from the HLA type. Based on the calculated HR scores, we determine whether or not the read comes from a certain HLA gene. For example, by aligning read x to the reference sequences in HLA databases, we obtain the HR scores shown in the bar graph of Fig. 1a. If the maximum HR score for the HLA-A gene is large enough and the difference between the maximum scores for the HLA-A gene and the other HLA genes is also large, we conclude that read x is most likely specific to the HLA-A gene. Otherwise, read x is judged to have been produced from other regions. HLA genotyping is then performed using the reads collected for each HLA gene, as shown in Fig. 1b. ALPHLARD outputs candidate pairs of HLA types according to their Bayesian posterior probabilities.
HLA reference data
We used HLA reference information that can be obtained from the IPD-IMGT/HLA database (release 3.28.0) [18]. There are two types of HLA reference sequences in the database: one is a complete genomic reference and the other is an exonic reference without non-coding regions. Some HLA types have both genomic and exonic reference information, but most HLA types have only exonic reference information.
The database also provides multiple sequence alignments (MSAs) at the genomic and the exonic levels for each HLA gene. We combined the two MSAs into a common MSA as follows. First, gaps were inserted into the exons of the genomic MSA for consistency with the exonic reference sequences. Then, the missing non-coding sequences of the exonic references were filled in from the most similar genomic reference sequences. This integrated MSA is then used for alignment and realignment of the reads.
Collection and realignment of reads
First, all reads are mapped to a human reference genome, and reads mapped to the HLA region, together with unmapped reads, are used in the next step. We use hg19 [19] as the reference sequence and define the HLA region as chr6:28,477,797-33,448,354, which covers the HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 genes.
Next, the filtered reads are mapped to all HLA genomic and exonic reference sequences. We use BWA-MEM (version 0.7.10) [20] with the -a option to output all found alignments. Then, each mapped read is filtered based on whether or not it is likely to have been produced by the target HLA gene. This filtering is performed according to the HR score si,j for the ith read xi and the jth HLA type tj, and is similar to the filtering procedure used in HLAforest [11]. If xi is not aligned to tj, si,j is set to −∞. Otherwise, let \((\tilde {x}_{i,j}, \tilde {t}_{i,j})\) be the alignment of xi and tj, which might include some gaps. \(\tilde {x}_{{i,j,n}}\) and \(\tilde {t}_{{i,j,n}}\) are defined as the nth bases or gaps of \(\tilde {x}_{i,j}\) and \(\tilde {t}_{i,j}\), respectively, and \(\tilde {b}_{{i,j,n}}\) is defined as the base quality of \(\tilde {x}_{{i,j,n}}\). Let \(\tilde {p}_{{i,j,n}}\) be the probability of a mismatch between \(\tilde {x}_{{i,j,n}}\) and \(\tilde {t}_{{i,j,n}}\), which can be calculated by
$$ \tilde{p}_{i,j,n} = 10^{-\frac{\tilde{b}_{i,j,n}}{10}}. $$
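As a quick sketch, this Phred-to-probability conversion is a one-liner:

```python
def phred_to_mismatch_prob(quality: float) -> float:
    """Convert a Phred base-quality score b into the mismatch
    probability p = 10^(-b/10)."""
    return 10 ** (-quality / 10)

# A base quality of 30 corresponds to a 1-in-1000 error probability.
print(phred_to_mismatch_prob(30))  # → 0.001
```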
Then, the HR score si,j is given by
$$ s_{i,j} = \sum\limits_{n} \left(\tilde{\alpha}_{i,j,n} + \tilde{\beta}_{i,j,n}\right), $$
where
$$\begin{array}{*{20}l} \tilde{\alpha}_{i,j,n} & =\left\{ \begin{array}{ll} \log (\frac{\tilde{p}_{i,j,n}}{3}) & (\text{if } \tilde{x}_{i,j,n}, \tilde{t}_{i,j,n} \in B \text{ and } \tilde{x}_{i,j,n} \neq \tilde{t}_{i,j,n}) \\ \alpha^{\texttt{d,o}} & (\text{if } \tilde{x}_{i,j,n} = {-} \text{ and } \tilde{x}_{i,j,n-1} \neq {-}) \\ \alpha^{\texttt{d,e}} & (\text{if } \tilde{x}_{i,j,n} = {-} \text{ and } \tilde{x}_{i,j,n-1} = {-}) \\ \alpha^{\texttt{i,o}} & (\text{if } \tilde{t}_{i,j,n} = {-} \text{ and } \tilde{t}_{i,j,n-1} \neq {-}) \\ \alpha^{\texttt{i,e}} & (\text{if } \tilde{t}_{i,j,n} = {-} \text{ and } \tilde{t}_{i,j,n-1} = {-}) \\ \alpha^{\texttt{N}} & \left(\begin{array}{l} \text{if } \tilde{x}_{i,j,n} = \mathtt{N} \text{ and } \tilde{t}_{i,j,n} \in B^{\texttt{N}} \\ \quad \text{or }\tilde{x}_{i,j,n} \in B^{\mathtt{N}} \text{ and } \tilde{t}_{i,j,n} = \mathtt{N} \end{array}\right) \\ 0 & \text{(otherwise)} \end{array}\right.\! {,} \\ \tilde{\beta}_{i,j,n} & = \left\{\begin{array}{ll} \beta & (\text{if } \tilde{x}_{i,j,n} \in B^{\mathtt{N}}) \\ 0 & \text{(otherwise)} \end{array}\right. {.} \end{array} $$
Here, B={A,C,G,T} and BN={A,C,G,T,N}. The parameters αd,o, αd,e, αi,o, αi,e, and αN take negative values as penalties for opening a deletion, extending a deletion, opening an insertion, extending an insertion, and an N in the read or the HLA type, respectively. β is a positive constant reward per aligned read base, which favors longer reads. Then, the score of xi for the target HLA gene \(s^{*}_{i}\) and the score of xi for the non-target HLA genes \(\bar {s}^{*}_{i}\) are defined by
$$ s^{*}_{i} = \max_{t_{j} \in T} s_{i,j}, \quad \bar{s}^{*}_{i} = \max_{t_{j} \notin T} s_{i,j}, $$
where T is the set of HLA types in the target HLA gene. \(s^{*}_{i}\) and \(\bar {s}^{*}_{i}\) indicate how likely xi is to be produced by the target HLA gene and the non-target HLA genes, respectively.
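To make the scoring concrete, the following is a minimal Python sketch of the HR score and of \(s^{*}_{i}\) and \(\bar{s}^{*}_{i}\); the gap penalties, N penalty, and reward β are illustrative placeholders, not the values used by ALPHLARD.

```python
import math

B = set("ACGT")

def hr_score(read_aln, ref_aln, quals,
             a_do=-6.0, a_de=-1.0, a_io=-6.0, a_ie=-1.0, a_n=-2.0,
             beta=1.0):
    """HR score of one read against one HLA type, summed over the
    columns of their alignment (gaps written as '-').
    quals[n] is the Phred quality of the read base in column n
    (ignored at read gaps). Penalty values are illustrative."""
    score = 0.0
    for n, (x, t) in enumerate(zip(read_aln, ref_aln)):
        if x in B and t in B and x != t:          # mismatch
            p = 10 ** (-quals[n] / 10)
            alpha = math.log(p / 3)
        elif x == "-":                            # deletion in the read
            alpha = a_de if n > 0 and read_aln[n - 1] == "-" else a_do
        elif t == "-":                            # insertion in the read
            alpha = a_ie if n > 0 and ref_aln[n - 1] == "-" else a_io
        elif x == "N" or t == "N":                # N in read or HLA type
            alpha = a_n
        else:                                     # match
            alpha = 0.0
        beta_term = beta if x in B or x == "N" else 0.0  # length reward
        score += alpha + beta_term
    return score

def target_scores(scores, target_types):
    """scores: {HLA type: HR score}; target_types: the set T.
    Returns (s*_i, s_bar*_i)."""
    s_star = max(s for t, s in scores.items() if t in target_types)
    s_bar = max(s for t, s in scores.items() if t not in target_types)
    return s_star, s_bar
```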
Thus, when xi is an unpaired read, it is used for HLA genotyping if
$$ s^{*}_{i} > \theta^{u,m}, \quad s^{*}_{i} - \bar{s}^{*}_{i} > \theta^{u,d}, $$
where θu,m and θu,d are constant thresholds. When xi and \(x_{i^{\prime }}\) are paired, they are used for HLA genotyping if
$$ s^{*}_{i} + s^{*}_{i^{\prime}} > \theta^{p,m}, \quad \left(s^{*}_{i} + s^{*}_{i^{\prime}}\right) - \left(\bar{s}^{*}_{i} + \bar{s}^{*}_{i^{\prime}}\right) > \theta^{p,d}, $$
where θp,m and θp,d are constant thresholds. Paired reads are generally more effective than unpaired reads; hence, θp,m and θp,d should be less than θu,m and θu,d, respectively.
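The two filtering rules above can be written as simple predicates; the threshold values below are illustrative placeholders, not the published ones.

```python
def keep_unpaired(s_star, s_bar, theta_um=-10.0, theta_ud=5.0):
    """Keep an unpaired read if its best target-gene score exceeds
    theta_um and is separated from the best non-target score by more
    than theta_ud. Threshold values are illustrative."""
    return s_star > theta_um and s_star - s_bar > theta_ud

def keep_paired(s1, s2, sb1, sb2, theta_pm=-15.0, theta_pd=3.0):
    """The same rule applied to the summed scores of a read pair;
    the paired thresholds are set below the unpaired ones."""
    total, total_bar = s1 + s2, sb1 + sb2
    return total > theta_pm and total - total_bar > theta_pd
```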
In the next step, all of the collected reads are realigned as follows. First, \(t_{j^{*}}\) is defined as the best type for xi in the target gene, which is obtained by
$$ j^{*} = {\underset{j: t_{j} \in T}{\arg\max}} s_{i,j}. $$
Then, xi is realigned to be consistent with the alignment \(\left (\tilde {x}_{i,j^{*}}, \tilde {t}_{i,j^{*}}\right)\) and the integrated MSA of the target HLA gene.
Bayesian model for analyzing HLA genes
Analysis of the target HLA gene by ALPHLARD is performed using the collected and realigned reads. Let \(\hat {x}_{i}\) be the ith paired (or unpaired) read(s) collected and realigned with the previous procedure, \(\hat {x}_{i,n}\) be the nth base or gap of \(\hat {x}_{i}\), and \(\hat {b}_{i,n}\) be the base quality of \(\hat {x}_{i,n}\). Note that, hereafter, we regard paired reads as one sequence. The probability of mismatch \(\hat {p}_{i,n}\) can be calculated by
$$ \hat{p}_{i,n} = 10^{-\frac{\hat{b}_{i,n}}{10}}. $$
Suppose that \(R^{r}_{1}\) and \(R^{r}_{2}\) are the HLA types of the sample, and that \(S^{r}_{1}\) and \(S^{r}_{2}\) are the true HLA sequences of the sample, which are introduced because the HLA sequences of the sample might not be registered in the reference (IPD-IMGT/HLA) database. Let \(R^{d}_{1}, R^{d}_{2},...,\) be decoy HLA types and \(S^{d}_{1}, S^{d}_{2},...,\) be decoy HLA sequences. These decoy parameters make the analysis robust when reads from non-target homologous regions are misclassified into the target HLA gene at the preceding filtering step. We will sometimes use R1,R2,R3,R4,..., and S1,S2,S3,S4,..., instead of \(R^{r}_{1}, R^{r}_{2}, R^{d}_{1}, R^{d}_{2},...,\) and \(S^{r}_{1}, S^{r}_{2}, S^{d}_{1}, S^{d}_{2},...,\) for convenience. Ii is defined as an indicator parameter specifying which sequence produced the read \(\hat {x}_{i}\); that is, Ii=k means that \(\hat {x}_{i}\) was generated from Sk. Then, the posterior probability of the parameters \(p(\mathcal {R}, \mathcal {S}, \mathcal {I} | \hat {X})\) is given by
$$ p\left(\mathcal{R}, \mathcal{S}, \mathcal{I} | \hat{X}\right) \propto p\left(\hat{X} | \mathcal{S}, \mathcal{I}\right) p\left(\mathcal{R}, \mathcal{S}\right) p\left(\mathcal{I}\right), $$
where \(\hat {X} = \left \{ \hat {x}_{1}, \hat {x}_{2},... \right \}\), \(\mathcal {R} = \left \{ R_{1}, R_{2},..., \right \}\), \(\mathcal {S} = \left \{ S_{1}, S_{2},..., \right \}\), and \(\mathcal {I} = \left \{ I_{1}, I_{2},... \right \}\).
The likelihood function \(p\left (\hat {X} | \mathcal {S}, \mathcal {I}\right)\) is defined by
$$ p\left(\hat{X} | \mathcal{S}, \mathcal{I}\right) = \prod_{i} p\left(\hat{x}_{i} | S_{I_{i}}\right). $$
The likelihood of each read \(p(\hat {x}_{i} | S_{k})\) is given by
$$ p\left(\hat{x}_{i} | S_{k}\right) = \prod_{n} p\left(\hat{x}_{i,n} | S_{k,n}\right), $$
where Sk,n is the nth base or gap of Sk. The likelihood of each base \(p\left (\hat {x}_{i,n} | S_{k,n}\right)\) is calculated by
$${{}\begin{aligned} & p(\hat{x}_{i,n} | S_{k,n} \! \in \! B) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \gamma^{\mathtt{d}}) (1 \! - \! \gamma^{\mathtt{N}}) (1 \! - \! \hat{p}_{i,n}) & (\text{if } \hat{x}_{i,n} \! = \! S_{k,n}) \\ (1 \! - \! \gamma^{\mathtt{d}}) (1 \! - \! \gamma^{\mathtt{N}}) \frac{\hat{p}_{i,n}}{3} & (\text{if }\hat{x}_{i,n} \! \in \! B \text{ and } \hat{x}_{i,n} \! \neq \! S_{k,n}) \\ (1 \! - \! \gamma^{\mathtt{d}}) \gamma^{\mathtt{N}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{N}) \\ \gamma^{\mathtt{d}} & (\text{if }\hat{x}_{i,n} \! = \! \mathtt{-}) \end{array}\right.\!\! {,} \\ & p(\hat{x}_{i,n} | S_{k,n} \! = \! \mathtt{N}) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \gamma^{\mathtt{d}}) (1 \! - \! \gamma^{\mathtt{N}}) \frac{1}{4} & (\text{if } \hat{x}_{i,n} \! \in \! B) \\ (1 \! - \! \gamma^{\mathtt{d}}) \gamma^{\mathtt{N}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{N}) \\ \gamma^{\mathtt{d}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{-}) \end{array}\right. {,} \\ & p(\hat{x}_{i,n} | S_{k,n} \! = \! \mathtt{-}) \\ & \ \ = \left\{\begin{array}{ll} \gamma^{\mathtt{i}} (1 \! - \! \gamma^{\mathtt{N}}) \frac{1}{4} & (\text{if } \hat{x}_{i,n} \! \in \! B) \\ \gamma^{\mathtt{i}} \gamma^{\mathtt{N}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{N}) \\ 1 \! - \! \gamma^{\mathtt{i}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{-}) \end{array}\right. {.} \end{aligned}} $$
Here, γd, γi, and γN are the probabilities of a deletion error, an insertion error, and N, respectively.
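The base-level likelihood can be transcribed directly; the values of γd, γi, and γN below are illustrative, not those used by ALPHLARD.

```python
B = set("ACGT")

def base_likelihood(x, s, p, gamma_d=0.001, gamma_i=0.001, gamma_n=0.001):
    """p(x | s): x is the read base/gap, s is the HLA sequence
    base/gap, and p is the mismatch probability from the base
    quality. The gamma error rates are illustrative placeholders."""
    if s in B:
        if x == s:
            return (1 - gamma_d) * (1 - gamma_n) * (1 - p)
        if x in B:                               # mismatching base
            return (1 - gamma_d) * (1 - gamma_n) * p / 3
        if x == "N":
            return (1 - gamma_d) * gamma_n
        return gamma_d                           # x == '-'
    if s == "N":
        if x in B:
            return (1 - gamma_d) * (1 - gamma_n) / 4
        if x == "N":
            return (1 - gamma_d) * gamma_n
        return gamma_d                           # x == '-'
    # s == '-'
    if x in B:
        return gamma_i * (1 - gamma_n) / 4
    if x == "N":
        return gamma_i * gamma_n
    return 1 - gamma_i                           # x == '-'
```

Note that for each fixed s, the probabilities sum to one over all read outcomes x ∈ {A,C,G,T,N,−}, which is a useful sanity check.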
The prior probability of the HLA types and the HLA sequences \(p(\mathcal {R}, \mathcal {S})\) is defined by
$$ p\left(\mathcal{R}, \mathcal{S}\right) = \prod_{k} p\left(R_{k}\right) p\left(S_{k} | R_{k}\right). $$
Here, \(p(R^{r}_{k})\) is the prior probability of the HLA type, which is calculated using The Allele Frequency Net Database [21]. On the other hand, \(p(R^{d}_{k})\) is the prior probability of the decoy HLA type, which we assume to be constant. The prior probability of the HLA sequence p(Sk|Rk) is given by
$$ p\left(S_{k} | R_{k}\right) = \prod_{n} p\left(S_{k,n} | R_{k,n}\right), $$
where Rk,n is the nth base or gap of Rk in the integrated MSA. The probability of a germline variant p(Sk,n|Rk,n) is calculated by
$${{}\begin{aligned} & p(S_{k,n} | R_{k,n} \! \in \! B) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \delta^{\texttt{d}}) (1 \! - \! \delta^{\texttt{N}}) (1 \! - \! \delta^{\texttt{s}}) & \text{(if \(S_{k,n} \! = \! R_{k,n}\))} \\ (1 \! - \! \delta^{\texttt{d}}) (1 \! - \! \delta^{\texttt{N}}) \frac{\delta^{\texttt{s}}}{3} & \text{(if \(S_{k,n} \! \in \! B\) and \(S_{k,n} \! \neq \! R_{k,n}\))} \\ (1 \! - \! \delta^{\texttt{d}}) \delta^{\texttt{N}} & \text{(if \(S_{k,n} \! = \! \texttt{N}\))} \\ \delta^{\texttt{d}} & \text{(if \(S_{k,n} \! = \! \texttt{-}\))} \end{array}\right. \! {,} \\ & p(S_{k,n} | R_{k,n} \! = \! \texttt{N}) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \delta^{\texttt{d}}) (1 \! - \! \delta^{\texttt{N}}) \frac{1}{4} & \text{(if \(S_{k,n} \! \in \! B\))} \\ (1 \! - \! \delta^{\texttt{d}}) \delta^{\texttt{N}} & \text{(if \(S_{k,n} \! = \! \texttt{N}\))} \\ \delta^{\texttt{d}} & \text{(if \(S_{k,n} \! = \! \texttt{-}\))} \end{array}\right. {,} \\ & p(S_{k,n} | R_{k,n} \! = \! \texttt{-}) \\ & \ \ = \left\{\begin{array}{ll} \delta^{\texttt{i}} (1 \! - \! \delta^{\texttt{N}}) \frac{1}{4} & \text{(if \(S_{k,n} \! \in \! B\))} \\ \delta^{\texttt{i}} \delta^{\texttt{N}} & \text{(if \(S_{k,n} \! = \! \texttt{N}\))} \\ 1 \! - \! \delta^{\texttt{i}} & \text{(if \(S_{k,n} \! = \! \texttt{-}\))} \end{array}\right. {.} \end{aligned}} $$
Here, δs, δd, δi, and δN are the probabilities of a true substitution, a true deletion, a true insertion, and a true N, respectively. Different values are used for these hyperparameters depending on whether Rk,n was imputed from the most similar reference base, which reduces the influence of misimputations. Sk,n tends to become N when its base is ambiguous.
The prior probability of the indicator variables \(p(\mathcal {I})\) is defined by
$$ p(\mathcal{I}) = \prod_{i} p(I_{i}). $$
Here, p(Ii) is the prior probability of the indicator variable, which is calculated by
$${\begin{aligned} p(I_{i}) \propto \left\{\begin{array}{ll} 1 & \text{(if } I_{i} = 1 \text{ or } I_{i} = 2) \\ \epsilon & \text{(otherwise)} \end{array}\right. {.} \end{aligned}} $$
ε reflects how likely the reads are to be produced by non-target homologous regions.
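The normalized indicator prior can be sketched as follows, with an illustrative value of ε:

```python
def indicator_prior(num_decoys, eps=0.01):
    """Prior over I_i for k = 1..(2 + num_decoys): weight 1 for the
    two true HLA sequences, eps for each decoy sequence, then
    normalized. eps = 0.01 is an illustrative value."""
    weights = [1.0, 1.0] + [eps] * num_decoys
    z = sum(weights)
    return [w / z for w in weights]
```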
Efficient sampling with elaborate MCMC schemes
The parameters of the model above are sampled using two Markov chain Monte Carlo (MCMC) schemes, Gibbs sampling and the Metropolis-Hastings algorithm, with parallel tempering to make the parameter sampling efficient. Gibbs sampling is mainly used for local search, and Metropolis-Hastings sampling is periodically used for more global search. For the Metropolis-Hastings algorithm, we constructed two novel proposal distributions that enable the parameters to jump from mode to mode and lead to more efficient sampling.
One of the proposal distributions focuses on positions not covered by any read. First, \(S^{\texttt {N}}_{k}\) is defined as a modified HLA sequence whose bases are replaced with Ns at positions not covered by any read produced by Sk, which is given by
$${\begin{aligned} S^{\texttt{N}}_{k,n} = \left\{\begin{array}{ll} S_{k,n} & \text{(if } \exists i; I_{i} = k \text{ and } \hat{x}_{i} \text{ covers the } n^{\text{th}} \text{ base of } S_{k}) \\ \texttt{N} & \text{(otherwise)} \end{array}\right. {.} \end{aligned}} $$
A candidate HLA type and a candidate HLA sequence are then sampled based on
$$\begin{array}{*{20}l} R^{*}_{k} & \sim p\left(R^{*}_{k} | S^{\texttt{N}}_{k}\right), \\ S^{*}_{k} & \sim p\left(S^{*}_{k} | R^{*}_{k}, \mathcal{I}, \hat{X}\right). \end{array} $$
Then, the acceptance rate r can be calculated based on the Metropolis-Hastings algorithm, which is given by
$$\begin{array}{*{20}l} r & = \min \left(1, r^{*}\right), \\ r^{*} & = \frac{p\left(R^{*}_{k}, S^{*}_{k} | \mathcal{I}, \hat{X}\right) p\left(R^{*}_{k}, S^{*}_{k} \to R_{k}, S_{k} | \mathcal{I}, \hat{X}\right)} {p\left(R_{k}, S_{k} | \mathcal{I}, \hat{X}\right) p\left(R_{k}, S_{k} \to R^{*}_{k}, S^{*}_{k} | \mathcal{I}, \hat{X}\right)} \\ & = \frac{p\left(S^{\texttt{N}}_{k} | R_{k}\right) {\sum\nolimits}_{S} p\left(\hat{X} | S, \mathcal{I}\right) p\left(S | R^{*}_{k}\right)} {p\left(S^{\texttt{N}}_{k} | R^{*}_{k}\right) {\sum\nolimits}_{S} p\left(\hat{X} | S, \mathcal{I}\right) p\left(S | R_{k}\right)}. \end{array} $$
This proposal distribution makes the sampling more efficient when uncovered positions leave ambiguity in the HLA types. For example, let tj and \(t_{j^{\prime }}\) be HLA types that differ by only one mismatch at the nth position. If a sample has tj as an HLA type but no reads from tj cover the nth position, we cannot determine whether the HLA type is tj or \(t_{j^{\prime }}\). However, once Rk becomes \(t_{j^{\prime }}\), Sk,n becomes the nth base of \(t_{j^{\prime }}\) with high probability, which in turn keeps Rk at \(t_{j^{\prime }}\) with high probability, and this cycle repeats. This happens because Rk and Sk are sampled separately in the Gibbs sampling despite their high correlation. The proposal distribution prevents the parameters from getting stuck by sampling them simultaneously.
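For reference, both proposal distributions instantiate the generic Metropolis-Hastings accept/reject rule sketched below; in ALPHLARD the ratio r* is the marginalized expression given above, while the functions here are abstract placeholders.

```python
import math
import random

def mh_step(current, propose, log_target, log_proposal, rng=random):
    """One Metropolis-Hastings update in log space.
    propose(x) draws a candidate, log_target(x) evaluates the
    unnormalized log posterior, and log_proposal(a, b) evaluates
    log q(a -> b). The candidate is accepted with probability
    min(1, r*)."""
    candidate = propose(current)
    log_r = (log_target(candidate) + log_proposal(candidate, current)
             - log_target(current) - log_proposal(current, candidate))
    if math.log(rng.random()) < min(0.0, log_r):
        return candidate  # accept
    return current        # reject
```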
The other proposal distribution swaps non-decoy and decoy parameters. In this proposal distribution, indices for non-decoy and decoy parameters are uniformly sampled, and the HLA types and the HLA sequences at the indices are swapped. After swapping, candidate indicator variables are sampled based on the conditional distribution given the swapped parameters. Suppose that \(\mathcal {R^{*}}\) and \(\mathcal {S^{*}}\) are HLA types and HLA sequences after swapping, and that \(\mathcal {I^{*}}\) is a set of candidate indicator variables. Then, the acceptance rate r can be calculated by
$$\begin{array}{*{20}l} r & = \min \left(1, r^{*}\right), \\ r^{*} & = \frac{p\left(\mathcal{R^{*}}, \mathcal{S^{*}}, \mathcal{I^{*}} | \hat{X}\right) p\left(\mathcal{R^{*}}, \mathcal{S^{*}}, \mathcal{I^{*}} \to \mathcal{R}, \mathcal{S}, \mathcal{I} | \hat{X}\right)} {p\left(\mathcal{R}, \mathcal{S}, \mathcal{I} | \hat{X}\right) p\left(\mathcal{R}, \mathcal{S}, \mathcal{I} \to \mathcal{R^{*}}, \mathcal{S^{*}}, \mathcal{I^{*}} | \hat{X}\right)} \\ & = \frac{p\left(\mathcal{R^{*}}\right) {\sum\nolimits}_{\mathcal{I}} p\left(\hat{X} | \mathcal{S^{*}}, \mathcal{I}\right) p\left(\mathcal{I}\right)} {p\left(\mathcal{R}\right) {\sum\nolimits}_{\mathcal{I}} p\left(\hat{X} | \mathcal{S}, \mathcal{I}\right) p\left(\mathcal{I}\right)}. \end{array} $$
This proposal distribution enables reads from the target HLA gene to be quickly distinguished from reads from non-target homologous regions.
Some procedures are used in the burn-in period to avoid getting stuck in local optima. At the beginning of sampling, a multi-start strategy is used to reduce the influence of the initial parameters: several MCMC runs are carried out, and the initial parameters are sampled from the last parameters of those runs. In addition, reference sequences are periodically copied to the HLA sequences during the burn-in period. This helps find better parameters because there are many local optima in which the parameters of the HLA sequences are twisted as if crossovers had occurred. Although these two approaches do not satisfy the detailed balance condition, they are carried out only in the burn-in period, so the sampled parameters still correctly reflect the posterior distribution.
After sampling the parameters, HLA genotyping is performed by counting \(R^{r}_{1}\) and \(R^{r}_{2}\); we use the most frequently sampled HLA genotype in the MCMC process as the candidate. The HLA sequences of a sample can also be inferred by counting \(S^{r}_{1}\) and \(S^{r}_{2}\). Even though misimputations in the HLA reference sequences can cause mistyping, the HLA sequence itself can still be determined correctly, so both the misimputations and the resulting mistyping can be corrected.
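As a final sketch, the genotype call can be obtained by counting the sampled pairs order-insensitively; the allele names in the comment are illustrative.

```python
from collections import Counter

def most_sampled_genotype(samples):
    """samples: list of (R^r_1, R^r_2) pairs recorded after burn-in.
    Returns the most frequently sampled genotype, treating (a, b)
    and (b, a) as the same genotype."""
    counts = Counter(tuple(sorted(pair)) for pair in samples)
    return counts.most_common(1)[0][0]

# e.g. draws dominated by {A*01:01, A*02:01} yield that pair.
```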