Overview of our pipeline
Our pipeline consists of two steps, as shown in Fig. 1. First, for each read and each HLA type, an HLA read score (HR score) is calculated that quantifies the likelihood that the read comes from the HLA type. Based on the calculated HR scores, we determine whether or not the read comes from a certain HLA gene. For example, by aligning read x to the reference sequences in HLA databases, we obtain the HR scores shown in the bar graph of Fig. 1a. If the maximum HR score for the HLA-A gene is large enough and the difference between the maximum scores for the HLA-A gene and the other HLA genes is also large, we conclude that read x is most likely specific to the HLA-A gene. Otherwise, read x is judged to have been produced from other regions. HLA genotyping is then performed using the reads collected for each HLA gene, as shown in Fig. 1b. ALPHLARD outputs candidate pairs of HLA types according to their Bayesian posterior probabilities.
HLA reference data
We used HLA reference information that can be obtained from the IPD-IMGT/HLA database (release 3.28.0) [18]. There are two types of HLA reference sequences in the database: one is a complete genomic reference and the other is an exonic reference without non-coding regions. Some HLA types have both genomic and exonic reference information, but most HLA types have only exonic reference information.
The database also provides multiple sequence alignments (MSAs) at the genomic and the exonic levels for each HLA gene. We combined the two MSAs into a common MSA as follows. First, gaps were inserted into the exons of the genomic MSA for consistency with the exonic reference sequences. Then, the missing non-coding sequences of the exonic references were filled in from the most similar genomic reference sequences. This integrated MSA is then used for alignment and realignment of the reads.
Collection and realignment of reads
First, all reads are mapped to a human reference genome, and reads mapped to the HLA region, together with unmapped reads, are used in the next step. We use hg19 [19] as the reference sequence and define the HLA region as chr6:28,477,797-33,448,354, which covers the HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 genes.
Next, the filtered reads are mapped to all HLA genomic and exonic reference sequences. We use BWA-MEM (version 0.7.10) [20] with the -a option to output all found alignments. Then, each mapped read is filtered based on whether or not it is likely to have been produced by the target HLA gene. This filtering is performed according to the HR score si,j for the ith read xi and the jth HLA type tj, and is similar to the filtering procedure used in HLAforest [11]. If xi is not aligned to tj, si,j is set to −∞. Otherwise, let \((\tilde {x}_{i,j}, \tilde {t}_{i,j})\) be the alignment of xi and tj, which might include some gaps. \(\tilde {x}_{{i,j,n}}\) and \(\tilde {t}_{{i,j,n}}\) are defined as the nth bases or gaps of \(\tilde {x}_{i,j}\) and \(\tilde {t}_{i,j}\), respectively, and \(\tilde {b}_{{i,j,n}}\) is defined as the base quality of \(\tilde {x}_{{i,j,n}}\). Let \(\tilde {p}_{{i,j,n}}\) be the probability of a mismatch between \(\tilde {x}_{{i,j,n}}\) and \(\tilde {t}_{{i,j,n}}\), which can be calculated by
$$ \tilde{p}_{i,j,n} = 10^{-\frac{\tilde{b}_{i,j,n}}{10}}. $$
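As a quick sketch, this Phred-to-probability conversion is a one-liner:

```python
def phred_to_mismatch_prob(quality: float) -> float:
    """Convert a Phred base-quality score b into the mismatch
    probability p = 10^(-b/10)."""
    return 10 ** (-quality / 10)

# A base quality of 30 corresponds to a 1-in-1000 error probability.
print(phred_to_mismatch_prob(30))  # → 0.001
```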
Then, the HR score si,j is given by
$$ s_{i,j} = \sum\limits_{n} \left(\tilde{\alpha}_{i,j,n} + \tilde{\beta}_{i,j,n}\right), $$
where
$$\begin{array}{*{20}l} \tilde{\alpha}_{i,j,n} & =\left\{ \begin{array}{ll} \log (\frac{\tilde{p}_{i,j,n}}{3}) & (\text{if } \tilde{x}_{i,j,n}, \tilde{t}_{i,j,n} \in B \text{ and } \tilde{x}_{i,j,n} \neq \tilde{t}_{i,j,n}) \\ \alpha^{\texttt{d,o}} & (\text{if } \tilde{x}_{i,j,n} = {-} \text{ and } \tilde{x}_{i,j,n-1} \neq {-}) \\ \alpha^{\texttt{d,e}} & (\text{if } \tilde{x}_{i,j,n} = {-} \text{ and } \tilde{x}_{i,j,n-1} = {-}) \\ \alpha^{\texttt{i,o}} & (\text{if } \tilde{t}_{i,j,n} = {-} \text{ and } \tilde{t}_{i,j,n-1} \neq {-}) \\ \alpha^{\texttt{i,e}} & (\text{if } \tilde{t}_{i,j,n} = {-} \text{ and } \tilde{t}_{i,j,n-1} = {-}) \\ \alpha^{\texttt{N}} & \left(\begin{array}{l} \text{if } \tilde{x}_{i,j,n} = \mathtt{N} \text{ and } \tilde{t}_{i,j,n} \in B^{\texttt{N}} \\ \quad \text{or }\tilde{x}_{i,j,n} \in B^{\mathtt{N}} \text{ and } \tilde{t}_{i,j,n} = \mathtt{N} \end{array}\right) \\ 0 & \text{(otherwise)} \end{array}\right.\! {,} \\ \tilde{\beta}_{i,j,n} & = \left\{\begin{array}{ll} \beta & (\text{if } \tilde{x}_{i,j,n} \in B^{\mathtt{N}}) \\ 0 & \text{(otherwise)} \end{array}\right. {.} \end{array} $$
Here, B={A,C,G,T} and BN={A,C,G,T,N}. The parameters αd,o, αd,e, αi,o, αi,e, and αN take negative values as penalties for opening a deletion, extending a deletion, opening an insertion, extending an insertion, and an N in the read or the HLA type, respectively. β is a positive constant reward per aligned read base, which favors longer reads. Then, the score of xi for the target HLA gene \(s^{*}_{i}\) and the score of xi for the non-target HLA genes \(\bar {s}^{*}_{i}\) are defined by
$$ s^{*}_{i} = \max_{t_{j} \in T} s_{i,j}, \quad \bar{s}^{*}_{i} = \max_{t_{j} \notin T} s_{i,j}, $$
where T is the set of HLA types in the target HLA gene. \(s^{*}_{i}\) and \(\bar {s}^{*}_{i}\) indicate how likely xi is to be produced by the target HLA gene and the non-target HLA genes, respectively.
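To make the scoring concrete, the following is a minimal Python sketch of the HR score and of \(s^{*}_{i}\) and \(\bar{s}^{*}_{i}\); the gap penalties, N penalty, and reward β are illustrative placeholders, not the values used by ALPHLARD.

```python
import math

B = set("ACGT")

def hr_score(read_aln, ref_aln, quals,
             a_do=-6.0, a_de=-1.0, a_io=-6.0, a_ie=-1.0, a_n=-2.0,
             beta=1.0):
    """HR score of one read against one HLA type, summed over the
    columns of their alignment (gaps written as '-').
    quals[n] is the Phred quality of the read base in column n
    (ignored at read gaps). Penalty values are illustrative."""
    score = 0.0
    for n, (x, t) in enumerate(zip(read_aln, ref_aln)):
        if x in B and t in B and x != t:          # mismatch
            p = 10 ** (-quals[n] / 10)
            alpha = math.log(p / 3)
        elif x == "-":                            # deletion in the read
            alpha = a_de if n > 0 and read_aln[n - 1] == "-" else a_do
        elif t == "-":                            # insertion in the read
            alpha = a_ie if n > 0 and ref_aln[n - 1] == "-" else a_io
        elif x == "N" or t == "N":                # N in read or HLA type
            alpha = a_n
        else:                                     # match
            alpha = 0.0
        beta_term = beta if x in B or x == "N" else 0.0  # length reward
        score += alpha + beta_term
    return score

def target_scores(scores, target_types):
    """scores: {HLA type: HR score}; target_types: the set T.
    Returns (s*_i, s_bar*_i)."""
    s_star = max(s for t, s in scores.items() if t in target_types)
    s_bar = max(s for t, s in scores.items() if t not in target_types)
    return s_star, s_bar
```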
Thus, when xi is an unpaired read, it is used for HLA genotyping if
$$ s^{*}_{i} > \theta^{u,m}, \quad s^{*}_{i} - \bar{s}^{*}_{i} > \theta^{u,d}, $$
where θu,m and θu,d are constant thresholds. When xi and \(x_{i^{\prime }}\) are paired, they are used for HLA genotyping if
$$ s^{*}_{i} + s^{*}_{i^{\prime}} > \theta^{p,m}, \quad \left(s^{*}_{i} + s^{*}_{i^{\prime}}\right) - \left(\bar{s}^{*}_{i} + \bar{s}^{*}_{i^{\prime}}\right) > \theta^{p,d}, $$
where θp,m and θp,d are constant thresholds. Paired reads are generally more effective than unpaired reads; hence, θp,m and θp,d should be less than θu,m and θu,d, respectively.
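The two filtering rules above can be written as simple predicates; the threshold values below are illustrative placeholders, not the published ones.

```python
def keep_unpaired(s_star, s_bar, theta_um=-10.0, theta_ud=5.0):
    """Keep an unpaired read if its best target-gene score exceeds
    theta_um and is separated from the best non-target score by more
    than theta_ud. Threshold values are illustrative."""
    return s_star > theta_um and s_star - s_bar > theta_ud

def keep_paired(s1, s2, sb1, sb2, theta_pm=-15.0, theta_pd=3.0):
    """The same rule applied to the summed scores of a read pair;
    the paired thresholds are set below the unpaired ones."""
    total, total_bar = s1 + s2, sb1 + sb2
    return total > theta_pm and total - total_bar > theta_pd
```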
In the next step, all of the collected reads are realigned as follows. First, \(t_{j^{*}}\) is defined as the best type for xi in the target gene, which is obtained by
$$ j^{*} = {\underset{j: t_{j} \in T}{\arg\max}} s_{i,j}. $$
Then, xi is realigned to be consistent with the alignment \(\left (\tilde {x}_{i,j^{*}}, \tilde {t}_{i,j^{*}}\right)\) and the integrated MSA of the target HLA gene.
Bayesian model for analyzing HLA genes
Analysis of the target HLA gene by ALPHLARD is performed using the collected and realigned reads. Let \(\hat {x}_{i}\) be the ith paired (or unpaired) read(s) collected and realigned with the previous procedure, \(\hat {x}_{i,n}\) be the nth base or gap of \(\hat {x}_{i}\), and \(\hat {b}_{i,n}\) be the base quality of \(\hat {x}_{i,n}\). Note that, hereafter, we regard paired reads as one sequence. The probability of mismatch \(\hat {p}_{i,n}\) can be calculated by
$$ \hat{p}_{i,n} = 10^{-\frac{\hat{b}_{i,n}}{10}}. $$
Suppose that \(R^{r}_{1}\) and \(R^{r}_{2}\) are the HLA types of the sample, and that \(S^{r}_{1}\) and \(S^{r}_{2}\) are the true HLA sequences of the sample, which are introduced because the HLA sequences of the sample might not be registered in the reference (IPD-IMGT/HLA) database. Let \(R^{d}_{1}, R^{d}_{2},...,\) be decoy HLA types and \(S^{d}_{1}, S^{d}_{2},...,\) be decoy HLA sequences. These decoy parameters make the analysis robust when reads from non-target homologous regions are misclassified into the target HLA gene at the preceding filtering step. We will sometimes use R1,R2,R3,R4,..., and S1,S2,S3,S4,..., instead of \(R^{r}_{1}, R^{r}_{2}, R^{d}_{1}, R^{d}_{2},...,\) and \(S^{r}_{1}, S^{r}_{2}, S^{d}_{1}, S^{d}_{2},...,\) for convenience. Ii is defined as an indicator parameter specifying which sequence produced the read \(\hat {x}_{i}\); that is, Ii=k means that \(\hat {x}_{i}\) was generated from Sk. Then, the posterior probability of the parameters \(p(\mathcal {R}, \mathcal {S}, \mathcal {I} | \hat {X})\) is given by
$$ p\left(\mathcal{R}, \mathcal{S}, \mathcal{I} | \hat{X}\right) \propto p\left(\hat{X} | \mathcal{S}, \mathcal{I}\right) p\left(\mathcal{R}, \mathcal{S}\right) p\left(\mathcal{I}\right), $$
where \(\hat {X} = \left \{ \hat {x}_{1}, \hat {x}_{2},... \right \}\), \(\mathcal {R} = \left \{ R_{1}, R_{2},..., \right \}\), \(\mathcal {S} = \left \{ S_{1}, S_{2},..., \right \}\), and \(\mathcal {I} = \left \{ I_{1}, I_{2},... \right \}\).
The likelihood function \(p\left (\hat {X} | \mathcal {S}, \mathcal {I}\right)\) is defined by
$$ p\left(\hat{X} | \mathcal{S}, \mathcal{I}\right) = \prod_{i} p\left(\hat{x}_{i} | S_{I_{i}}\right). $$
The likelihood of each read \(p(\hat {x}_{i} | S_{k})\) is given by
$$ p\left(\hat{x}_{i} | S_{k}\right) = \prod_{n} p\left(\hat{x}_{i,n} | S_{k,n}\right), $$
where Sk,n is the nth base or gap of Sk. The likelihood of each base \(p\left (\hat {x}_{i,n} | S_{k,n}\right)\) is calculated by
$${{}\begin{aligned} & p(\hat{x}_{i,n} | S_{k,n} \! \in \! B) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \gamma^{\mathtt{d}}) (1 \! - \! \gamma^{\mathtt{N}}) (1 \! - \! \hat{p}_{i,n}) & (\text{if } \hat{x}_{i,n} \! = \! S_{k,n}) \\ (1 \! - \! \gamma^{\mathtt{d}}) (1 \! - \! \gamma^{\mathtt{N}}) \frac{\hat{p}_{i,n}}{3} & (\text{if }\hat{x}_{i,n} \! \in \! B \text{ and } \hat{x}_{i,n} \! \neq \! S_{k,n}) \\ (1 \! - \! \gamma^{\mathtt{d}}) \gamma^{\mathtt{N}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{N}) \\ \gamma^{\mathtt{d}} & (\text{if }\hat{x}_{i,n} \! = \! \mathtt{-}) \end{array}\right.\!\! {,} \\ & p(\hat{x}_{i,n} | S_{k,n} \! = \! \mathtt{N}) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \gamma^{\mathtt{d}}) (1 \! - \! \gamma^{\mathtt{N}}) \frac{1}{4} & (\text{if } \hat{x}_{i,n} \! \in \! B) \\ (1 \! - \! \gamma^{\mathtt{d}}) \gamma^{\mathtt{N}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{N}) \\ \gamma^{\mathtt{d}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{-}) \end{array}\right. {,} \\ & p(\hat{x}_{i,n} | S_{k,n} \! = \! \mathtt{-}) \\ & \ \ = \left\{\begin{array}{ll} \gamma^{\mathtt{i}} (1 \! - \! \gamma^{\mathtt{N}}) \frac{1}{4} & (\text{if } \hat{x}_{i,n} \! \in \! B) \\ \gamma^{\mathtt{i}} \gamma^{\mathtt{N}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{N}) \\ 1 \! - \! \gamma^{\mathtt{i}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{-}) \end{array}\right. {.} \end{aligned}} $$
Here, γd, γi, and γN are the probabilities of a deletion error, an insertion error, and N, respectively.
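The base-level likelihood can be transcribed directly; the values of γd, γi, and γN below are illustrative, not those used by ALPHLARD.

```python
B = set("ACGT")

def base_likelihood(x, s, p, gamma_d=0.001, gamma_i=0.001, gamma_n=0.001):
    """p(x | s): x is the read base/gap, s is the HLA sequence
    base/gap, and p is the mismatch probability from the base
    quality. The gamma error rates are illustrative placeholders."""
    if s in B:
        if x == s:
            return (1 - gamma_d) * (1 - gamma_n) * (1 - p)
        if x in B:                               # mismatching base
            return (1 - gamma_d) * (1 - gamma_n) * p / 3
        if x == "N":
            return (1 - gamma_d) * gamma_n
        return gamma_d                           # x == '-'
    if s == "N":
        if x in B:
            return (1 - gamma_d) * (1 - gamma_n) / 4
        if x == "N":
            return (1 - gamma_d) * gamma_n
        return gamma_d                           # x == '-'
    # s == '-'
    if x in B:
        return gamma_i * (1 - gamma_n) / 4
    if x == "N":
        return gamma_i * gamma_n
    return 1 - gamma_i                           # x == '-'
```

Note that for each fixed s, the probabilities sum to one over all read outcomes x ∈ {A,C,G,T,N,−}, which is a useful sanity check.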
The prior probability of the HLA types and the HLA sequences \(p(\mathcal {R}, \mathcal {S})\) is defined by
$$ p\left(\mathcal{R}, \mathcal{S}\right) = \prod_{k} p\left(R_{k}\right) p\left(S_{k} | R_{k}\right). $$
Here, \(p(R^{r}_{k})\) is the prior probability of the HLA type, which is calculated using The Allele Frequency Net Database [21]. On the other hand, \(p(R^{d}_{k})\) is the prior probability of the decoy HLA type, which we assume to be constant. The prior probability of the HLA sequence p(Sk|Rk) is given by
$$ p\left(S_{k} | R_{k}\right) = \prod_{n} p\left(S_{k,n} | R_{k,n}\right), $$
where Rk,n is the nth base or gap of Rk in the integrated MSA. The probability of a germline variant p(Sk,n|Rk,n) is calculated by
$${{}\begin{aligned} & p(S_{k,n} | R_{k,n} \! \in \! B) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \delta^{\texttt{d}}) (1 \! - \! \delta^{\texttt{N}}) (1 \! - \! \delta^{\texttt{s}}) & \text{(if \(S_{k,n} \! = \! R_{k,n}\))} \\ (1 \! - \! \delta^{\texttt{d}}) (1 \! - \! \delta^{\texttt{N}}) \frac{\delta^{\texttt{s}}}{3} & \text{(if \(S_{k,n} \! \in \! B\) and \(S_{k,n} \! \neq \! R_{k,n}\))} \\ (1 \! - \! \delta^{\texttt{d}}) \delta^{\texttt{N}} & \text{(if \(S_{k,n} \! = \! \texttt{N}\))} \\ \delta^{\texttt{d}} & \text{(if \(S_{k,n} \! = \! \texttt{-}\))} \end{array}\right. \! {,} \\ & p(S_{k,n} | R_{k,n} \! = \! \texttt{N}) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \delta^{\texttt{d}}) (1 \! - \! \delta^{\texttt{N}}) \frac{1}{4} & \text{(if \(S_{k,n} \! \in \! B\))} \\ (1 \! - \! \delta^{\texttt{d}}) \delta^{\texttt{N}} & \text{(if \(S_{k,n} \! = \! \texttt{N}\))} \\ \delta^{\texttt{d}} & \text{(if \(S_{k,n} \! = \! \texttt{-}\))} \end{array}\right. {,} \\ & p(S_{k,n} | R_{k,n} \! = \! \texttt{-}) \\ & \ \ = \left\{\begin{array}{ll} \delta^{\texttt{i}} (1 \! - \! \delta^{\texttt{N}}) \frac{1}{4} & \text{(if \(S_{k,n} \! \in \! B\))} \\ \delta^{\texttt{i}} \delta^{\texttt{N}} & \text{(if \(S_{k,n} \! = \! \texttt{N}\))} \\ 1 \! - \! \delta^{\texttt{i}} & \text{(if \(S_{k,n} \! = \! \texttt{-}\))} \end{array}\right. {.} \end{aligned}} $$
Here, δs, δd, δi, and δN are the probabilities of a true substitution, a true deletion, a true insertion, and a true N, respectively. Different values are used for these hyperparameters depending on whether Rk,n was imputed from the most similar reference base, which reduces the influence of misimputations. Sk,n tends to become N when its base is ambiguous.
The prior probability of the indicator variables \(p(\mathcal {I})\) is defined by
$$ p(\mathcal{I}) = \prod_{i} p(I_{i}). $$
Here, p(Ii) is the prior probability of the indicator variable, which is calculated by
$${\begin{aligned} p(I_{i}) \propto \left\{\begin{array}{ll} 1 & \text{(if } I_{i} = 1 \text{ or } I_{i} = 2) \\ \epsilon & \text{(otherwise)} \end{array}\right. {.} \end{aligned}} $$
ε reflects how likely the reads are to be produced by non-target homologous regions.
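The normalized indicator prior can be sketched as follows, with an illustrative value of ε:

```python
def indicator_prior(num_decoys, eps=0.01):
    """Prior over I_i for k = 1..(2 + num_decoys): weight 1 for the
    two true HLA sequences, eps for each decoy sequence, then
    normalized. eps = 0.01 is an illustrative value."""
    weights = [1.0, 1.0] + [eps] * num_decoys
    z = sum(weights)
    return [w / z for w in weights]
```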
Efficient sampling with elaborate MCMC schemes
The parameters of the model above are sampled using two Markov chain Monte Carlo (MCMC) schemes, Gibbs sampling and the Metropolis-Hastings algorithm, with parallel tempering to make the parameter sampling efficient. Gibbs sampling is mainly used for local search, and Metropolis-Hastings sampling is periodically used for more global search. For the Metropolis-Hastings algorithm, we constructed two novel proposal distributions that enable the parameters to jump from mode to mode and lead to more efficient sampling.
One of the proposal distributions focuses on positions not covered by any read. First, \(S^{\texttt {N}}_{k}\) is defined as a modified HLA sequence whose bases are replaced with Ns at positions not covered by any read produced by Sk, which is given by
$${\begin{aligned} S^{\texttt{N}}_{k,n} = \left\{\begin{array}{ll} S_{k,n} & \text{(if } \exists i; I_{i} = k \text{ and } \hat{x}_{i} \text{ covers the } n^{\text{th}} \text{ base of } S_{k}) \\ \texttt{N} & \text{(otherwise)} \end{array}\right. {.} \end{aligned}} $$
A candidate HLA type and a candidate HLA sequence are then sampled based on
$$\begin{array}{*{20}l} R^{*}_{k} & \sim p\left(R^{*}_{k} | S^{\texttt{N}}_{k}\right), \\ S^{*}_{k} & \sim p\left(S^{*}_{k} | R^{*}_{k}, \mathcal{I}, \hat{X}\right). \end{array} $$
Then, the acceptance rate r can be calculated based on the Metropolis-Hastings algorithm, which is given by
$$\begin{array}{*{20}l} r & = \min \left(1, r^{*}\right), \\ r^{*} & = \frac{p\left(R^{*}_{k}, S^{*}_{k} | \mathcal{I}, \hat{X}\right) p\left(R^{*}_{k}, S^{*}_{k} \to R_{k}, S_{k} | \mathcal{I}, \hat{X}\right)} {p\left(R_{k}, S_{k} | \mathcal{I}, \hat{X}\right) p\left(R_{k}, S_{k} \to R^{*}_{k}, S^{*}_{k} | \mathcal{I}, \hat{X}\right)} \\ & = \frac{p\left(S^{\texttt{N}}_{k} | R_{k}\right) {\sum\nolimits}_{S} p\left(\hat{X} | S, \mathcal{I}\right) p\left(S | R^{*}_{k}\right)} {p\left(S^{\texttt{N}}_{k} | R^{*}_{k}\right) {\sum\nolimits}_{S} p\left(\hat{X} | S, \mathcal{I}\right) p\left(S | R_{k}\right)}. \end{array} $$
This proposal distribution makes the sampling more efficient when uncovered positions leave ambiguity in the HLA types. For example, let tj and \(t_{j^{\prime }}\) be HLA types that differ by only one mismatch at the nth position. If a sample has tj as an HLA type but no reads from tj cover the nth position, we cannot determine whether the HLA type is tj or \(t_{j^{\prime }}\). However, once Rk becomes \(t_{j^{\prime }}\), Sk,n becomes the nth base of \(t_{j^{\prime }}\) with high probability, which in turn keeps Rk at \(t_{j^{\prime }}\) with high probability, and this cycle repeats. This happens because Rk and Sk are sampled separately in the Gibbs sampling despite their high correlation. The proposal distribution prevents the parameters from getting stuck by sampling them simultaneously.
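For reference, both proposal distributions instantiate the generic Metropolis-Hastings accept/reject rule sketched below; in ALPHLARD the ratio r* is the marginalized expression given above, while the functions here are abstract placeholders.

```python
import math
import random

def mh_step(current, propose, log_target, log_proposal, rng=random):
    """One Metropolis-Hastings update in log space.
    propose(x) draws a candidate, log_target(x) evaluates the
    unnormalized log posterior, and log_proposal(a, b) evaluates
    log q(a -> b). The candidate is accepted with probability
    min(1, r*)."""
    candidate = propose(current)
    log_r = (log_target(candidate) + log_proposal(candidate, current)
             - log_target(current) - log_proposal(current, candidate))
    if math.log(rng.random()) < min(0.0, log_r):
        return candidate  # accept
    return current        # reject
```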
The other proposal distribution swaps non-decoy and decoy parameters. In this proposal distribution, indices for non-decoy and decoy parameters are uniformly sampled, and the HLA types and the HLA sequences at the indices are swapped. After swapping, candidate indicator variables are sampled based on the conditional distribution given the swapped parameters. Suppose that \(\mathcal {R^{*}}\) and \(\mathcal {S^{*}}\) are HLA types and HLA sequences after swapping, and that \(\mathcal {I^{*}}\) is a set of candidate indicator variables. Then, the acceptance rate r can be calculated by
$$\begin{array}{*{20}l} r & = \min \left(1, r^{*}\right), \\ r^{*} & = \frac{p\left(\mathcal{R^{*}}, \mathcal{S^{*}}, \mathcal{I^{*}} | \hat{X}\right) p\left(\mathcal{R^{*}}, \mathcal{S^{*}}, \mathcal{I^{*}} \to \mathcal{R}, \mathcal{S}, \mathcal{I} | \hat{X}\right)} {p\left(\mathcal{R}, \mathcal{S}, \mathcal{I} | \hat{X}\right) p\left(\mathcal{R}, \mathcal{S}, \mathcal{I} \to \mathcal{R^{*}}, \mathcal{S^{*}}, \mathcal{I^{*}} | \hat{X}\right)} \\ & = \frac{p\left(\mathcal{R^{*}}\right) {\sum\nolimits}_{\mathcal{I}} p\left(\hat{X} | \mathcal{S^{*}}, \mathcal{I}\right) p\left(\mathcal{I}\right)} {p\left(\mathcal{R}\right) {\sum\nolimits}_{\mathcal{I}} p\left(\hat{X} | \mathcal{S}, \mathcal{I}\right) p\left(\mathcal{I}\right)}. \end{array} $$
This proposal distribution enables reads from the target HLA gene to be quickly distinguished from reads from non-target homologous regions.
Some procedures are used in the burn-in period to avoid getting stuck in local optima. At the beginning of sampling, a multi-start strategy is used to reduce the influence of the initial parameters: several MCMC runs are carried out, and the initial parameters are sampled from the last parameters of those runs. In addition, reference sequences are periodically copied to the HLA sequences during the burn-in period. This helps find better parameters because there are many local optima in which the parameters of the HLA sequences are twisted as if crossovers had occurred. Although these two approaches do not satisfy the detailed balance condition, they are carried out only in the burn-in period, so the sampled parameters still correctly reflect the posterior distribution.
After sampling the parameters, HLA genotyping is performed by counting \(R^{r}_{1}\) and \(R^{r}_{2}\); we use the most frequently sampled HLA genotype in the MCMC process as the candidate. The HLA sequences of a sample can also be inferred by counting \(S^{r}_{1}\) and \(S^{r}_{2}\). Even though misimputations in the HLA reference sequences can cause mistyping, the HLA sequence itself can still be determined correctly, so both the misimputations and the resulting mistyping can be corrected.
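As a final sketch, the genotype call can be obtained by counting the sampled pairs order-insensitively; the allele names in the comment are illustrative.

```python
from collections import Counter

def most_sampled_genotype(samples):
    """samples: list of (R^r_1, R^r_2) pairs recorded after burn-in.
    Returns the most frequently sampled genotype, treating (a, b)
    and (b, a) as the same genotype."""
    counts = Counter(tuple(sorted(pair)) for pair in samples)
    return counts.most_common(1)[0][0]

# e.g. draws dominated by {A*01:01, A*02:01} yield that pair.
```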