HLA-VBSeq pipeline
An overview of the HLA-VBSeq pipeline for estimating HLA types is described in Figure 1. First, reads obtained by whole-genome sequencing are aligned to the reference genome (GRCh37/hg19) with decoy sequences (hs37d5) with an alignment tool, BWA-MEM [15]. BWA-MEM is robust to sequencing errors and can be applicable to read sequence lengths up to a few megabases [16]. Second, reads aligned to HLA loci (HLA-A, -B, -C, -DM, -DO, -DP, -DQ, -DR, -E, -F, -G, -H, -J, -K, -L, -P, -V, -MIC, and -TAP) and unmapped reads are extracted from the BAM file with SAMtools [17]. For the case of paired-end sequencing data, if one of the paired-end mates is aligned to an HLA locus and the other is not, then both reads of the pair are extracted and used for downstream analyses. Then, the extracted reads are re-aligned to the collection of all the genomic HLA allele sequences in the IMGT/HLA database, in which multiple alignments to the reference sequences for each read are allowed with the "-a" option in BWA-MEM. Here, all the genomic DNA sequences registered in the IMGT/HLA database release 3.15.0, including pseudogenes, are considered. For example, the numbers of registered genomic DNA sequences for HLA-A, -B, -C, - DQA1, -DQB1, and -DRB1 are 126, 168, 118, 27, 18, and 27, respectively. Finally, expected read counts on HLA alleles are estimated by variational Bayesian inference under a statistical framework for HLA typing. Details are described in the following sections.
Optimization of read alignments to HLA allele sequences
From the BAM file constructed above, our goal is to estimate the most appropriate alignments of reads to the allele sequences, and to predict the most likely set of HLA types at the same time. The generative model of read data used in HLA-VBSeq is described in Figure 2. In the statistical framework, an ambiguous alignment of the first and second nucleotide sequences of read n to the reference HLA allele sequence t is treated as the hidden variable Z
nt
, where Z
nt
is an indicator variable and it takes one if read n is generated from allele t, and zero otherwise. Read abundance (depth of coverage, after normalization by the length of the allele sequence) on HLA alleles are treated as a model parameter θ. In the variational Bayesian approach, model parameters are estimated as the posterior distribution. We use the Dirichlet distribution for the prior distribution of the parameter vector θ
where C is a constant, , T is the number of HLA alleles considered, and α
t
> 0 is the hyperparameter, which controls the complexity of model parameters. In our method, we use the uniform hyperparameter α0 and it is selected as a maximizer of the log marginal likelihood of the observed data.
Our goal is to estimate the posterior distributions of θgiven the data. However, this requires integrals over hidden variables and is intractable to compute in closed form. Hence, we use variational Bayesian (VB) inference to obtain the approximation of the full posterior distribution by assuming the factorization of latent variables and model parameters [18]. In the variational Bayesian E (VBE) step, for read n, an expected read count generated from the HLA allele t is calculated based on the current estimate of the abundance parameter as . In the variational Bayesian M (VBM) step, the read abundance on each allele is calculated based on the current estimate of the expected read counts by , where Each step is iterated until a convergence criterion is satisfied (when the read quantities on HLA alleles are no longer updated). Update equations in each step can be calculated similarly as described in the previous work [19].
HLA typing from the optimized read alignment on HLA alleles
After the inference algorithm converges, HLA types are predicted based on the expected number of reads assigned to each allele. Because there exist sequencing errors (substitutions, deletions and insertions against reference sequences) and alignment errors, a threshold for the depth of coverage on HLA alleles is set. In our analysis, we set the threshold as 20% of the depth of coverage (i.e., if the data is 30x on average, then we use 6x for a threhold). For each HLA locus, a diplotype is decided as follows:
• If there is no allele that passes the threshold, then it is considered that there are not enough reads to identify a correct HLA type, and hence no allele is called.
• If there is only one allele that passes the threshold, and the depth of coverage is more than or equal to twice as that of the threshold, then the HLA locus is considered to be homozygous of that HLA allele. If the depth of coverage is less than twice as that of the threshold, then the allele is called as heterozygous.
• If there are two or more alleles that pass the threshold, then alleles are sorted according to the depth of coverage from high to low. The top two alleles are selected as candidates of HLA types. If the depth of coverage of the top one is more than twice as that of the second one, then the HLA locus is set as homozygous of the top one. Otherwise, the HLA locus is predicted as a diplotype of the top and second one.
Performance measure of HLA typing
Prediction performance is evaluated in terms of the prediction accuracy. In our analysis, the prediction accuracy is defined as the fraction of true positive predictions among the true HLA types. In this simulation experiment, two HLA alleles (either heterozygous or homozygous) for six HLA loci (HLA-A, -B, -C, -DQA1, -DQB1 and -DRB1), or 12 HLA alleles in total, are evaluated for each individual. The prediction performance is evaluated separately at the 2-digit, 4-digit, 6-digit and 8-digit resolution for each method.