High-sensitivity HLA typing by Saturated Tiling Capture Sequencing (STC-Seq)

Background Highly polymorphic human leukocyte antigen (HLA) genes are responsible for fine-tuning the adaptive immune system. High-resolution HLA typing is important for the treatment of autoimmune and infectious diseases. Additionally, it is routinely performed for identifying matched donors in transplantation medicine. Although many HLA typing approaches have been developed, the complexity, low-efficiency and high-cost of current HLA-typing assays limit their application in population-based high-throughput HLA typing for donors, which is required for creating large-scale databases for transplantation and precision medicine. Results Here, we present a cost-efficient Saturated Tiling Capture Sequencing (STC-Seq) approach to capturing 14 HLA class I and II genes. The highly efficient capture (an approximately 23,000-fold enrichment) of these genes allows for simplified allele calling. Tests on five genes (HLA-A/B/C/DRB1/DQB1) from 31 human samples and 351 datasets using STC-Seq showed results that were 98% consistent with the known two sets of digitals (field1 and field2) genotypes. Additionally, STC can capture genomic DNA fragments longer than 3 kb from HLA loci, making the library compatible with the third-generation sequencing. Conclusions STC-Seq is a highly accurate and cost-efficient method for HLA typing which can be used to facilitate the establishment of population-based HLA databases for the precision and transplantation medicine. Electronic supplementary material The online version of this article (10.1186/s12864-018-4431-5) contains supplementary material, which is available to authorized users.


Background
The human leukocyte antigen (HLA) complex is located on chromosome 6p21 which encodes major histocompatibility complex (MHC) proteins involved in immune functions [1,2]. The highly polymorphic HLA class I (A, B and C) and II (DRB1 and DQB1) genes are crucial in immune rejection of transplantations, immune response to infections, pathogenesis of autoimmune diseases, adverse reactions to medications and cancer development [3][4][5].
Thus, identifying HLA polymorphisms, also called HLA typing, is clinically important.
According to the IMGT/HLA database [6], there are over 3600 alleles for HLA-A, 4400 alleles for HLA-B, 3200 alleles for HLA-C and 1900 alleles for HLA-DRB1. Additionally, the coding sequences of HLA genes of the same class are highly homologous. Many methods for high-resolution typing (two-field resolution, proteincoding variant) of HLA-A/B/C/DRB1/DQB1 have been successfully established, such as sequence-specific oligonucleotide probes (SSOP), sequence-specific primers (SSP) and Sanger-sequencing-based typing (SBT) [7][8][9][10][11][12][13]. For high-resolution typing (the first and second field), these methods involve iterative procedures that start with low-resolution typing followed by additional characterizations. Consequently, these methods are both time and labor intensive prohibiting them for highthroughput processing. In addition, there are shortcomings with Sanger sequencing-based approaches because variants in an amplicon cannot be phased leading to cis/ trans ambiguities [14,15].
Next-generation sequencing (NGS) technology developed in the last decade has been widely used in medicine [16]. NGS has advantages of providing single DNA molecule sequence data in a high-throughput manner which allows for highly confident HLA allele determination [17,18]. NGS-based HLA typing methods rely on either amplicon-based or hybridization-based enrichment of HLA loci followed by massively parallel sequencing [19]. Amplicon-based capture is laborious and requires extensive PCR optimization; hybridization-based enrichment requires an expensive, high-quality probe-pool to cover all of the allelic variations of the targeted HLA genes. These drawbacks make it difficult for large-scale HLA typing.
Here, we used low-cost on-chip long HLA cDNA fragments as baits to capture the coding regions of 14 HLA class I and II genes (HLA-A/B/C/DPA1/DPB1/ DQA1/DQB1/DRA/DRB1/DRB3/DRB4/DRB5/E/G). The use of high-density on-chip baits allowed us to capture the coding regions of the HLA genes with very high coverage. This advantage improves the accuracy of HLA typing for five genes (HLA-A/B/C/DRB1/DQB1) with a novel high-performance analysis pipeline compared to a previously reported hybridization-based NGS HLA typing approach [20].
We also used a 120 bp fragment which has only 85% matched base with an on-chip probe to mimic the lowest homologous exon (B*73:01 exon 5) shown in Fig. 1b. As expected, the DNA fragment can be efficiently pulled down by the capture chip (Fig. 1c). Previous work has shown that 20 bp to 150 bp complementary sequences can efficiently support hybridization [21][22][23]. Next, we checked every base of all IMGT documented HLA alleles of the 14 genes. We found that every base of the exons (>23 bp) of all documented HLA alleles of 14 HLA genes can be covered by at least one DNA fragment that had enough complementary sequence with an on-chip bait to enable hybridization capture (Fig. 1d, Additional file 3: Figure S3).
We tested the hybridization capture efficiency of STC on 14 HLA genes using 31 samples. After capture, the target regions were enriched, on average, by 23,038-fold and, on average, 73% of bases (range 37.4-87.8%) inside the coding region were covered by mapped reads (only the positions of the first base of the mapped reads were counted) at a 0.25 M sequencing depth (Fig. 2a). By contrast, a previously reported bead-based oligonucleotide capture system only showed a 700-fold enrichment [20] and, on average, 54.5% bases (range 39.8-75.6%) of the coding regions had mapped reads at a similar sequencing depth (Fig. 2a). These data indicate that the extraordinary length (200-1500 bp) and a high number of the on-chip double-stranded baits make it possible to acquire high complexity data for the captured HLA DNA fragments.
The high diversity of mapped reads in the HLA regions provides information about the genetic linkage of polymorphisms. Therefore, it is easy to exclude most alleles with abnormal mapping, which makes the subsequent HLA allele calling more straight-forward. To perform HLA calling, we used the base coverage information and distribution of the first bases of the mapped reads to perform two rounds of screening. We first removed the alleles that did not have complete base coverage inside large exons (> = 70 bp) by a 70 bp continuously aligned region of any sequencing read. Next, we used a window (22 bp by default) to scan the coding region of the remaining alleles and count the number of mapped reads inside each window (only the positions of the first base of the mapped reads were counted). We removed the alleles that contained a window with zero mapped reads, but the corresponding window of any other allele(s) of the same gene had mapped reads. After these two rounds of filtration, in the majority of cases, less than 100 candidate alleles remained for the 14 genes. To further narrow the candidate alleles, we used a genotyping strategy which is based on the hypothesis that the correct genotype could maximally explain the mapped reads on the remaining candidate alleles. In the first step, we randomly paired the remaining alleles of the same gene, counted the relative amounts of unique mapped reads between the two alleles (see details in Methods) and removed the allele(s) that had 15-fold fewer unique mapped reads compared to their partners. Then, we randomly paired the remaining alleles of the same gene and reported the allele pair(s) that maximally explained the mapped reads of remaining alleles of same genes together with the correct alleles of the other HLA genes (Fig. 2b) (for details, see the Methods).
We tested STC-seq analysis pipeline on 382 datasets (31 datasets are generated by STC-Seq, and 351 datasets are from a previously published work [20]) with known allele types for the HLA-A, B, C, DQB1 and DRB1 genes. The results for all five HLA genes were consistent (98% correct, 2% incorrect, 6.3% ambiguity) (Fig. 2c). We also tested a previously reported algorithm,  HLAssign, on STC-seq datasets. However, its results were less consistent. (88.7% correct, 11.3% incorrect, 54.2% ambiguity) (Fig. 2c). Because our allele-calling pipeline considered the interference of multi-gene mapped reads, STC-seq reported significantly fewer ambiguous allele combinations than HLAssign on the sequencing data of both STC-Seq and HLAssign [20] ( Fig. 2c, Additional file 4: Table S1). We also checked the correlation between the percentage of exon bases having mapped reads and HLA typing accuracy. We found that approximately 29% of exon bases with mapped reads was the threshold below which the typing accuracy dropped dramatically (Fig. 2d, Additional file 4: Table S1). STC-seq requires 0.1 M reads (100 bp singleend) to reach this threshold for typing the 14 HLA genes, whereas the HLAssign needs 0.2 M reads to reach the same threshold (Fig. 2d). In summary, because of improved capture efficiency and the high performance of the analysis pipeline, the HLA typing accuracy of STC-Seq is better than HLAssign, a hybridization-based NGS approach (Fig. 2c, d).
The lengths of the highly homologous regions between HLA genes can extend beyond the read-length of most NGS platforms. This causes many diploid ambiguities and is a significant drawback of NGS-based HLA typing approaches. Third-generation sequencing platforms, such as PacBio SMRT Sequencing, deliver single-molecule observations with long reads that are capable of spanning the majority of HLA class I and II genes. Direct sequencing of full-length HLA genes can provide directly phased and high-resolution results [24]. Because of the high similarity between STC long baits and the corresponding HLA genes, STC can capture an integral HLA locus for thirdgeneration sequencing-based HLA typing. As a proof of principle, we used STC baits for the HLA-A, B and C genes to capture their targets from a 3-10 kb fragmented human genomic DNA (Fig. 3a). As expected, STC successfully enriched (by more than 2000-fold) the integral loci of the HLA-A, B and C genes (Fig. 3b).

Discussion
In this work, we developed an NGS-based high-resolution HLA genotyping workflow (STC-Seq) and HLA allele analysis pipeline. Instead of conventional in-solution hybridization capture approaches, STC-Seq uses long double-stranded HLA cDNA fragments as baits. STC-Seq provided an average 23,038-fold enrichment of the coding regions of the 14 HLA genes. We also presented a highresolution HLA-calling pipeline that reached 98% consistency and 6.3% ambiguity for analyzing the HLA-A/ B/C/DRB1/DQB1 loci on STC-seq datasets and previously published NGS datasets [20] (Fig. 2c). This HLA-calling pipeline also outputs high-resolution results for the HLA-DQA1/DRA/DPA1/DRB1/DRB3/DRB4/DRB5/E/G genes captured and sequenced by STC-Seq.
Recently, several NGS-based HLA class I and II typing methods have been reported [25]. They all require enrichment of HLA DNA fragments over the genomic background before sequencing. Although amplicon-based enrichment is a widely used approach [26,27], the amplicon size is restricted due to the short NGS reads. Tedious primer optimization steps are required to improve coverage and avoid co-amplification of pseudogenes and highly homologous HLA loci [20]. Additionally, amplification bias and target dropouts occur frequently [28,29]. In this work, we chose a hybridization-based approach that can fit a wide range of target size and has greater flexibility in adding new target genes. Based on the characterization of the probe and capture matrix, there are three documented hybridization-based capture methods: i) beadbased enrichment using biotin-labeled oligonucleotide baits [30,31]; ii) solid-based enrichment using oligonucleotide baits [32,33]; and iii) solid-based enrichment method using large gDNA or cDNA fragments [34][35][36]. The enrichment of targeted HLA regions is moderate using oligonucleotide baits, as reported by a previous work [20], and the cost of beads and in vitro synthesized oligonucleotide baits is very high. To allow highresolution HLA-typing of large populations at 5-10% cost of current common hybridization-based and amplicon-based NGS approaches (for measurement details, see the Methods), we developed STC-Seq to capture and sequence the coding region of 14 HLA genes using large CDS fragment as baits. Previous work has shown that 20 to 150 bp complementary sequences can efficiently generate hybridization signals [21][22][23], which explains why the long CDS fragments of one HLA allele could serve as a universal bait to efficiently pull down the corresponding homologous alleles. Moreover, long double-stranded baits can acquire a high diversity of captured reads, which contributes to the improvement of the HLA typing accuracy comparing to a hybridization-based NGS HLA typing approach [20]. Currently, there are nine documented HLA null alleles because of their special intronic sequences. These intronic sequences are all in the 10 bp regions of intron/exon boundaries and can be efficiently pulled down by STC on-chip probes (Additional file 5: Table  S2). We also provided a script to ID these null alleles in the analysis pipeline (for details, see Methods). Current NGS HLA typing approaches are not effective in identifying novel alleles due to their short sequence reads [20]. The read lengths of third-generation singlemolecule sequencing can reach 20 kb with high quality [37][38][39]. Because the genomic loci of class I and II HLA alleles are between approximately 1 kb and 17 kb [40], third-generation single-molecule sequencing can directly provide HLA allele-level resolution and should be the ultimate solution to identify novel HLA alleles [41]. Although PCR amplicon approaches have been successfully tested for HLA-A, -B and -C in third-generation sequencing [41], the potential risk of long-template PCR artifacts (i.e., chimeras, mutations and drop-off) is high [42][43][44]. We reason that the double-stranded long baits of STC-seq should be a better enrichment approach for third-generation sequencing-based HLA typing.

Conclusions
In summary, we developed a high-resolution, low-cost and highly accurate HLA typing pipeline, STC-seq, that does not require the expensive reagents of hybridizationbased enrichment (i.e., beads and oligonucleotide baits) or laborious steps of amplicon-based enrichment. These advantages of STC-Seq can significantly facilitate the establishment of population-based HLA databases for the precision and transplantation medicine.

Sample
Twenty-six genomic DNA (gDNA) samples with known first two sets of digitals (field1 and field2) genotypes and five genomic DNA (gDNA) samples with known first three sets of digitals genotypes of HLA-A/B/C/DRB1/ DQB1 were obtained from the China Marrow Donor Program (CMDP).

NGS library preparation
Genomic DNA samples were sonicated to an average fragment size of 300 bp. One-hundred nanograms of fragmented gDNA was used for DNA library construction with the NEBNext Ultra II DNA Library Prep Kit (E7645) according to the manufacturer's protocol. These bait fragments were denatured in 0.5 M NaOH at room temperature for 20 min. To prepare the capture chip, denatured baits were applied to a 40 mm 2 Nylon chip and vacuum dried at 80°C for 1 h. The capture chip can be stored in dry conditions at room temperature for at least 3 months.

Hybridization capture of HLA fragments and NGS sequencing
Whole genomic DNA libraries (1 μg each) were mixed with the adaptor blockers and 5 μg of Cot-I DNA. The DNA mixture was denatured at 95°C for 5 min and then snap cooled on ice immediately. Next, the denatured DNA mixture and capture chip were transferred into hybridization solution (6xSSC, 1% SDS, 5× Denhardt's Solution). Hybridization was performed at 65°C for 4 h. After hybridization, the capture chip was washed with 2× SSC and 0.1% SDS for 5 min and 0.2× SSC and 0.1% SDS for 2 × 5 min at 55°C. The captured DNAs were eluted with 100 μl of TE at 95°C for 10 min and purified by using a PCR clean-up kit (Qiagen). The eluted DNAs were subjected to 15 cycles of PCR amplification using the Illumina P5 and P7 primers and subjected to another round of hybridization capture with the same conditions.
After enrichment, DNA samples were sequenced on the Illumina HiSeq 2500 platform with the single-end 100 bp model.

Capture of integral loci of HLA genes
Genomic DNA was briefly fragmented to produce fragments larger than 3 kb. Five-hundred nanograms of fragmented gDNA was used as the input for captureenrichment according to the method described above. After elution, RT-PCR was performed.

Capture of B*73:01 exon5
The sequence of B*73:01 exon5 was synthesized by Synbio Biotechnologies. B*73:01 exon5 was first mixed with fragmented genomic DNA with a known genotype (HLA-A*01:01:01). Next, the mixture was constructed to an NGS library and captured by STC chip according to the method described above. After enrichment, RT-PCR was performed.

Cost measurement of STC-Seq and other common NGS HLA typing approaches
The capture chip cost of STC-Seq is 0.5 USD/sample. The NGS library cost is 10 USD per sample (en.vazyme.com). So the cost of STC-Seq is around 16  HLA allele-calling pipeline (1)Removing alleles with insufficient coverage: 1. We first generate all possible artificial reads (70 bp) using the large exons (> = 70 bp) of all of the alleles of the 14 HLA genes from the IMGT/HLA database. 2. After removing PCR duplications, we converted the sequencing reads to FASTA format and used them as a mapping reference. 3. We aligned the artificial reads against the mapping reference using bowtie (bowtie -S -k 1 -best -p 20 -solexa-quals -v 0). 4. We removed the allele(s) for which the distance of any adjacent mapped artificial reads was more than 70 bp in an exon. (2)Initial screening: 5. The sequencing reads (after removing PCR duplicates) were mapped to the coding regions of the remaining HLA alleles using bowtie (bowtie -S -a -p 20 -solexa-quals -v 0). 6. We used the remaining alleles of the same genes to build matrices in which the columns are the allele names and rows are the base positions of the longest allele of the gene. The position of a base was filled with "1" if there were mapped reads (only the position of the first base of a mapped read was counted); otherwise, the position was filled with "0". The null position of any allele was marked with "1" if the corresponding position of any other allele of same gene had a mapped read. 7. We used a window (22 bp by default) to scan the coding region of every allele and count the number of mapped reads (100 bp continuously mapped) in each window. 8. In the same matrix, we removed the allele(s) for which there was a window without mapped reads (100 bp continuously mapped) if any other allele had mapped reads. , only sequencing reads with continuously aligned length reads > = 70 bp were regarded as mapped reads. 10.We pooled the alleles with > = 6 unique mapped reads (only mapped to one allele of the 14 + H&Y HLA genes) and considered them to be the "real Availability of data and materials All data are available on request. Dr. Junling Jia (junlingjia@zju.edu.cn) can be contacted to obtain the data. All raw data of sequencing has been deposited in BIGD (http:// bigd.big.ac.cn/) Genome Sequence Archive (GSA) with an accession number CRA000645.
Authors' contributions YJ, YL, DJ, JZ and LW prepared the STC capture-chips and performed hybridization capture experiments and Next-generation sequencings. RL, YJ, CW, YD and XX designed HLA typing algorithms, wrote HLA typing scripts and performed data analysis. JZ, MZ and JJ designed STC capture-chips and HLA capture experiments, and wrote the manuscript. All authors have read and approved the manuscript

Ethics approval
Our work did not involve any human and animal tissue. This study has been approved by the ethics committee of the First Affiliated Hospital, School of Medicine, Zhejiang University. Dan Du from the China Marrow Donor Program gave the permission for this study to use their validated DNA samples.