A phased SNP-based classification of sickle cell anemia HBB haplotypes

Background Sickle cell anemia causes severe complications and premature death. Five common β-globin gene cluster haplotypes are each associated with characteristic fetal hemoglobin (HbF) levels. As HbF is the major modulator of disease severity, classifying patients according to haplotype is useful. The first method of haplotype classification used restriction fragment length polymorphisms (RFLPs) to detect single nucleotide polymorphisms (SNPs) in the β-globin gene cluster. This is labor intensive, and error prone. Methods We used genome-wide SNP data imputed to the 1000 Genomes reference panel to obtain phased data distinguishing parental alleles. Results We successfully haplotyped 813 sickle cell anemia patients previously classified by RFLPs with a concordance >98%. Four SNPs (rs3834466, rs28440105, rs10128556, and rs968857) marking four different restriction enzyme sites unequivocally defined most haplotypes. We were able to assign a haplotype to 86% of samples that were either partially or misclassified using RFLPs. Conclusion Phased data using only four SNPs allowed unequivocal assignment of a haplotype that was not always possible using a larger number of RFLPs. Given the availability of genome-wide SNP data, our method is rapid and does not require high computational resources.


Background
Sickle cell anemia affects millions worldwide and is associated with high morbidity and mortality [1]. The concentration of fetal hemoglobin (HbF) is the main pathophysiological modulator [2]. Five major haplotypes of the β-globin gene (HBB) cluster are associated with different levels of HbF [3,4]. Patients with the highest HbF generally have the mildest disease [5,6]. Therefore, classification of patients' haplotype is useful for prognostic purposes and for studying the genetic differences that contribute to the HbF variability among these haplotypes.
Haplotypes of sickle cell anemia were first ascertained by analysis of restriction fragment length polymorphisms (RFLPs) in the HBB gene cluster [7]. This classification was based on detecting whether or not cleavage occurred at five to eight restriction sites when DNA was digested with restriction endonucleases, as shown in Fig. 1 [8][9][10]. This method is time-consuming and can lead to error [10]. Fluorescence resonance energy transfer coupled with high-resolution melting (HRM) assay is another method to classify sickle cell haplotypes, but it is also labor intensive and requires multiple laboratory assays [10]. Neither method is capable of differentiating between parental and maternal alleles in an individual so that without informative genetic data from family members, the phasing of restriction patterns is not possible, and in many cases ascertainment of a haplotype is either equivocal or impossible. We used genome-wide association study (GWAS) data imputed to a reference panel to obtain a phased output. The phased GWAS data allowed assigning SNPs to parental chromosomes, which facilitated the classification procedure using fewer SNPs.

Haplotype classification
GWAS data were available for patients with sickle cell anemia from the Cooperative Study of Sickle Cell Disease (CSSCD) [11]. SNP array data containing 588,451 markers were evaluated using PLINK to identify and remove SNPs with minor allele frequency (MAF) < 0.01, that violated Hardy-Weinberg Equilibrium (HWE), and had more than 0.05 missing genotype information [12]. Genotypes for a total of 560,170 SNPs were imputed using the Michigan Imputation Server [13], the 1000 Genomes Phase 3 v5 reference panel, and the Eagle phasing algorithm to obtain phased output [14,15]. We developed a Python script based on VCF and PYSAM Python modules to read SNP information and assign the haplotype accordingly [16,17]. Code and an example are available on GitHub (https://github.com/eshaikho/haplo-typeClassifier). We used this script to classify 1394 samples that were previously classified by RFLP in the CSSCD. We selected four SNPs (rs3834466, rs28440105, rs10128556, and rs968857) which define all of the haplotypes spanning the β-globin gene cluster ( Table 1).

Calculation of HbF average per haplotype
To check the consistency of classification and average HbF for each haplotype, we used samples with available HbF level information including 559 of the 813 samples that were successfully classified with RFLPs, 916 of the samples classified with SNP-based methodology, and 252 of samples that were either partially classified or failed classification

Classification of haplotypes in Saudi sickle cell anemia patients and in a library of sickle cell anemia induced pluripotent stem cells (iPSCs)
Since CSSCD patients are mostly African American, we tested our method using data obtained from sickle cell anemia patients from the Eastern and Southwestern   [2,19]. To further test our method on a mixed population of diverse ethnicity we reclassified haplotypes originally ascertained using RFLPs in a library of sickle cell anemia-derived iPSCs [20].

Haplotype classification
Of

Calculation of HbF average per haplotype
The average haplotype HbF level of patients classified with our method is consistent with average HbF in haplotypes reported in literature based on RFLPs (Table 3) [5,6]. The average haplotype HbF for samples unclassifiable using RFLPs, but classified using phased SNP data matched the average HbF for each known haplotype ( Table 3). Boxplots of the most common five haplotypes in the CSSCD cohort show the consistency of HbF levels across the three classification groups (Fig. 2).    (Table 3). In a library of sickle cell anemia iPSCs there was high concordance between the two methods of haplotype ascertainment. The only discordance was in two patients classified originally as BEN/BEN that according to SNPbased reclassification were CAM/BEN and CAR/SEN (Table 4). Importantly, we were able to assign a haplotype to 15 of 17 iPSC samples that were classified as either atypical or were indeterminate using RFLPs.

Discussion
In adults, homozygotes for the BEN, CAR, and CAM haplotypes were associated with HbF of 5-7% of total hemoglobin; SEN and AI haplotypes had HbF levels of about 10% and 20%, respectively. Using GWAS data we were able to classify with high accuracy and time efficiency the haplotype of sickle cell anemia patients using four SNPs. The primary feature of our classification method is a phasing step after genotype imputation where SNP alleles are assigned to parental chromosomes, and the haplotype of each chromosome is assigned independently. This method was superior to ascertaining haplotype by RFLP using unphased SNPs at five sites and was successfully applied in a few seconds on a personal computer.
Haplotyping errors can occur using the SNP-based method because of the SNP genotyping platform, imputation errors, and ambiguities arising from phasing algorithms. Nevertheless, in African-origin patient samples, we were able to achieve a concordance of 99% percent (805/813) between 4-SNP haplotypes derived from a phasing algorithm using GWAS data with 5-SNP haplotypes determined using restriction analysis. Haplotype assignment in a sickle cell anemia iPSC library also showed high concordance and demonstrated the efficiency of SNP-based method to classify samples that failed RFLP classification. The discordance between SNP-based and RFLP ascertainment most likely resulted from errors in the RFLP classification that is sensitive to the presence of other SNPs in the restriction sites and the vagaries of restriction enzyme analysis and Southern blotting that was used for haplotype analysis in the CSSCD.
The major discordance between RFLP and SNP-based analysis for classification of Saudi Southwestern patients occurred among 11 subjects classified as CAR/CAR. Eight of these 11 were reclassified as BEN/BEN and three as BEN heterozygotes. The only difference between CAR and BEN haplotypes is the SNP rs968857 at the HincII site 5′ of HBD (Fig. 1, Table 2). It is most likely that this discrepancy was a result of an error in RFLP analysis. If the discordance was due to imputation quality, the error rate would probably match the imputation error. There is 100% discordance at this HincII site while the imputation quality score of rs968857 is R 2 = 0.99. One patient with HbF of 20.4% originally as SEN/SEN by RFLP was reclassified as AI/AI.
To investigate the discordance in Southwestern Province patients, we examined the genotype data of the HBB gene cluster downstream of OR51V1 (5′ olfactory receptor gene cluster) and upstream of OR51B4 (3′ olfactory gene cluster) in both patient groups. The SNP genotypes of the 11 CAR/CAR that we reclassified as BEN/BEN or BEN heterozygous had the same SNP genotype of BEN/BEN patients that were classified as such with both methods. The genotype data, average HbF, and the imputation quality score of rs968857 suggest that the high discordance in Southwestern Saudi patients is due to RFLP errors.
A limitation of our method is the dependency on the availability of GWAS data for many SNPs in the β-  [20] globin gene region. However, many large patient cohorts have been genotyped using genome-wide SNP arrays. In these patients, haplotype information might be useful as a covariate in a genetic risk analysis. RFLP analysis might be suitable for a small number of patients but requires optimization of all of the individual assays. The main advantage of our haplotype determination method is the rapid classification and high accuracy. This method can also be used for whole genome sequence data classification after SNP calling and phasing. Moreover, it is not sensitive to SNPs that alter the restriction enzyme recognition sequence that can lead to error using RFLPs.

Conclusion
Phased data using only four SNPs allowed unequivocal assignment of a β-globin gene cluster haplotype that was not always possible using a larger number of RFLPs, and was also more accurate. With the availability of genomewide SNP data our method is rapid and does not require high computational resources.