- Methodology article
- Open Access
A phased SNP-based classification of sickle cell anemia HBB haplotypes
BMC Genomics volume 18, Article number: 608 (2017)
Sickle cell anemia causes severe complications and premature death. Five common β-globin gene cluster haplotypes are each associated with characteristic fetal hemoglobin (HbF) levels. As HbF is the major modulator of disease severity, classifying patients according to haplotype is useful. The first method of haplotype classification used restriction fragment length polymorphisms (RFLPs) to detect single nucleotide polymorphisms (SNPs) in the β-globin gene cluster. This method is labor intensive and error prone.
We used genome-wide SNP data imputed to the 1000 Genomes reference panel to obtain phased data distinguishing parental alleles.
We successfully haplotyped 813 sickle cell anemia patients previously classified by RFLPs with a concordance >98%. Four SNPs (rs3834466, rs28440105, rs10128556, and rs968857) marking four different restriction enzyme sites unequivocally defined most haplotypes. We were able to assign a haplotype to 86% of samples that were either partially or misclassified using RFLPs.
Phased data using only four SNPs allowed unequivocal assignment of a haplotype that was not always possible using a larger number of RFLPs. Given the availability of genome-wide SNP data, our method is rapid and does not require high computational resources.
Sickle cell anemia affects millions worldwide and is associated with high morbidity and mortality. The concentration of fetal hemoglobin (HbF) is the main pathophysiological modulator. Five major haplotypes of the β-globin gene (HBB) cluster are associated with different levels of HbF [3, 4]. Patients with the highest HbF generally have the mildest disease [5, 6]. Therefore, classification of patients’ haplotype is useful for prognostic purposes and for studying the genetic differences that contribute to the HbF variability among these haplotypes.
Haplotypes of sickle cell anemia were first ascertained by analysis of restriction fragment length polymorphisms (RFLPs) in the HBB gene cluster. This classification was based on detecting whether or not cleavage occurred at five to eight restriction sites when DNA was digested with restriction endonucleases, as shown in Fig. 1 [8,9,10]. This method is time-consuming and can lead to error. Fluorescence resonance energy transfer coupled with a high-resolution melting (HRM) assay is another method to classify sickle cell haplotypes, but it is also labor intensive and requires multiple laboratory assays. Neither method can differentiate between paternal and maternal alleles in an individual. Without informative genetic data from family members, the restriction patterns cannot be phased, and in many cases ascertainment of a haplotype is either equivocal or impossible. We used genome-wide association study (GWAS) data imputed to a reference panel to obtain a phased output. The phased GWAS data allowed us to assign SNP alleles to parental chromosomes, which made classification possible with fewer SNPs.
GWAS data were available for patients with sickle cell anemia from the Cooperative Study of Sickle Cell Disease (CSSCD). SNP array data containing 588,451 markers were evaluated using PLINK to identify and remove SNPs that had a minor allele frequency (MAF) < 0.01, that violated Hardy-Weinberg equilibrium (HWE), or that had a missing genotype rate above 0.05. Genotypes for a total of 560,170 SNPs were imputed using the Michigan Imputation Server, the 1000 Genomes Phase 3 v5 reference panel, and the Eagle phasing algorithm to obtain phased output [14, 15]. We developed a Python script based on the PyVCF and pysam Python modules to read SNP information and assign the haplotype accordingly [16, 17]. Code and an example are available on GitHub (https://github.com/eshaikho/haplotypeClassifier). We used this script to classify 1394 samples that were previously classified by RFLP in the CSSCD. We selected four SNPs (rs3834466, rs28440105, rs10128556, and rs968857) which define all of the haplotypes spanning the β-globin gene cluster (Table 1).
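The MAF and missingness filters described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the function names and the genotype encoding (alternate-allele counts 0, 1, 2, with `None` for a missing call) are assumptions made for illustration, and the HWE filter is left out because the text does not state the threshold that was used.

```python
from typing import Optional, Sequence, Tuple


def maf_and_missing_rate(genotypes: Sequence[Optional[int]]) -> Tuple[float, float]:
    """Return (minor allele frequency, missing rate) for one SNP.

    Each genotype is the alternate-allele count (0, 1 or 2), or None if missing.
    This encoding is an illustrative assumption, not PLINK's file format.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return 0.0, missing_rate
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq), missing_rate


def passes_qc(genotypes: Sequence[Optional[int]],
              min_maf: float = 0.01,
              max_missing: float = 0.05) -> bool:
    """Keep a SNP only if MAF >= 0.01 and missing rate <= 0.05 (thresholds from the text)."""
    maf, missing = maf_and_missing_rate(genotypes)
    return maf >= min_maf and missing <= max_missing


# Made-up genotype vectors, for illustration only.
print(passes_qc([0, 1, 2, 1, 0]))     # common SNP, fully called
print(passes_qc([0] * 100))           # monomorphic SNP, fails the MAF filter
print(passes_qc([0, 1, None, None]))  # half the calls missing, fails missingness
```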
Calculation of HbF average per haplotype
To check the consistency of classification and average HbF for each haplotype, we used samples with available HbF level information, including 559 of the 813 samples that were successfully classified with RFLPs, 916 of the samples classified with the SNP-based methodology, and 252 of the samples that were either partially classified or failed classification with RFLPs. We calculated the average HbF level for each haplotype using the psych R package, and generated a boxplot for the most common haplotypes (Benin homozygotes [BEN/BEN], Benin/Central African Republic compound heterozygotes [BEN/CAR], Benin/Senegal compound heterozygotes [BEN/SEN], Benin/Cameroon compound heterozygotes [BEN/CAM], Central African Republic homozygotes [CAR/CAR]) in this cohort to show the consistency of HbF levels across the three groups (five-RFLP classification, SNP-based classification, and the group that failed five-RFLP classification but could be classified with the SNP-based method).
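The per-haplotype averaging step can be sketched in a few lines of Python. This is a stand-in for illustration, not the psych R code used in the study; the function name is hypothetical and the HbF values are made up.

```python
from collections import defaultdict
from statistics import mean


def mean_hbf_by_haplotype(records):
    """Group (haplotype, HbF %) pairs by haplotype and return the mean HbF of each group."""
    groups = defaultdict(list)
    for haplotype, hbf in records:
        groups[haplotype].append(hbf)
    return {haplotype: mean(values) for haplotype, values in groups.items()}


# Made-up HbF percentages, for illustration only (not study data).
example = [("BEN/BEN", 6.0), ("BEN/BEN", 8.0), ("CAR/CAR", 5.0)]
for haplotype, avg in mean_hbf_by_haplotype(example).items():
    print(f"{haplotype}: mean HbF = {avg:.1f}%")
```

Running the same function separately on each of the three groups would give the per-group means that the boxplots compare.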
Classification of haplotypes in Saudi sickle cell anemia patients and in a library of sickle cell anemia induced pluripotent stem cells (iPSCs)
Since CSSCD patients are mostly African American, we tested our method using data obtained from sickle cell anemia patients from the Eastern and Southwestern Provinces of Saudi Arabia. Eastern Province patients tend to have the autochthonous Arab Indian (AI) haplotype as the major haplotype, while Southwestern Province patients mostly have the BEN haplotype that was introduced from Africa. The HbF level in Saudi Benin patients is twice as high as in African American patients with this haplotype [2, 19]. To further test our method on a mixed population of diverse ethnicity, we reclassified haplotypes originally ascertained using RFLPs in a library of sickle cell anemia-derived iPSCs.
Of 371 CSSCD patients classified as BEN/BEN using five RFLPs, we achieved a concordance of 98% (367/371) using four phased SNPs. We achieved >99% concordance for patients classified as BEN/CAR using RFLPs. For BEN/SEN, BEN/CAM, CAR/CAR, CAR/SEN, CAR/CAM, CAR/AI, SEN/SEN, SEN/CAM, SEN/AI, and CAM/CAM haplotypes our concordance with the RFLP method was 100%, although the numbers of patients in each category were smaller. Two patients classified originally as BEN/AI failed reclassification (Table 2). Discordance between our method and the five-RFLP method occurred in only eight of 813 patients, giving an overall concordance rate > 99%. Two patients classified as BEN/AI with RFLP were reclassified as UNKNOWN/SEN. Four patients classified as BEN/BEN were reclassified as UNKNOWN/BEN, CAR/CAR, CAM/BEN, and SEN/BEN. The last two patients were classified as CAR/SEN and SEN/BEN with our method instead of BEN/CAR according to RFLP analysis. Importantly, we were able to assign a haplotype to 86% (343/395) of samples from the CSSCD that failed classification using RFLPs.
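The percentages quoted above follow directly from the reported counts. The short Python check below recomputes them; the helper name and the dictionary labels are illustrative, while the counts are the ones given in the text.

```python
def concordance_pct(agree: int, total: int) -> float:
    """Percentage of samples on which the two methods agree."""
    return 100.0 * agree / total


# Counts as reported in the text.
reported = {
    "BEN/BEN (RFLP vs four phased SNPs)": (367, 371),
    "all CSSCD samples classified by RFLP": (813 - 8, 813),
    "RFLP failures assigned a haplotype": (343, 395),
}

for label, (n, total) in reported.items():
    print(f"{label}: {n}/{total} = {concordance_pct(n, total):.1f}%")
```

To one decimal place this gives 98.9%, 99.0%, and 86.8%, in line with the rounded figures of 98%, >99%, and 86% in the text.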
Calculation of HbF average per haplotype
The average haplotype HbF level of patients classified with our method is consistent with the average HbF in haplotypes reported in the literature based on RFLPs (Table 3) [5, 6]. The average haplotype HbF for samples unclassifiable using RFLPs, but classified using phased SNP data, matched the average HbF for each known haplotype (Table 3). Boxplots of the five most common haplotypes in the CSSCD cohort show the consistency of HbF levels across the three classification groups (Fig. 2).
Classification of haplotype in Saudi sickle cell anemia and sickle iPSCs
Haplotypes among 55 Southwestern Province patients classified using the RFLP method included 39 BEN/BEN, 11 CAR/CAR, 2 BEN/SEN, 1 BEN/CAR, 1 SEN/SEN and one unknown. The distribution of haplotypes for these subjects derived using the SNP method was 48 BEN/BEN, 3 BEN/UNKNOWN, 2 BEN/CAM, and one AI/AI (Table 3). The concordance between RFLP and SNP-based classification was 67%. For the 30 Eastern Province patients, we had 100% concordance since all patients reclassified as AI/AI (Table 3).
In a library of sickle cell anemia iPSCs there was high concordance between the two methods of haplotype ascertainment. The only discordance was in two patients classified originally as BEN/BEN that according to SNP-based reclassification were CAM/BEN and CAR/SEN (Table 4). Importantly, we were able to assign a haplotype to 15 of 17 iPSC samples that were classified as either atypical or were indeterminate using RFLPs.
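When calls from the two methods are compared sample by sample, labels such as CAM/BEN and BEN/CAM need to be treated as the same diplotype. The sketch below assumes that the order of the two haplotypes in a label carries no meaning. The function names and the example calls are hypothetical and do not reproduce study data.

```python
def normalize(diplotype: str) -> tuple:
    """Return an order-independent form of a label such as 'CAM/BEN'."""
    first, second = diplotype.upper().split("/")
    return tuple(sorted((first.strip(), second.strip())))


def concordance(rflp_calls, snp_calls) -> float:
    """Fraction of samples whose RFLP and SNP-based diplotypes agree."""
    pairs = list(zip(rflp_calls, snp_calls))
    agree = sum(normalize(r) == normalize(s) for r, s in pairs)
    return agree / len(pairs)


# Hypothetical paired calls, for illustration only.
rflp = ["BEN/BEN", "CAR/CAR", "BEN/SEN"]
snp = ["BEN/BEN", "BEN/BEN", "SEN/BEN"]
print(f"concordance = {concordance(rflp, snp):.2f}")
```

Here the third pair counts as concordant because BEN/SEN and SEN/BEN normalize to the same tuple, while the CAR/CAR versus BEN/BEN pair is discordant.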
In adults, homozygotes for the BEN, CAR, and CAM haplotypes were associated with HbF of 5-7% of total hemoglobin; SEN and AI haplotypes had HbF levels of about 10% and 20%, respectively. Using GWAS data we were able to classify with high accuracy and time efficiency the haplotype of sickle cell anemia patients using four SNPs. The primary feature of our classification method is a phasing step after genotype imputation where SNP alleles are assigned to parental chromosomes, and the haplotype of each chromosome is assigned independently. This method was superior to ascertaining haplotype by RFLP using unphased SNPs at five sites and was successfully applied in a few seconds on a personal computer.
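The idea of assigning each chromosome independently from phased genotypes can be sketched as follows. This is not the authors' script (which is available in the GitHub repository cited in the Methods). The allele patterns and the names `HAP_A`, `HAP_B`, and `HAP_C` are deliberately hypothetical placeholders, because the actual patterns are defined in Table 1 and are not reproduced here. The only format assumption is the standard VCF convention that a phased genotype is written with `|` (for example `0|1`), with 0 for the reference and 1 for the alternate allele.

```python
from typing import Dict, Tuple

# The four SNPs named in the text, in a fixed order.
SNPS = ("rs3834466", "rs28440105", "rs10128556", "rs968857")

# HYPOTHETICAL patterns and names for illustration only; the real
# allele pattern of each haplotype is given in Table 1 of the article.
PLACEHOLDER_PATTERNS: Dict[Tuple[int, ...], str] = {
    (0, 0, 0, 0): "HAP_A",
    (0, 0, 0, 1): "HAP_B",
    (1, 1, 1, 0): "HAP_C",
}


def split_phased(gt: str) -> Tuple[int, int]:
    """Split a phased VCF genotype such as '0|1' into its two alleles."""
    if "|" not in gt:
        raise ValueError(f"genotype {gt!r} is not phased")
    left, right = gt.split("|")
    return int(left), int(right)


def classify(genotypes: Dict[str, str],
             patterns: Dict[Tuple[int, ...], str] = PLACEHOLDER_PATTERNS) -> str:
    """Assign a haplotype to each chromosome independently and join the two labels."""
    alleles = [split_phased(genotypes[snp]) for snp in SNPS]
    chrom1 = tuple(a[0] for a in alleles)  # alleles to the left of '|'
    chrom2 = tuple(a[1] for a in alleles)  # alleles to the right of '|'
    return "/".join(patterns.get(c, "UNKNOWN") for c in (chrom1, chrom2))


sample = {"rs3834466": "0|1", "rs28440105": "0|1",
          "rs10128556": "0|1", "rs968857": "1|0"}
print(classify(sample))  # HAP_B/HAP_C
```

With unphased genotypes the same sample would be heterozygous at all four sites, and the two chromosomes could not be separated. Phasing is what makes the per-chromosome lookup possible.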
Haplotyping errors can occur using the SNP-based method because of the SNP genotyping platform, imputation errors, and ambiguities arising from phasing algorithms. Nevertheless, in African-origin patient samples, we achieved a concordance of 99% (805/813) between 4-SNP haplotypes derived from a phasing algorithm using GWAS data and 5-SNP haplotypes determined using restriction analysis. Haplotype assignment in a sickle cell anemia iPSC library also showed high concordance and demonstrated the ability of the SNP-based method to classify samples that failed RFLP classification. The discordance between SNP-based and RFLP ascertainment most likely resulted from errors in the RFLP classification, which is sensitive to the presence of other SNPs in the restriction sites and to the vagaries of the restriction enzyme analysis and Southern blotting that were used for haplotype analysis in the CSSCD.
In 30 Saudi East patients where the AI haplotype was ascertained by genotyping rs7482144 (Xmn1 5′ to HBG2), rs3834466 (Hinc2 5′ to HBE1), and rs549964658 (5′ to HBD) we had 100% concordance.
The major discordance between RFLP and SNP-based analysis for classification of Saudi Southwestern patients occurred among 11 subjects classified as CAR/CAR. Eight of these 11 were reclassified as BEN/BEN and three as BEN heterozygotes. The only difference between CAR and BEN haplotypes is the SNP rs968857 at the HincII site 5′ of HBD (Fig. 1, Table 2). It is most likely that this discrepancy was a result of an error in RFLP analysis. If the discordance were due to imputation quality, the error rate would probably match the imputation error. Instead, there is 100% discordance at this HincII site while the imputation quality score of rs968857 is R2 = 0.99. One patient with HbF of 20.4%, originally classified as SEN/SEN by RFLP, was reclassified as AI/AI.
To investigate the discordance in Southwestern Province patients, we examined the genotype data of the HBB gene cluster downstream of OR51V1 (5′ olfactory receptor gene cluster) and upstream of OR51B4 (3′ olfactory gene cluster) in both patient groups. The 11 CAR/CAR patients that we reclassified as BEN/BEN or BEN heterozygous had the same SNP genotypes as BEN/BEN patients that were classified as such with both methods. The genotype data, average HbF, and the imputation quality score of rs968857 suggest that the high discordance in Southwestern Saudi patients is due to RFLP errors.
A limitation of our method is the dependency on the availability of GWAS data for many SNPs in the β-globin gene region. However, many large patient cohorts have been genotyped using genome-wide SNP arrays. In these patients, haplotype information might be useful as a covariate in a genetic risk analysis. RFLP analysis might be suitable for a small number of patients but requires optimization of all of the individual assays. The main advantage of our haplotype determination method is the rapid classification and high accuracy. This method can also be used for whole genome sequence data classification after SNP calling and phasing. Moreover, it is not sensitive to SNPs that alter the restriction enzyme recognition sequence that can lead to error using RFLPs.
Phased data using only four SNPs allowed unequivocal assignment of a β-globin gene cluster haplotype that was not always possible using a larger number of RFLPs, and was also more accurate. With the availability of genome-wide SNP data our method is rapid and does not require high computational resources.
Piel FB, Steinberg MH, Rees DC. Sickle cell disease. N Engl J Med. 2017;376(16):1561–73.
Akinsheye I, Alsultan A, Solovieff N, Ngo D, Baldwin CT, Sebastiani P, Chui DH, Steinberg MH. Fetal hemoglobin in sickle cell anemia. Blood. 2011;118(1):19–27.
Nagel RL, Fabry ME, Pagnier J, Zohoun I, Wajcman H, Baudin V, Labie D. Hematologically and genetically distinct forms of sickle cell anemia in Africa. The Senegal type and the Benin type. N Engl J Med. 1985;312(14):880–4.
Pagnier J, Mears JG, Dunda-Belkhodja O, Schaefer-Rego KE, Beldjord C, Nagel RL, Labie D. Evidence for the multicentric origin of the sickle cell hemoglobin gene in Africa. Proc Natl Acad Sci U S A. 1984;81(6):1771–3.
Perrine RP, Brown MJ, Clegg JB, Weatherall DJ, May A. Benign sickle-cell anaemia. Lancet. 1972;300(7788):1163–7.
Powars DR. Beta s-gene-cluster haplotypes in sickle cell anemia. Clinical and hematologic features. Hematol Oncol Clin North Am. 1991;5(3):475–93.
Sutton M, Bouhassira EE, Nagel RL. Polymerase chain reaction amplification applied to the determination of beta-like globin gene cluster haplotypes. Am J Hematol. 1989;32(1):66–9.
Antonarakis SE, Boehm CD, Giardina PJ, Kazazian HH Jr. Nonrandom association of polymorphic restriction sites in the beta-globin gene cluster. Proc Natl Acad Sci U S A. 1982;79(1):137–41.
Rezende PV, Costa KS, Domingues Junior JC, Silveira PB, Belisário AR, Silva CM, Viana MB. Clinical, hematological and genetic data of a cohort of children with hemoglobin SD. Rev Bras Hematol Hemoter. 2016;38(3):240–6.
Joly P, Lacan P, Garcia C, Delasaux A, Francina A. Rapid and reliable beta-globin gene cluster haplotyping of sickle cell disease patients by FRET light cycler and HRM assays. Clin Chim Acta. 2011;412(13-14):1257–61.
Solovieff N, Milton JN, Hartley SW, Sherva R, Sebastiani P, Dworkis DA, Klings ES, Farrer LA, Garrett ME, Ashley-Koch A, et al. Fetal hemoglobin in sickle cell anemia: genome-wide association studies suggest a regulatory region in the 5′ olfactory receptor gene cluster. Blood. 2010;115(9):1815–22.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.
Das S, Forer L, Schonherr S, Sidore C, Locke AE, Kwong A, Vrieze SI, Chew EY, Levy S, McGue M, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48(10):1284–7.
Loh P-R, Palamara PF, Price AL. Fast and accurate long-range phasing in a UK biobank cohort. Nat Genet. 2016;48(7):811–6.
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.
Casbon J, PyVCF. https://github.com/jamescasbon/PyVCF/. Accessed 20 May 2017.
Pysam-developers, Pysam. https://github.com/pysam-developers/pysam/. Accessed 13 May 2017.
Revelle W, (2017) psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois, USA, https://cran.r-project.org/package=psych. Version = 1.7.5.
Alsultan A, Solovieff N, Aleem A, AlGahtani FH, Al-Shehri A, Elfaki Osman M, Kurban K, Bahakim H, Kareem Al-Momen A, Baldwin CT, et al. Fetal hemoglobin in sickle cell anemia: Saudi patients from the southwestern province have similar HBB haplotypes but higher HbF levels than African Americans. Am J Hematol. 2011;86(7):612–4.
Park S, Gianotti-Sommer A, Molina-Estevez FJ, Vanuytsel K, Skvir N, Leung A, Rozelle SS, Shaikho EM, Weir I, Jiang Z, et al. A comprehensive, ethnically diverse library of sickle cell disease-specific induced Pluripotent stem cells. Stem Cell Reports. 2017;8(4):1076–85.
Funded in part by R01 HL 068970, RC2 HL 101212, R01 87681 (MHS), T32 HL007501 (EMS), from the NIH Bethesda, MD.
Availability of data and materials
Data can be made available by contacting John J. Farrell at firstname.lastname@example.org
Ethics approval and consent to participate
These studies used non-identifiable archived data from cohort studies of sickle cell anemia patients from the United States and Saudi Arabia, as reported previously (Alsultan et al., 2011; Solovieff et al., 2010). Because the DNA samples for the CSSCD were obtained from NIH Biolinc, no consents were available. For the Saudi patients, approval was obtained from the IRB at King Saud University, Riyadh (09-668).
The authors declare no competing interests.
Cite this article
Shaikho, E.M., Farrell, J.J., Alsultan, A. et al. A phased SNP-based classification of sickle cell anemia HBB haplotypes. BMC Genomics 18, 608 (2017). https://doi.org/10.1186/s12864-017-4013-y
- Sickle cell
- Haplotype classification