Star allele search: a pharmacogenetic annotation database and user-friendly search tool of publicly available 1000 Genomes Project biospecimens

Abstract

Here we describe a new public pharmacogenetic (PGx) annotation database of a large (n = 3,202) and diverse biospecimen collection of 1000 Genomes Project cell lines and DNAs. The database is searchable with a user-friendly, web-based tool (www.coriell.org/StarAllele/Search). This resource leverages existing whole genome sequencing data and PharmVar annotations to characterize *alleles for each biospecimen in the collection. This new tool is designed to facilitate in vitro functional characterization of *allele haplotypes and diplotypes as well as support clinical PGx assay development, validation, and implementation.

Background

Pharmacogenomics (PGx) holds the potential to improve medication management by increasing efficacy and by reducing toxicity [1,2,3,4,5,6,7]. Translating pharmacogenomic research into clinical care, however, requires a robust inter-disciplinary infrastructure [8, 9]. Characterizing the full range of functionally relevant human pharmacogenetic variation is limited by the documented underrepresentation of many communities living in the United States and around the world [10,11,12,13,14,15,16], and this effort would benefit from a large and diverse collection of publicly available and well-characterized cell lines. Such a resource would facilitate a more comprehensive understanding of pharmacogene variation and in vitro drug response [17,18,19,20,21,22,23]. Moreover, a well-characterized and diverse set of publicly available and renewable DNA samples would benefit the clinical communities that require positive and negative controls for assay development, validation, implementation, and proficiency testing for robust PGx testing.

The Genetic Reference and Testing Materials Coordination Program (GeT-RM) has used a variety of clinical testing methods to characterize lymphoblastoid cell line (LCL) DNAs for 28 pharmacogenes [24], and more recently has incorporated next generation sequencing data for the characterization of CYP2D6 [25] (n = 179), as well as CYP2C8, CYP2C9 and CYP2C19 (n = 137) [26]. Here we describe a complementary PGx annotation resource that includes a significantly larger set (n = 3,202) of renewable and publicly available 1000 Genomes Project LCLs and DNAs available through the National Human Genome Research Institute (NHGRI) Sample Repository for Human Genetic Research (https://catalog.coriell.org/1/NHGRI) and the National Institute of General Medical Sciences (NIGMS) Human Genetic Cell Repository (https://catalog.coriell.org/1/NIGMS). This new annotation resource leverages 30x whole genome sequencing (WGS) data [27], is downloadable (Table S1) and may be searched with a user-friendly, web-based tool, Star Allele Search (www.coriell.org/StarAllele/Search).

Construction and content

Table 1 includes a summary of the publicly available 1000 Genomes Project biospecimens included in the star allele annotation database. The majority of the samples (n = 3,023) are available through the NHGRI Sample Repository for Human Genetic Research (https://catalog.coriell.org/1/NHGRI), and the collection of Utah Residents (Centre d’Etude du Polymorphisme Humain (CEPH)) with Northern and Western European Ancestry biospecimens (n = 179) is available through the NIGMS Human Genetic Cell Repository (https://catalog.coriell.org/1/NIGMS). Table S1 lists the NHGRI Sample Repository for Human Genetic Research or NIGMS Human Genetic Cell Repository identifier for every 1000 Genomes Project biospecimen annotated in the star allele annotation database.

Table 1 List of included populations

We leveraged existing publicly available 30x coverage WGS data from 3,202 samples generated and phased by the New York Genome Center (NYGC) [27]. The detailed description of the data collection and analysis can be found in Byrska-Bishop et al. [27]. Briefly, 3,202 samples from the 1000 Genomes Project collection were selected for inclusion [27] in the WGS data collection (Table 1). The sample set includes 2,504 unrelated individuals as well as 698 relatives (that together complete 602 trios) [27], and the WGS data were collected with an Illumina NovaSeq 6000 System [27]. The raw WGS data were aligned to the GRCh38 reference genome, and variant calling was performed with GATK [27, 28]. The WGS variant information was additionally phased into haplotypes; autosomal single nucleotide variants (SNVs) and insertions/deletions (INDELs) were statistically phased using SHAPEIT-duoHMM with pedigree-based correction [27, 29, 30].

We used the phased NYGC WGS variant call format (VCF) files [27] identified through the www.internationalgenome.org website (https://www.internationalgenome.org/data-portal/data-collection/30x-grch38), and accessed from the following publicly available file transfer protocol (FTP) site (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/), which were last modified on 2022-11-14 08:33. This dataset was filtered prior to phasing as described in the NYGC README file, such that passing variants met the following criteria: 1) missing genotype rate < 5%; 2) Hardy Weinberg test P-value > 1e-10; 3) Mendelian error rate ≤ 5%; and 4) minor allele count ≥ 2. We did not perform any additional data post processing. We compared the variants in the phased VCF files against PGx annotations for 12 of the 13 pharmacogenes annotated in PharmVar [31,32,33] version 5.2.13 using ursaPGx [34], which implements Cyrius for CYP2D6 calling using raw WGS binary alignment map (BAM) files [35].
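The filtering criteria above can be written as a simple predicate. The following is a minimal sketch rather than the NYGC pipeline itself; the `SiteStats` class and its field names are hypothetical and only mirror the thresholds quoted from the README.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Per-variant summary statistics (hypothetical field names)."""
    missing_rate: float        # fraction of samples with a missing genotype
    hwe_p: float               # Hardy Weinberg test P-value
    mendel_error_rate: float   # Mendelian error rate
    minor_allele_count: int    # copies of the minor allele in the call set


def passes_filters(site: SiteStats) -> bool:
    """Return True only when a site meets every threshold quoted in the text."""
    return (
        site.missing_rate < 0.05
        and site.hwe_p > 1e-10
        and site.mendel_error_rate <= 0.05
        and site.minor_allele_count >= 2
    )


common_site = SiteStats(missing_rate=0.01, hwe_p=0.2,
                        mendel_error_rate=0.0, minor_allele_count=150)
singleton_site = SiteStats(missing_rate=0.0, hwe_p=1.0,
                           mendel_error_rate=0.0, minor_allele_count=1)

print(passes_filters(common_site))     # True
print(passes_filters(singleton_site))  # False
```

Under these thresholds, a variant observed only once in the call set does not pass, because its minor allele count is below 2.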

The detailed description of the ursaPGx annotation can be found here [34]. Briefly, for each non-CYP2D6 pharmacogene, the star allele defining variants according to PharmVar are extracted from the phased VCF file, and the annotation is assigned when all star allele defining variants are present for a given VCF haplotype. In cases where no complete match between the phased haplotype and any PharmVar star allele occurs, the haplotype is annotated as ambiguous (Amb). The complete list of variants included in the phased VCF used for non-CYP2D6 star allele annotations can be searched at the following website, either by specific rsid (up to 100 rsids can be included in a single search) or by HUGO gene symbol (https://www.coriell.org/SNPSearch/WGS).
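To illustrate what extracting variants per phased haplotype involves, the sketch below reads the `GT` field of VCF data lines, where a `|` separator marks a phased genotype, and collects the non-reference variant IDs carried on each haplotype of one sample. It is a plain-Python illustration, not the ursaPGx code; the function names and the toy records (`rsA`, `rsB`, and their positions) are hypothetical.

```python
def haplotype_alleles(vcf_line, sample_index=0):
    """Return (variant_id, hap1_allele, hap2_allele) for one sample on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    variant_id = fields[2]                      # ID column
    format_keys = fields[8].split(":")          # FORMAT column
    sample_values = fields[9 + sample_index].split(":")
    gt = sample_values[format_keys.index("GT")]
    if "|" not in gt:
        raise ValueError(f"genotype {gt!r} is not phased")
    left, right = gt.split("|")
    # "." marks a missing allele; keep it as None rather than guessing.
    to_allele = lambda a: None if a == "." else int(a)
    return variant_id, to_allele(left), to_allele(right)


def nonref_by_haplotype(vcf_lines, sample_index=0):
    """Collect IDs of variants with a non-reference allele on each haplotype."""
    hap1, hap2 = set(), set()
    for line in vcf_lines:
        if line.startswith("#"):                # skip header lines
            continue
        variant_id, a1, a2 = haplotype_alleles(line, sample_index)
        if a1:                                  # None and 0 (reference) are skipped
            hap1.add(variant_id)
        if a2:
            hap2.add(variant_id)
    return hap1, hap2


# Toy records with placeholder IDs and positions (not real coordinates).
toy_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr10\t100\trsA\tG\tA\t.\tPASS\t.\tGT\t0|1\t1|1",
    "chr10\t200\trsB\tA\tG\t.\tPASS\t.\tGT\t1|0\t0|0",
]

print(nonref_by_haplotype(toy_lines, sample_index=0))  # ({'rsB'}, {'rsA'})
```

The two sets returned for a sample are the inputs that a star allele matching step would compare against the PharmVar definitions.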

CYP2D6 annotations were generated with Cyrius [35] via ursaPGx [34]. A detailed description of the Cyrius annotation approach is given by Chen et al. 2021 [35]; the details most relevant to Star Allele Search are as follows. Cyrius first infers the combined number of CYP2D6 and CYP2D7 copies from the WGS BAM files using the reads mapped to either gene and then uses 117 variants to further differentiate between CYP2D6 and CYP2D7 reads for gene specific copy number inference [35]. The Cyrius output differentiates several classes of annotations (https://github.com/Illumina/Cyrius). For the purposes of Star Allele Search, and to be as consistent as possible with the ursaPGx annotation approach for non-CYP2D6 pharmacogenes, we retained only those annotations where Cyrius indicates a unique and non-ambiguous match to a given PharmVar *allele annotation (“Filter” = “PASS” indicating a passing, confident call, and “Call_info” = “unique_match” indicating a specific match to the annotated PharmVar *allele) in the sample JavaScript object notation (JSON) output file. More detail about the Cyrius annotation for each sample from the JSON output, including the specific variants used for *allele annotation, is included in Table S2.
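The retention rule described above can be sketched as follows. The `Filter`, `Call_info`, and `Genotype` keys are the ones named in this article; the layout of one record per sample ID, the sample IDs, and the example genotypes are assumptions made for illustration and are not taken from Table S2.

```python
import json

AMBIGUOUS = "Amb"


def star_allele_search_call(record):
    """Keep the Cyrius diplotype only for passing, uniquely matched calls."""
    if record.get("Filter") == "PASS" and record.get("Call_info") == "unique_match":
        return record["Genotype"]
    return AMBIGUOUS


# Hypothetical Cyrius-style JSON content (sample IDs and genotypes are made up).
cyrius_json = """
{
  "SAMPLE_1": {"Genotype": "*1/*4",  "Filter": "PASS", "Call_info": "unique_match"},
  "SAMPLE_2": {"Genotype": "*1/*2",  "Filter": "PASS", "Call_info": "more_than_one_match"},
  "SAMPLE_3": {"Genotype": "*2/*10", "Filter": "PASS", "Call_info": "pick_common_allele"}
}
"""

calls = {sample: star_allele_search_call(record)
         for sample, record in json.loads(cyrius_json).items()}
print(calls)  # {'SAMPLE_1': '*1/*4', 'SAMPLE_2': 'Amb', 'SAMPLE_3': 'Amb'}
```

Any call that is not both passing and uniquely matched is reported as ambiguous, which mirrors the strict matching used for the other pharmacogenes.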

Utility and discussion

Here we describe a new public PGx annotation database with a user-friendly, web-based search tool of associated lymphoblastoid cell line and DNA biospecimens. This new resource complements existing databases generated by GeT-RM. GeT-RM works directly with clinical laboratories to develop robust PGx annotated biospecimens designed to serve as reference materials for genetic testing, but this effort is extremely involved and not easily scalable to larger collections of biospecimens.

This new annotation database therefore offers a slightly less robust characterization of a significantly larger collection of diverse biospecimens (Table 1) to support PGx related research efforts and to serve as a starting point for clinical testing communities to identify potentially relevant reference materials for their testing needs. More specifically, Star Allele Search uses a single WGS dataset [27] as well as a single annotation approach. These choices maximize consistency and transparency across all of the biospecimen annotations in the database and are thereby well suited for a large database of thousands of samples. Any researcher interested in using Star Allele Search annotations can view the specific variants included in each *allele annotation (for non-CYP2D6 pharmacogenes at https://www.coriell.org/SNPSearch/WGS, and for CYP2D6 in Table S2), and can view each specific *allele annotation definition at PharmVar (https://www.pharmvar.org/genes). Moreover, as PharmVar releases new versions of their annotations, we are well positioned to periodically update a corresponding version of Star Allele Search shortly thereafter. However, the relatively large size of the biospecimen set is not well suited to the more robust GeT-RM approach that leverages sequencing data collected from multiple laboratories together with multiple annotation analysis pipelines, and then constructs a consensus *allele annotation for each sample for each included pharmacogene [25, 26].

To assess the quality and accuracy of the PGx annotation database, we compared overlapping samples that were already characterized by GeT-RM using next-generation sequencing data, which are available for CYP2C8, CYP2C9, CYP2C19, and CYP2D6 [25, 26]. In total, we identified 87 overlapping samples between GeT-RM and the current annotation dataset [25, 26]. We found 100%, 99% and 97% concordance, respectively between our annotation and the GeT-RM NGS consensus annotation for CYP2C8, CYP2C19, and CYP2C9 [26], and we found 94% concordance between our annotation and the GeT-RM NGS consensus annotation for CYP2D6 [25].

Our CYP2C19 comparison identified a discrepancy for a single sample (NA19122). The GeT-RM NGS consensus is *2/*35 [26], while our annotation is *2|*Amb. We note that as described above, our approach requires a complete match between a given phased haplotype and all of the PharmVar defining variants for a given star allele. For NA19122, the first haplotype included all of the variants required to annotate *2 (non-reference alleles for rs12769205, rs4244285, and rs3758581), consistent with GeT-RM [26]; however, the second haplotype in our phased VCF file includes both variants required to annotate *35 (non-reference alleles for rs12769205 and rs3758581) as well as a non-reference allele at rs17882687, which in our approach precludes it from an unambiguous call of *35 or *15.
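The NA19122 result can be reproduced with a small exact-match sketch. This is an illustration of the strict matching rule as described in this article, not the ursaPGx implementation. It assumes each haplotype is represented by the set of star-allele-defining variants at which it carries a non-reference allele, and the definitions dictionary contains only the two alleles whose defining variants are stated in the paragraph above.

```python
AMBIGUOUS = "*Amb"

# Defining variants as stated in the text; not the full PharmVar definition set.
CYP2C19_DEFINITIONS = {
    "*2": frozenset({"rs12769205", "rs4244285", "rs3758581"}),
    "*35": frozenset({"rs12769205", "rs3758581"}),
}


def call_haplotype(nonref_variants, definitions=CYP2C19_DEFINITIONS):
    """Return the allele whose defining variants equal the observed set, else ambiguous."""
    observed = frozenset(nonref_variants)
    for allele, defining in definitions.items():
        if observed == defining:
            return allele
    return AMBIGUOUS


def call_diplotype(hap1, hap2, definitions=CYP2C19_DEFINITIONS):
    """Join the two phased haplotype calls with '|'."""
    return f"{call_haplotype(hap1, definitions)}|{call_haplotype(hap2, definitions)}"


# Haplotype contents for NA19122 as described in the text.
na19122_hap1 = {"rs12769205", "rs4244285", "rs3758581"}
na19122_hap2 = {"rs12769205", "rs3758581", "rs17882687"}

print(call_diplotype(na19122_hap1, na19122_hap2))  # *2|*Amb
```

Requiring set equality rather than a subset match is what keeps the first haplotype from also matching *35, and what leaves the second haplotype ambiguous once the additional non-reference allele at rs17882687 is present.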

We identified discordant CYP2C9 star allele annotations for three samples. Our approach annotated two samples (NA19143 and NA19213) as *1|*1 while the GeT-RM NGS consensus is *1/*6 [26]. This discrepancy is due to the limitation of the WGS phased VCF file we used which unfortunately does not contain rs9332131, the single base deletion that defines *6. We annotated the third discordant sample (HG01190) to be *61|*1, whereas the GeT-RM NGS consensus is *2/*61 [26]. We believe this difference is due to differences in variant calling and phasing approaches. In the phased VCF we used, this sample is heterozygous for both variants required to annotate *61 (rs1799853 and rs202201137), and both of these variants occur on the first haplotype of the sample. Here we also note that while the consensus annotation is *2/*61, a minority of the groups participating in the study annotated this sample as *1/*61 [26].
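The CYP2C8, CYP2C19, and CYP2C9 concordance percentages are consistent with the discordant sample counts described above. The short check below assumes that all 87 overlapping samples were compared for each gene and that CYP2C8 had no discordant samples (inferred from the reported 100%); both points are assumptions rather than statements from the source.

```python
def concordance_percent(n_overlap, n_discordant):
    """Percentage of overlapping samples with matching annotations."""
    return 100.0 * (n_overlap - n_discordant) / n_overlap


N_OVERLAP = 87  # overlapping samples reported in the text
discordant = {"CYP2C8": 0, "CYP2C19": 1, "CYP2C9": 3}

for gene, n_bad in discordant.items():
    print(gene, round(concordance_percent(N_OVERLAP, n_bad)))
# CYP2C8 100
# CYP2C19 99
# CYP2C9 97
```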

For CYP2D6, we identified a perfect match for 66 overlapping samples, with an additional two samples (NA07000 and NA19143) concordant between the ursaPGx implementation of Cyrius and the tentative GeT-RM assignment designated with parentheses [25]. The matching samples include six samples with more than two copies of CYP2D6 (HG00436:*2 × 2/*71, NA19109:*2 × 2/*29, NA19207:*2 × 2/*10, NA19226:*2/*2 × 2, NA19819:*2/*4 × 2, and NA19920:*1/*4 × 2), which lends confidence to our approach’s ability to detect CYP2D6 copy number variation. We found 16 additional sample annotations where the GeT-RM consensus matches the most confident Cyrius call; however, the Cyrius output notes either more than one match (Table S2, Call_info = more_than_one_match), or an imperfect match to the closest PharmVar annotation (Table S2, Call_info = pick_common_allele). In these 16 cases, the Star Allele Search annotation is listed as ‘Amb’, while the detailed Cyrius raw call (‘Raw_star_allele’ column) and most confident diplotype call (‘Genotype’ column) are included in Table S2. For the discordant NA18519 annotation, the ursaPGx implementation of Cyrius annotated *106/*29, while GeT-RM annotated *1/*29. As far as we can tell from the detail included in Table S2 of the publication’s supplementary materials [25], the *106 defining variant (rs28371733) was not included in the NA18519 annotation assessment; *106 was not detected by the assays used for the full set of included samples (n = 179), including genotyping, PharmacoScan, iPLEX V1.1, CYP2D6 V1.1, a custom panel, and VeriDose, but rather sequencing (Sanger, NGS or SMRT) appears to have been used for a subset of 50 samples that did not include NA18519. The ursaPGx implementation of Cyrius was also not able to fully resolve the diplotype for NA18565 using short read WGS data beyond *36/*36 + *10, while GeT-RM was able to fully resolve the diplotype to *10/*36 × 2 (one *10 allele and a second allele with two copies of *36).

In addition to the GeT-RM annotation benchmarking, we compared PGx annotation using the newest NYGC 30x WGS dataset available [27] against the older Phase 3 10x coverage WGS dataset available for a subset of 2,504 unrelated individuals [36] for CYP2C9 and CYP2C19. Several *allele-defining variants were present in the Phase 3 10x dataset but absent from the NYGC phased 30x VCF files (Table S3). In particular, these are the defining variants for CYP2C9 *6 (rs9332131, A deletion), *7 (rs67807361, A allele), *16 (rs72558192, G allele), *33 (rs200183364, A allele), *36 (rs114071557, G allele), *45 (rs199523631, T allele), *63 (rs141489852, A allele), *68 (rs542577750, A allele), and *73 (rs17847037, T allele), and for CYP2C19 *16 (rs192154563, T allele), *24 (rs118203757, A allele), and *30 (rs145328984, T allele).
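Because the matching rule needs every defining variant of a *allele to be present in the input, alleles with a defining variant missing from the VCF cannot be called. A check of this kind can be sketched as below; the function name is hypothetical, and the example uses only the CYP2C9 variants named in this article (rs9332131 for *6, and rs1799853 with rs202201137 for *61).

```python
def uncallable_alleles(definitions, vcf_variant_ids):
    """Map each allele to the defining variants that are absent from the VCF."""
    present = set(vcf_variant_ids)
    missing = {}
    for allele, defining in definitions.items():
        absent = sorted(set(defining) - present)
        if absent:
            missing[allele] = absent
    return missing


# Defining variants as stated in the text; not the full PharmVar definition set.
cyp2c9_definitions = {
    "*6": {"rs9332131"},
    "*61": {"rs1799853", "rs202201137"},
}

# Variant IDs assumed present in the phased VCF for this toy example.
vcf_ids = {"rs1799853", "rs202201137"}

print(uncallable_alleles(cyp2c9_definitions, vcf_ids))  # {'*6': ['rs9332131']}
```

Running such a check before annotation makes explicit which *alleles a given VCF can never yield, as with the two samples annotated *1|*1 where the GeT-RM consensus is *1/*6.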

In total, Star Allele Search includes 663 diplotypes across 13 pharmacogenes (Table 2, Table S1), excluding diplotypes with one or two ambiguous (i.e., Amb) allele calls. Each unique diplotype and associated diplotype frequency in the database is detailed in Table S4, and each unique *allele haplotype and associated haplotype allele frequency in the database is detailed in Table S5. To determine the contribution of the larger sample set included in the database, we identified 3, 3, 7, and 10 new *alleles, respectively, in the dataset relative to GeT-RM [25, 26] for CYP2C19 (*11, *22, *34), CYP2C8 (*6, *11, *14), CYP2C9 (*12, *13, *14, *29, *31, *44, *66), and CYP2D6 (*27, *32, *34, *49, *84, *86, *117, *121, *125, *139) (Fig. 1, Table S6). We performed a similar comparison for unique pairs of *alleles (diplotype combinations). We chose to conservatively exclude ambiguous calls, copy number variants and complex CYP2D6 structural variants and identified 12, 17, 23, and 129 new diplotypes, respectively, for CYP2C8, CYP2C19, CYP2C9, and CYP2D6 (Fig. 1, Table S6).
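A diplotype tally that excludes ambiguous calls can be sketched as follows. This is only an illustration: the example calls are made up, and two choices are assumptions that the source does not specify, namely treating *1|*2 and *2|*1 as the same diplotype and using only the non-ambiguous calls as the frequency denominator.

```python
from collections import Counter


def diplotype_frequencies(calls):
    """Return {diplotype: frequency} after dropping calls with an ambiguous allele."""
    kept = []
    for call in calls:
        if "Amb" in call:
            continue                      # exclude one or two ambiguous alleles
        # Assumption: allele order within a diplotype is not meaningful here.
        kept.append("|".join(sorted(call.split("|"))))
    if not kept:
        return {}
    counts = Counter(kept)
    return {diplotype: n / len(kept) for diplotype, n in counts.items()}


# Hypothetical calls for four samples.
example_calls = ["*1|*2", "*2|*1", "*1|*1", "*2|*Amb"]
frequencies = diplotype_frequencies(example_calls)

print(len(frequencies))        # 2
print(sorted(frequencies))     # ['*1|*1', '*1|*2']
```

Counting the keys of the returned dictionary gives the number of unique non-ambiguous diplotypes for a gene under these assumptions.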

Table 2 List of PharmVar annotated pharmacogenes, number of diplotypes and *alleles included in database
Fig. 1 New *alleles and diplotypes identified in sample set (n = 3,202) relative to GeT-RM. The top panel of Fig. 1 displays the total number of new star alleles (Y-axis) for CYP2C19 (n = 3), CYP2C8 (n = 3), CYP2C9 (n = 7) and CYP2D6 (n = 10), respectively, along the X-axis in magenta. The bottom panel of Fig. 1 displays the total number of new diplotypes (Y-axis) for CYP2C19 (n = 17), CYP2C8 (n = 12), CYP2C9 (n = 23) and CYP2D6 (n = 129), respectively, along the X-axis in purple.

This new star allele annotated biospecimen database is of use for a wide range of applications. For example, researchers interested in functionally characterizing *alleles of interest can use the resource to choose LCLs with the most relevant diplotype combinations; researchers interested in developing new PGx assays can use the resource to benchmark performance; and clinical laboratories can use the resource to minimize the number of positive and negative control DNAs needed for a given PGx test.

We have additionally developed Star Allele Search (Fig. 2), which is a web-based search tool of the new PGx biospecimen annotation database to facilitate these types of research and clinical applications. In addition to this new database and search tool, users can choose to search the WGS data one variant at a time, up to one hundred variants at a time, or by gene (https://www.coriell.org/SNPSearch/WGS; [27]). Users can also search gene expression data collected from a subset of the *allele annotated LCLs (n = 462) (http://omicdata.coriell.org/geuv-expression-browser/; [37]).

Fig. 2 Star Allele database search results example for CYP2C19. Figure 2 displays a screenshot of the web-based Star Allele Search. This example displays results for CYP2C19, chosen from the dropdown search on the top, left-hand side of the page. The user may choose to view the list of PharmVar annotated pharmacogenes, the NCBI entry for the selected gene, the associated Gene Search page (which will display all of the variants included in the 30x WGS dataset for the selected gene), or to return to the general search page. The user may choose to export the Star Allele search results to a CSV file by clicking the green button on the right-hand side of the page. The user may additionally choose to filter by a given Star Allele diplotype, and this filter dropdown also displays the number of samples with each corresponding diplotype. Figure 2 displays results after filtering for *2|*2 diplotypes in the database.

All of these genomic data search tools are designed to complement each other to ensure researchers have a simple way to search a large collection of biospecimen genetic, genomic, and transcriptomic profiles with a web-based interface that does not require bioinformatic skill or experience. For example, a researcher interested in developing a CYP2C19 assay could first view, sort, filter and/or download a comma-separated value (CSV) file of all of the CYP2C19 variants included in the WGS dataset with a single HUGO symbol search (https://www.coriell.org/SNPSearch/WGS) to confirm the variants of interest are present in the data; then view, sort, filter and/or download a CSV of the annotated CYP2C19 *alleles for the entire sample set with Star Allele Search (www.coriell.org/StarAllele/Search) to identify the biospecimens with the relevant diplotypes; if an alternate annotation scheme is needed (i.e., not PharmVar), the researcher can view, sort, filter, and/or download a CSV of up to 100 individual CYP2C19 variants at a time to investigate any alternate combination of variants needed for the alternative annotation scheme (https://www.coriell.org/SNPSearch/WGS).

It is important to note the limitations of our approach and annotations. Our database annotations are based on short-read (150 base pair, paired-end), 30x coverage WGS and computational phasing [27]. Any error in variant calling, any missing single nucleotide or larger structural variation, and any error in phasing in the input VCF will propagate into annotation errors for the non-CYP2D6 pharmacogenes included in Star Allele Search. Similarly, any error or missing single nucleotide or larger structural variation in the BAM files analyzed with Cyrius will produce errors in the CYP2D6 annotations included in Star Allele Search. While this is the most robust, large-scale WGS dataset available for this sample set at present, we anticipate that as long-read sequencing becomes more affordable and more accessible, phase uncertainty (particularly for rare variants) will decrease substantially and structural variation resolution will improve substantially. We also employed PharmVar annotation for our database and chose a strict matching requirement for each *allele annotation. This choice resulted in several ambiguous biospecimen calls in cases where one or both phased haplotypes were not an exact match to any PharmVar-defined *allele. The number of pharmacogenes annotated in our database is limited by the number of genes annotated by PharmVar, which currently includes thirteen pharmacogenes. Although the number of genes is limited, the clinical impact of these pharmacogenes is significant, with CYP2C9, CYP2C19, CYP2D6, CYP3A4, CYP3A5, CYP2A6, CYP2B6, and CYP2C8 alone metabolizing the vast majority of drugs in clinical use (e.g., [38]). Our automated approach, however, facilitates version updates to Star Allele Search as PharmVar releases new annotation versions with additional pharmacogenes.

Conclusion

We have developed a public resource of PGx annotation for a large (n = 3,202) and diverse set of 1000 Genomes Project LCLs and DNAs that are available for general research use. This new resource includes a database of star allele annotation for each biospecimen and an accompanying web-based search tool (www.coriell.org/StarAllele/Search). This new tool is especially relevant to researchers interested in in vitro functional characterization of *alleles as well as for use in support of clinical PGx assay development, validation, and implementation.

Availability of data and materials

The detailed database content is available in Table S1 and can be searched through a web-based tool (www.coriell.org/StarAllele/Search). The majority of associated 1000 Genomes Project biospecimens (LCLs and DNA) are available through the NHGRI Sample Repository for Human Genetic Research (https://catalog.coriell.org/1/NHGRI), and the collection of Utah Residents (CEPH) with Northern and Western European Ancestry biospecimens is available through the NIGMS Human Genetic Cell Repository (https://catalog.coriell.org/1/NIGMS).

References

  1. Zhang G, Zhang Y, Ling Y, Jia J. Web resources for pharmacogenomics. Genomics Proteomics Bioinformatics. 2015;13(1):51ā€“4.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  2. Bank PCD, Swen JJ, Guchelaar HJ. Implementation of pharmacogenomics in everyday clinical settings. Adv Pharmacol. 2018;83:219ā€“46.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  3. Dunnenberger HM, Crews KR, Hoffman JM, Caudle KE, Broeckel U, Howard SC, Hunkler RJ, Klein TE, Evans WE, Relling MV. Preemptive clinical pharmacogenetics implementation: current programs in five US medical centers. Annu Rev Pharmacol Toxicol. 2015;55:89ā€“106.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  4. Gharani N, Keller MA, Stack CB, Hodges LM, Schmidlen TJ, Lynch DE, Gordon ES, Christman MF. The Coriell personalized medicine collaborative pharmacogenomics appraisal, evidence scoring and interpretation system. Genome Med. 2013;5(10):93.

    ArticleĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  5. Relling MV, Krauss RM, Roden DM, Klein TE, Fowler DM, Terada N, Lin L, Riel-Mehan M, Do TP, Kubo M, et al. New pharmacogenomics research network: an open community catalyzing research and translation in precision medicine. Clin Pharmacol Ther. 2017;102(6):897ā€“902.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  6. Relling MV, Evans WE. Pharmacogenomics in the clinic. Nature. 2015;526(7573):343ā€“50.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  7. Bush WS, Crosslin DR, Owusu-Obeng A, Wallace J, Almoguera B, Basford MA, Bielinski SJ, Carrell DS, Connolly JJ, Crawford D, et al. Genetic variation among 82 pharmacogenes: the PGRNseq data from the eMERGE network. Clin Pharmacol Ther. 2016;100(2):160ā€“9.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  8. Schmidlen T, Sturm AC, Scheinfeldt LB. Pharmacogenomic (PGx) counseling: exploring participant questions about PGx test results. J Pers Med. 2020;10(2):29.

    ArticleĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  9. Herr TM, Peterson JF, Rasmussen LV, Caraballo PJ, Peissig PL, Starren JB. Pharmacogenomic clinical decision support design and multi-site process outcomes analysis in the eMERGE Network. J Am Med Inform Assoc. 2019;26(2):143ā€“8.

    ArticleĀ  PubMedĀ  Google ScholarĀ 

  10. Scheinfeldt LB, Brangan A, Kusic DM, Kumar S, Gharani N. Common treatment, common variant: evolutionary prediction of functional pharmacogenomic variants. J Pers Med. 2021;11(2):131.

    ArticleĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  11. Bentley AR, Callier S, Rotimi CN. Diversity and inclusion in genomic research: why the uneven progress? J Community Genet. 2017;8(4):255ā€“66.

    ArticleĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  12. Martin AR, Gignoux CR, Walters RK, Wojcik GL, Neale BM, Gravel S, Daly MJ, Bustamante CD, Kenny EE. Human demographic history impacts genetic risk prediction across diverse populations. Am J Hum Genet. 2017;100(4):635ā€“49.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  13. Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016;538(7624):161ā€“4.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  14. Scheinfeldt LB, Tishkoff SA. Recent human adaptation: genomic approaches, interpretation and insights. Nat Rev Genet. 2013;14(10):692ā€“702.

    ArticleĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  15. Wojcik GL, Graff M, Nishimura KK, Tao R, Haessler J, Gignoux CR, Highland HM, Patel YM, Sorokin EP, Avery CL, et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570(7762):514ā€“8.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  16. Sirugo G, Williams SM, Tishkoff SA. The missing diversity in human genetic studies. Cell. 2019;177(1):26ā€“31.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  17. Wheeler HE, Dolan ME. Lymphoblastoid cell lines in pharmacogenomic discovery and clinical translation. Pharmacogenomics. 2012;13(1):55ā€“70.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  18. Zhang W, Dolan ME. Use of cell lines in the investigation of pharmacogenetic loci. Curr Pharm Des. 2009;15(32):3782ā€“95.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  19. Choy E, Yelensky R, Bonakdar S, Plenge RM, Saxena R, De Jager PL, Shaw SY, Wolfish CS, Slavik JM, Cotsapas C, et al. Genetic analysis of human traits in vitro: drug response and gene expression in lymphoblastoid cell lines. PLoS Genet. 2008;4(11):e1000287.

    ArticleĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  20. Jack J, Rotroff D, Motsinger-Reif A. Lymphoblastoid cell lines models of drug response: successes and lessons from this pharmacogenomic model. Curr Mol Med. 2014;14(7):833ā€“40.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  21. Morag A, Kirchheiner J, Rehavi M, Gurwitz D. Human lymphoblastoid cell line panels: novel tools for assessing shared drug pathways. Pharmacogenomics. 2010;11(3):327ā€“40.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  22. Green AJ, Anchang B, Akhtari FS, Reif DM, Motsinger-Reif A. Extending the lymphoblastoid cell line model for drug combination pharmacogenomics. Pharmacogenomics. 2021;22(9):543ā€“51.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  23. Shukla SJ, Dolan ME. Use of CEPH and non-CEPH lymphoblast cell lines in pharmacogenetic studies. Pharmacogenomics. 2005;6(3):303ā€“10.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  24. Pratt VM, Everts RE, Aggarwal P, Beyer BN, Broeckel U, Epstein-Baak R, Hujsak P, Kornreich R, Liao J, Lorier R, et al. Characterization of 137 genomic DNA reference materials for 28 pharmacogenetic genes: a GeT-RM collaborative project. J Mol Diagn. 2016;18(1):109ā€“23.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  25. Gaedigk A, Turner A, Everts RE, Scott SA, Aggarwal P, Broeckel U, McMillin GA, Melis R, Boone EC, Pratt VM, et al. Characterization of reference materials for genetic testing of CYP2D6 alleles: a GeT-RM collaborative project. J Mol Diagn. 2019;21(6):1034ā€“52.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  26. Gaedigk A, Boone EC, Scherer SE, Lee SB, Numanagic I, Sahinalp C, Smith JD, McGee S, Radhakrishnan A, Qin X, et al. CYP2C8, CYP2C9, and CYP2C19 characterization using next-generation sequencing and haplotype analysis: a GeT-RM collaborative project. J Mol Diagn. 2022;24(4):337ā€“50.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  27. Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, Corvelo A, Clarke WE, Musunuri R, Nagulapalli K, et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. 2022;185(18):3426-3440 e3419.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  28. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297ā€“303.

    ArticleĀ  CASĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  29. Delaneau O, Marchini J, Zagury JF. A linear complexity phasing method for thousands of genomes. Nat Methods. 2011;9(2):179ā€“81.

    ArticleĀ  PubMedĀ  Google ScholarĀ 

  30. Oā€™Connell J, Gurdasani D, Delaneau O, Pirastu N, Ulivi S, Cocca M, Traglia M, Huang J, Huffman JE, Rudan I, et al. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet. 2014;10(4):e1004234.

    ArticleĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  31. Gaedigk A, Casey ST, Whirl-Carrillo M, Miller NA, Klein TE. Pharmacogene variation consortium: a global resource and repository for pharmacogene variation. Clin Pharmacol Ther. 2021;110(3):542ā€“5.

    ArticleĀ  PubMedĀ  Google ScholarĀ 

  32. Gaedigk A, Whirl-Carrillo M, Pratt VM, Miller NA, Klein TE. PharmVar and the landscape of pharmacogenetic resources. Clin Pharmacol Ther. 2020;107(1):43ā€“6.

    ArticleĀ  PubMedĀ  Google ScholarĀ 

  33. Gaedigk A, Ingelman-Sundberg M, Miller NA, Leeder JS, Whirl-Carrillo M, Klein TE, PharmVar Steering C. The Pharmacogene Variation (PharmVar) consortium: incorporation of the human cytochrome P450 (CYP) allele nomenclature database. Clin Pharmacol Ther. 2018;103(3):399ā€“401.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  34. Gennaro C, Dara K, Jozef M, Neda G, Laura S. ursaPGx: a new R package to annotate pharmacogenetic star alleles using phased whole genome sequencing data. bioRxiv. 2023;2023.2007.2024.550372. https://doi.org/10.1101/2023.07.24.550372.

  35. Chen X, Shen F, Gonzaludo N, Malhotra A, Rogert C, Taft RJ, Bentley DR, Eberle MA. Cyrius: accurate CYP2D6 genotyping using whole-genome sequencing data. Pharmacogenomics J. 2021;21(2):251ā€“61.

    ArticleĀ  PubMedĀ  PubMed CentralĀ  Google ScholarĀ 

  36. Genomes Project C, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68ā€“74.

  37. Lappalainen T, Sammeth M, Friedlander MR, ’t Hoen PA, Monlong J, Rivas MA, Gonzalez-Porta M, Kurbatova N, Griebel T, Ferreira PG, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501(7468):506–11.

  38. Zanger UM, Schwab M. Cytochrome P450 enzymes in drug metabolism: regulation of gene expression, enzyme activities, and impact of genetic variation. Pharmacol Ther. 2013;138(1):103–41.

Acknowledgements

We thank Coriell’s IT team, and John Witherspoon in particular, for supporting the implementation of Star Allele Search.

Funding

This study was funded by NHGRI 5U24HG008736 to LS.

Author information

Contributions

LS designed and implemented the project. NG designed the project. DK, JM, and GC contributed to the design of the project. All authors contributed to writing and editing the manuscript.

Corresponding author

Correspondence to L. Scheinfeldt.

Ethics declarations

Ethics approval and consent to participate

All human data used in this study is publicly available through the 1000 Genomes Project [27], and all associated biospecimens belong to the NIGMS Human Genetic Cell Repository or the NHGRI Sample Repository for Human Genetic Research. All NHGRI Sample Repository for Human Genetic Research biospecimens have been consented for general research use and associated public genomic data sharing. All NIGMS Human Genetic Cell Repository biospecimens included in the 1000 Genomes Project and thereby the current study have been consented for general research use and associated public genomic data sharing.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1. Star Allele Search Database.

Additional file 2: Table S2. Cyrius CYP2D6 Output Details.

Additional file 3: Table S3. CYP2C9 and CYP2C19 Annotation Details.

Additional file 4: Table S4. Diplotype Frequencies.

Additional file 5: Table S5. Star Allele Frequencies.

Additional file 6: Table S6. New Star Alleles and Diplotypes.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Gharani, N., Calendo, G., Kusic, D. et al. Star allele search: a pharmacogenetic annotation database and user-friendly search tool of publicly available 1000 Genomes Project biospecimens. BMC Genomics 25, 116 (2024). https://doi.org/10.1186/s12864-024-09994-6

  • DOI: https://doi.org/10.1186/s12864-024-09994-6

Keywords