Skip to main content

Characterization of sequence variations in the extended flanking regions using massively parallel sequencing in 21 A-STRs and 21 Y-STRs

Abstract

In forensic genetics, utilizing massively parallel sequencing (MPS) to analyze short tandem repeats (STRs) has demonstrated several advantages compared to conventional capillary electrophoresis (CE). Due to the current technical limitations, although flanking region polymorphisms had been mentioned in several previous studies, most studies focused on the core repeat regions of STRs or the variations in the adjacent flanking regions. In this study, we developed an MPS system consisting of two sets of multiplex PCR systems to detect not only the STR core repeat regions but also to observe variants located at relatively distant positions in the flanking regions. The system contained 42 commonly used forensic STRs, including 21 autosomal STRs (A-STRs) and 21 Y-chromosomal STRs (Y-STRs), and a total of 350 male individuals from a Chinese Han population were genotyped. The length and sequence variants per locus were tallied and categorized based on length (length-based, LB), sequence without flanking region (core repeat regions sequence-based, RSB), and sequence with flanking region (core repeat and flanking regions sequence-based, FSB), respectively. Allele frequencies, Y-haplotype frequencies, and forensic parameters were calculated based on LB, RSB, and FSB, respectively, to evaluate the improvement in discrimination power, heterozygosity, and effectiveness of forensic systems. The results suggested the sequence variations have more influence on A-STRs and could improve the identification ability of MPS-STR genotyping. Concordance between MPS and CE methods was confirmed by using commercial CE-based STR kits. The impact of flanking region variations on STR genotype analysis and potential factors contributing to discordances were discussed. A total of 58 variations in the flanking regions (53 SNPs/SNVs and 5 InDels) were observed and most variations (48/58) were distributed in A-STRs. In summary, this study delved deeper into the genetic information of forensic commonly used STR and advanced the application of massively parallel sequencing in forensic genetics.

Peer Review reports

Background

Massively parallel sequencing (MPS), also referred to as next-generation sequencing (NGS), has recently gained significant attention for its immense potential in the field of forensic genetics [1,2,3]. Short tandem repeat (STR), a forensic genetic marker, has gained widespread recognition in criminal investigations and court trials over the past two decades [4, 5]. The utilization of MPS for STR analysis has demonstrated numerous advantages in several previous studies [6, 7]. Compared to conventional capillary electrophoresis (CE), MPS can analyze more STR loci simultaneously without limitation by restrictions in size-based separation or fluorescence dye detection [8, 9]. Besides, the sequence variations within the core repeat and flanking regions of STRs could be identified by sequencing and significantly improved the identification ability of STRs [10,11,12,13,14]. Currently, the mainstream MPS platforms used in forensic were usually the long-reads sequencing platforms, including Novaseq™ (PE250, Illumina, San Diego, CA, USA), MiSeq™ System (PE300, Illumina), MiSeq FGx™ System (PE300, Verogen, San Diego, CA, USA), Ion torrent PGM (SE400, Thermo Fisher, Foster City, CA, USA), Ion torrent S5 (SE400, Thermo Fisher), and MGISEQ-2000 platform (SE400, MGI Tech, Shenzhen, Guangdong, China) [15]. Due to the current technical limitations, the quality of the MPS sequencing read at its end tends to be comparatively lower so that there are restrictions on the length of reads in the flanking regions of STRs. Despite the mention of flanking region polymorphisms in several previous studies [16,17,18,19,20,21,22,23,24,25,26], the majority of research has primarily focused on the core repeat regions of STRs or variations in the adjacent flanking regions. Consequently, the current state of research lacks systematic investigations into the variations in flanking regions, particularly those occurring at more distal positions from the core repeat regions. The sequence variants could also improve the STR performance in forensic application [13] and were likely associated with bioinformatics analysis [14], even caused discordances between MPS and CE methods in some cases [12]. Thus, the flanking region variations in commonly used forensic STRs need to be further explored.

For current commercial MPS STR panels, such as ForenSeq™ DNA Signature Prep Kit (Verogen) and Precision ID GlobalFiler™ NGS STR Panel v2 (Thermo Fisher), the coverage range in the flanking regions of these panels was limited. Thus, we developed a MPS system to detect not only the STR core repeat regions, but also variations located in distal positions in the flanking regions. For the system, DNA samples were amplified using two sets of multiplex PCR assay, respectively. The amplification of each STR locus was performed using two pairs of primers, with each primer pair encompassing a wider flanking region on one side (ensuring coverage of the core repeat region). Consequently, the flanking region variations of each STR loci were maximally detectable within the length range compatible with current MPS technology reads. Hence, the alleles observed in the present study covered more flanking regions. The concordance of the system was confirmed by comparing with CE method. The allele frequencies and forensic parameters were calculated in Chinese Han population. Moreover, we statistically analyzed the sequence variations in the flanking regions and evaluated their impacts on forensic application.

Materials and methods

Sample collection and DNA extraction

This study was approved by the Ethical Committee of Fudan University (2020016). Peripheral blood samples were collected using FTA cards from 350 unrelated Chinese Han male individuals. All individuals provided their written informed consent for the collection of blood samples. Genomic DNA was extracted using the BioRobotEZ1 Advanced XL and EZ1 DNA Investigator kits (Qiagen, Hilden, Germany) according to the manufacturer’s instructions. DNA was then quantified using a Qubit 2.0 Fluorometer and a Qubit dsDNA HS Assay Kit (Thermo Fisher).

Selection of STR loci

In this study, twenty-one autosomal STRs (A-STR) were selected (CSF1PO, FGA, TH01, TPOX, vWA, D1S1656, D2S1338, D2S441, D3S1358, D5S818, D6S1043, D7S820, D8S1179, D10S1248, D12S391, D13S317, D16S539, D18S51, D19S433, D21S11 and D22S1045), which were FBI CODIS core STR loci or frequently used in forensic. Besides, twenty-one Y-chromosomal STRs (Y-STR) were selected (DYS19, DYS385a-b, DYS389I-II, DYS390, DYS391, DYS392, DYS437, DYS438, DYS439, DYS444, DYS447, DYS448, DYS456, DYS533, DYS549, DYS570, DYS576, DYS635 and GATA-H4), which were commonly used for the construction of Chinese Y-STR database.

Construction of multiplex PCR systems

In order to enhance the detection of flanking regions surrounding STRs within MPS reads, two pairs of primers were meticulously designed for each STR, ensuring comprehensive coverage of the core region containing the longest known allele. For each pair of primers, one primer binding site was designed in close proximity to the core repeat region of STR, while the other was extended towards the flanking region to encompass the maximum possible length range suitable for MPS reads (considering primer length, sequencing joint, and barcode; typically reaching approximately 200 bp). It is advisable to avoid utilizing the identified SNP within the adjacent region of STR for primer binding. The schematic diagram of primers design was shown in Fig. 1. The primer pair for each locus was designed using Oligo Primer Analysis Software Version 7 (Molecular Biology Insights, Cascade, USA) (https://oligo.net/), and subsequently assessed for potential non-specific hybridizations through the NCBI Basic Local Alignment Search Tool (BLAST) (http://blast.ncbi.nlm.nih.gov/Blast.cgi). The positions and sizes of the amplicons are presented in Table 1A and 1B, while the primer sequences can be found in Supplementary Table S1.

The designed primers were divided into two sets of multiplex PCR systems (pool1 and pool2) for amplification to avoid interaction. The KAPA 2G Fast Multiplex PCR kit (Roche Diagnostics, Rotkreuz, Switzerland) was employed for multiplex amplification. The amplification reaction comprised of 10 µl of KAPA 2G Fast Multiplex PCR Mix, 5 ng of DNA templates, and 1 µl of primer Mix. Amplification was conducted on a GeneAmp™ 9700 PCR System (Thermo Fisher) with the following parameters: initial denaturation at 96℃ for 3 min; followed by 20 cycles of denaturation at 96℃ for 30 s, annealing at 60℃ for 4 min; final extension step at 72℃ for 20 min and hold at 4℃. Amplicons were purified using the Agencourt AMPure XP PCR Purification System (Beckman Coulte, Fullerton, CA, USA) as recommended by the manufacturers.

Fig. 1
figure 1

Schematic diagram of primers design

Table 1A The amplicon positions and sizes of 21 A-STRs (GRCh38)
Table 1B The amplicon positions and sizes of 21 Y-STRs (GRCh38)

Library preparation, normalized and sequencing

The purified amplicons were ligated to adapters with barcodes using the KAPA 2G Fast Multiplex PCR kit for each library. The ligation reaction included 10 µl of KAPA 2G Fast Multiplex PCR Mix, 1 µl of 4 μm TRA-70xR, 1 µl of 4 μm TRA-50xF, and 8 ng of purified amplicons. The enrichment reaction was performed under the following conditions: initial denaturation at 96℃ for 3 min; followed by a total of ten cycles consisting of denaturation at 96℃ for 30 s, annealing at 58℃ for 30 s, extension at 72℃ for 30s; final extension step at 72℃ for 2 min; and holding temperature set to 4℃. Library purification was carried out using the Agencourt AMPure XP PCR Purification System according to the manufacturer’s recommendations. Subsequently, pool1 and pool2 from the same sample were combined.

The quantity of each library was determined using the Qubit 2.0 Fluorometer with the Qubit dsDNA HS Assay Kit. The concentration unit ng/µl was converted to nM using the formula nM = ng/µl / (660 g/mol ×300 bp) ×106 (https://www.science.smith.edu/wp-content/uploads/sites/36/2015/09/Converting-ng-to-nM.pdf). Each library was normalized and pooled to an equimolar concentration of 10 nM, as recommended by the manufacturer. Pooled libraries were sequenced on the MiSeq FGx™ System (Verogen) with Miseq V3 reagent (PE300 strategy), following standard protocol. A total of five sequencing runs were performed, each consisting of 70 samples along with positive and negative controls.

Data analysis

The raw reads were filtered by removing the uncompleted ends and sequencing errors. The reads were then screened and only the ones contain the primer sequences for STRs were used for STR genotyping. A custom Perl script was used to annotate the STRs by directly comparing the sequence in each read with core repeat region for each STR allele. The structure of the STR core repeat region was compiled based on data from the STR Sequencing Project [27] or previous studies’ reports [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]. Reads which contain the primer sequences for STRs but can’t annotate to any collected STR allele were used for variation calling. These reads were firstly mapped to human reference genome (GRCh38) using bowtie (version 2-2.2.5) with the “--very-sensitive-local” model [28]. Sequences that differ by up to 1 bp substitution or 1–4 bp insertion/deletion were tolerated in order to get more information of variations. Variations with support of both forward and reverse reads were recorded. The selected reads were split into 5’ flanking, 3’ flanking and core repeat sequences depending on the STR nomenclature recommended by the DNA commission of the International Society for Forensic Genetics (ISFG) [29, 30]. A custom Perl script was used to annotate the sequence variation in each read. Alleles with a minimum of 30× coverage were used for further analysis. The balance threshold of alleles within the locus was set at 60%, and the stutter threshold was uniformly set at 30%. The core repeat sequences from one sample should be consistent between pool1 (5’ flanking) and pool2 (3’ flanking). Thus, alleles from two pools were merged according to core repeat sequences manually. All sequence data were exported and reviewed manually. All alleles identified in the sequencing analysis were then integrated for further analysis.

Capillary electrophoresis (CE) assessing concordance

Ninety-two samples were randomly chosen for the purpose of assessing concordance (The samples with allele dropout due to quality control were excluded). The overlap between STR genotypes generated using MPS and CE methods was assessed by employing two commercially available CE-based STR kits. The GlobalFiler™ PCR Amplification kit (Thermo Fisher) was employed for typing A-STRs, while the Yfiler™ Platinum PCR Amplification kit (Thermo Fisher) was used for typing Y-STRs. These two kits encompassed all 42 STRs except for D6S1043. The ABI 3500XL Genetic Analyzer (Thermo Fisher) was utilized to separate and detect the PCR products in accordance with the manufacturer’s guidelines. The electrophoretic results were subsequently analyzed using GeneMapper® ID-X software v1.4 (Thermo Fisher). To assess any discrepancies in the CE-based typing, we utilized the binary sequence alignment (BAM) file and examined it with the Integrative Genomics Viewer (IGV) [31].

Identification of sequence variations

All merged alleles were integrated following the ISFG recommendations, enabling comparison of these data with the STR Sequencing Project records [27] and previous studies [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]. Sanger sequencing was conducted to verify sequence variations not present in the STR Sequencing Project or previous studies, utilizing a BigDye1 Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher).

Forensic parameters and Y-haplotype frequencies

The length and sequence variants per locus were tallied and categorized based on length (length-based, LB), sequence without flanking region (core repeat regions sequence-based, RSB), and sequence with flanking region (core repeat and flanking regions sequence-based, FSB), respectively. For A-STRs, the forensic parameters were calculated using STRAF [32], which encompassed allele count (Nall), genotype count (N), genetic diversity (GD), expected heterozygosity (Hexp), random match probability (RMP), discrimination power (DP), polymorphism information content (PIC), power of exclusion (PE), typical paternity index (TPI) and observed heterozygosity (Hobs). Allele frequencies for LB, RSB, and FSB were calculated in Excel by determining the ratio of the allele count to the total count. HWE was tested using STRAF with a Bonferroni correction [33] applied for multiple comparisons. For Y-STRs, the frequencies of Y-haplotypes were determined using a direct counting method that incorporated LB, RSB, and FSB.

Results

Sequencing performance and quality control

The analysis of 350 samples involved a total of five sequencing runs. A total of 93,185,964 reads were obtained. There was a good balance among the samples, and at least 50,000 reads were obtained for each sample, indicating that the library was pooled well. After removing invalid data (reads with a coverage of less than 30× and reads of stutters) and primer dimers, the average depth of coverage (Doc) of each STR locus was 330×. The average Doc of A-STR was 456×, which was significantly higher than that of Y-STR (191×). The average Doc of pool1 was 313×, and 347× for pool2. For A-STRs, the highest average Doc locus was D5S818 pool1 (1891×) and the lowest was D21S11 pool2 (175×). For Y-STRs, the highest average Doc was DYS391 pool2 (592×), and the lowest was DYS447 pool2 (67×). The Doc of each STR locus was shown in Supplementary Table S2.

Sequence merging and concordance analysis

The sequence merging was performed manually. For A-STRs, all profiles of 350 samples were obtained. The core repeat sequences of each A-STR from pool1 and pool2 of the same sample were compared and confirmed to be consistent. The minimum sequencing depth was higher than 30×. The Y-STRs exhibited a lower depth of coverage compared to the A-STRs. A limited number of Y-STR alleles (DYS19 (6/350), DYS389I-II (2/350), DYS437 (8/350), DYS447 (11/350), DYS448 (16/350), DYS533 (2/350), and GATA-H4 (2/350)) either experienced dropout or had sequencing depths below 30× in single or double pools, rendering the MPS-STR typing for these loci invalid and excluding them from the statistical analysis. Ninety-two random samples were genotyped using two commercial CE-based STR kits and concordance between two methods was confirmed (alleles which were dropout due to quality control were not taken into consideration).

Variations in the flanking regions

A total of 58 variations in the flanking regions (53 SNPs/SNVs and 5 InDels) were observed at 25 loci in 350 samples. The majority of these variations (48/58) were distributed in A-STRs. With the exception of CSF1PO, D3S1358, FGA, and D12S391, variations were observed in the flanking regions. Among these loci, D7S820 exhibited the highest number of variations (seven), while four variations each were observed in the flanking regions of D1S1656, TH01, and TPOX. Other variations (10/58) were distributed in 8 Y-STRs, of which two variations were observed in DYS437 and DYS438, respectively. Among 58 variations in the flanking regions, 7 were not recorded in the Single-Nucleotide Polymorphism database (dbSNP, https://www.ncbi.nlm.nih.gov/snp/). All variations were verified using Sanger sequencing (data not provided) and were listed in Supplementary Table S3.

Diversity of observed alleles

For all 350 samples, all A-STR alleles (14,700) and a total of 7,301 Y-STR alleles were identified (49 Y-STR alleles were excluded due to quality control measures or dropout). The implementation of MPS analysis resulted in a significant enhancement in allele diversity. The magnitude of this enhancement varied across different loci in terms of the number of unique alleles observed.

For the A-STRs, a total of 228 unique LB alleles, 348 unique RSB alleles, and 480 unique FSB alleles were observed. The application of MPS resulted in an increase in the number of identified unique alleles for all A-STRs compared to CE. Notably, twelve loci exhibited increased allele diversity due to variations in both core repeat and flanking region sequences (D1S1656, D2S1338, D2S441, D5S818, D6S1043, D7S820, D8S1179, D13S317, D18S51, D19S433, D21S11, and vWA), while four loci showed increased allele diversity solely due to variations in the core repeat regions (D3S1358, D12S391, FGA, and CSF1PO) and five loci displayed an increase in allele diversity only due to variations in the flanking regions (D10S1248, D16S539, D22S1045, TH01, and TPOX). It was found that D2S1338 and D21S11exhibited the highest levels of genetic diversity with 49 unique alleles each, whereas CSF1PO showed relatively low levels with only eight unique alleles observed. D2S1338 demonstrated the highest growth rate (276.92%), whereas D18S51 exhibited the lowest growth rate (5.88%). The comparison of unique alleles per A-STR locus among LB, RSB, and FSB was conducted, as depicted in Table 2A and Fig. 2A.

Table 2A The number of unique alleles observed at 21 A-STRs by LB, by RSB, and by FSB
Fig. 2A
figure 2

Allele diversity of A-STRs based on length or sequence (with or without flanking regions)

For the Y-STRs, a total of 148 unique LB alleles, 258 unique RSB alleles, and 271 unique FSB alleles were observed. The increase in diversity for Y-STRs was comparatively smaller than that observed for the A-STRs. Approximately 61.9% of Y-STRs (13/21) exhibited an augmented allelic diversity. Among these loci, seven Y-STRs demonstrated enhanced allele diversity due to variations in both core repeat and flanking region sequences (DYS385ab, DYS390, DYS437, DYS438, DYS447, and GATA-H4). Four Y-STRs (DYS389II, DYS392, DYS448, and DYS635) displayed an increase in allele diversity exclusively owing to differences in core repeat region sequences, whereas two Y-STRs (DYS391 and DYS570) showed enhanced diversity solely due to variations in the flanking regions. For the remaining eight Y-STR loci (DYS19, DYS389I, DYS439, DYS444, DYS456, DYS533, DYS549, and DYS576), no variation was detected either in the core repeat or flanking regions indicating a lack of increase in allele diversity. Notably, among all Y-STRs, DYS447 exhibited not only the highest level of diversity with 53 unique alleles but also showcased a remarkable growth rate of 562.50%. The comparison of unique alleles per Y-STR locus among LB, RSB, and FSB was conducted, as depicted in Table 2B and Figure 2B.

Fig. 2B
figure 3

Allele diversity of Y-STRs based on length or sequence (with or without flanking regions)

Table 2B The number of unique alleles observed at 21 Y-STRs by LB, by RSB, and by FSB

Allele frequencies and forensic parameters

The frequencies of LB, RSB, and FSB alleles were determined for all STR loci in 350 individuals using the counting method. The results are summarized in Supplementary Tables S4 and S5. Following Bonferroni correction (α = 0.05/21), all A-STRs allele data demonstrated adherence to HWE expectations, as indicated in Supplementary Table S6. Therefore, the population data obtained from this study can be considered representative.

The A-STR loci were analyzed using STRAF software, and the forensic parameters (as described in the Materials and Methods section) are presented in Supplementary Table S6. The average GD/Hexp values for A-STRs, when analyzed based on length, were 0.7911; however, when assessed based on the repeat region sequence and accounting for sequence variation in the flanking regions, they increased to 0.8106 and 0.8346, respectively. The combined RMP for the 21 A-STRs were calculated as 6.42 × 10–25, 4.45 × 10–27, and 2.48 × 10–29 for LB, RSB, and FSB, respectively. By solely incorporating sequence variations in the repeat regions, the combined RMP decreased significantly by more than a hundredfold compared to length-based alleles alone. Furthermore, inclusion of sequence variations in both repeat and flanking regions enabled MPS analysis to reduce the combined RMP drastically by over 2.5887 × 104 times compared to CE method implementation. The observed trends were consistent across other forensic parameters resulting from a significant increase in the availability of unique alleles.

A total of 340 distinct Y-STR haplotypes were observed in 350 individuals. Despite an increase in allele diversity, sequence variations did not contribute to the augmentation of unique haplotypes. Out of the 340 unique Y-STR haplotypes, one-time occurrence was noted for 333 haplotypes, twice for 5 haplotypes, thrice for 1 haplotype, and four times for another single haplotype. The random match probability (RMP) was calculated as 0.0031, while the discrimination power (DP) and heterozygosity coefficient (HD) were determined as 0.9969 and 0.9998 respectively.

Discussion

Over the past decade, there has been significant advancement in sequencing technology. Various techniques, including clonal nucleic acid amplification followed by sequencing by synthesis (SBS) [34], nanopore sequencing [35], and real-time single-molecule sequencing [36], have laid the groundwork for novel sequencing approaches. Among these, the advantages of MPS over conventional CE methods have positioned it as a focal point in forensic genetics [1]. MPS-based STR genotyping research has been performed in a large number of previous studies, in which the flanking region variations have been mentioned several times. Silva et al. [21] analyzed MPS-STR data of 59 individuals from the Brazilian population by using the PowerSeq™ Auto System and 17 flanking region variations were observed, of which 16 were located in the flanking regions of 9 A-STRs and another was in Y-STR. Kim et al. [19] reported 15 variations in the flanking regions of 8 A-STR in a Korean population involving 250 individuals. Gettings et al. [14] used the PowerSeq™ Auto System to perform MPS-STR studies on 183 individuals from different ethnic groups and summarized all polymorphisms within 500 bp upstream and downstream of commonly used STR in forensic genetics in the 1000 Genomes Phase 3 database. It was demonstrated that a large number of population genetic data involving STR flanking region variations was needed to support MPS forensic application. Herein, we have developed this MPS system to systematically investigate the flanking polymorphisms of common STRs.

Currently, the majority of MPS-STR studies have utilized commercially available MPS-STR panels, with the ForenSeq™ DNA Signature Prep Kit being widely acknowledged as the most frequently employed panel in MPS-STR analysis. It has consistently demonstrated remarkable efficacy in numerous studies. With the utilization of this commercial MPS kit, Delest et al. [24] conducted a comprehensive MPS-STR investigation in a French population consisting of 169 individuals. Similarly, Khubrani et al. [23] performed an extensive MPS-STR analysis on the Saudi Arabian population, while Wendt et al. [18] carried out an insightful MPS-STR study on Yavapai Native Americans. Precision ID GlobalFiler™ NGS STR Panel v2 is another widely used commercial MPS-STR panel. Dash et al. [26] employed this panel to present a comprehensive investigation on sequence variations, flanking region variations, and allele frequencies at 31 STRs within the Indian population. Similarly, Wang et al. [25] conducted an investigation encompassing Chinese Tibetan and Han populations with Precision ID GlobalFiler™ NGS STR Panel v2, while Barrio et al. [22] presented a study focused on MPS data of 31 STRs from 496 Spanish individuals. In terms of loci overlap, ForenSeq™ DNA Signature Prep Kit encompasses 41 STRs (except for DYS447) investigated in this study, whereas Precision ID GlobalFiler™ NGS STR Panel v2 includes all A-STRs examined. Compared with previous studies, the emphasis of our study was primarily on the detection of variations in the flanking regions. With the premise of ensuring that the core region was detected, two primer pairs per STR were designed to maximize the detectability of flanking region variations within the length range compatible with current MPS technology reads. Among the 58 variations observed in the flanking regions in our study, 37 were located sufficiently distal from the core repeat region to remain undetected by using ForenSeq™ DNA Signature Prep Kit, despite successful detection of STR core repeat region and proximal flanking variations. Although the detected range of Precision ID GlobalFiler™ NGS STR Panel v2 is greater than ForenSeq™ DNA Signature Prep Kit (SE400 > PE300), it should be noted that among the 48 observed variations in the flanking regions of A-STRs, 23 were located outside of the detectable range within the overlapped loci. The detail was shown in Supplementary Table S3. By contrast, all reported variations in the flanking regions of overlapped loci from previous studies above fell within the range of our developed MPS system (listed in Table 1), although certain variants were not detected in this study due to population disparities or their low frequency.

Utilizing MPS to analyze STR can not only distinguish different alleles with the same length, but also identify the variations in the core repeat regions and flanking regions. Integration of the core repeat regions with adjacent flanking regions variations enables the formation of alleles that serve as compound markers, encompassing STR-SNP/SNV/InDel variations. The MPS-STR alleles can significantly enhance the performance of STR in forensic applications by increasing allelic diversity, aiding kinship interpretation, and facilitating mixed sample separation. In this study, the diversity of observed alleles based on LB, RSB, and FSB indicated that the sequence variants had a greater impact on A-STRs and could clearly increase MPS-STR genotyping’s ability to identify individuals. The findings were in line with the reports of previous studies [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]. However, our study revealed a higher diversity of observed alleles based on FSB compared to previous studies, primarily due to enhanced detection of variations in the flanking regions. In contrast, it appears that the impact of sequence variants on Y-STR haplotype typing was comparatively diminished in this study. Comparing the MPS and CE method, there was no increase in the number of unique Y-STR haplotypes despite of the increase Y-STR allelic diversity. Except for the SNP in the flanking region of GATA-H4 (no rs, position: ChrY: 16631756, frequency 68/350), the variations of the detected Y-STR flanking regions were generally low in the population of this study, which did not improve the ability to identify Y-STR haplotypes.

The CE method currently represents the gold standard in forensic genetics [37]. Prior to implementing MPS in forensic casework, it is imperative to ensure the compatibility of MPS data with existing CE-based forensic databases. Therefore, it is crucial to evaluate the concordance between MPS and CE. The presence of variations in the flanking regions not only poses challenges to STR genotype analysis but also contributes to occasional discordances. Although we have confirmed the concordance between our MPS system and CE method through a sample set of ninety-two randomly selected individuals in this study, sporadic discrepancies have been previously reported [12, 13]. In some loci, the variations in the adjacent flanking regions may cause the flanking sequence being structurally the same as the core repeat motifs. For example, in D13S317, the core repeat motif is TATC and the reference sequence (RefSeq) of 3’ flanking region adjacent to the core repeat region is AATCAATC. Two SNP (rs9546005 and rs202043589) are frequently observed so that 3’ flanking region sequence is TATCAATC (rs9546005 was underlined) or TATCTATC (rs202043589 was underlined). The one or two additional repeats are not involved by length-based CE typing, but may be counted by MPS if the bioinformatic algorithms only focus on the core repeat motif. In the East Asia (EAS) population from the 1000 Genomes Phase 3 dataset, the frequency of the rs9546005 T allele is 48.5%, and the frequency of the rs202043589 T allele is 5.9%, whereas the frequencies of rs9546005 and rs202043589 were 50.9% (356/700) and 6.4% (45/700), respectively, in this study. Analogously, another example in D5S818, the 5’ flanking region contains CTCT adjacent to the core repeat motif (ATCT). The variant A allele of rs73801920 can construct an additional ATCT repeat which may be counted by sequence-based typing, whereas would not be counted by CE typing. The 1000 Genomes does not include rs73801920 frequency, and in this study, the frequency of the A allele was 17.6% (123/700). Such potential discordances have been resolved by improvements in bioinformatics analysis. In addition, there were other three scenarios that could cause discordance between MPS and CE as Gettings et al. [11] summarized: InDels in the flanking region: a discordance would occur if an InDel was located inside the CE amplified region but outside the MPS bioinformatic recognition sites. Bioinformatic null allele I: one allele would drop out in the bioinformatic pipeline due to variants in the same region as a bioinformatic recognition site, which led to the incorrect appearance of a homozygote. Bioinformatic null allele II: there was not a bin in the bioinformatic configuration file that matched the observed allele. The challenges presented in Scenarios and can be addressed through advancements in bioinformatics analysis, as discussed earlier. However, Scenario remains unresolved due to the current limitations of MPS read length. In this study, Among the 58 variations in the flanking regions observed, five were flanking region InDels. Especially in D18S51 and D19S433, the variations observed were all InDels. Because these InDels were both within the detection range of MPS and CE, no discordance was observed. In summary, it is very helpful for forensic researchers to understand the flanking region polymorphisms in STR analysis, since discordances between MPS and CE are mainly caused by the variations in the flanking regions.

The detectable lengths of the flanking regions within the range of MPS reads primarily depend on the lengths of the core repeat regions, given that the overall length detected by MPS remains constant. The lengths of the core repeat regions of STRs vary from locus to locus. For locus with relatively short core repeat regions, such as TPOX and TH01, the range of the flanking regions that can be detected is large. The sequence variation in TPOX has rarely been reported in previous studies. In a large population study, Novroski et al. [16, 17] using the MiSeq FGx analyzed 777 unrelated individuals from four major population groups (US Caucasian n = 210, African American n = 200, Hispanic n = 198, and East Asian n = 169). In 27 A-STRs, Only TPOX did not have any sequence variation to be observed. It is indicated that few sequence variations are in the core repeat region or adjacent flanking regions of TPOX. In our study, a total of four variations were observed in the distal flanking region of TPOX (rs149212737, rs115970091, rs1449872726, rs13413321), in which the frequency of rs13413321 was more than half (368/700). Similarly, four variations (rs535300047, rs1472955972, rs1279211197, and rs369097987) were observed in the distal flanking region of TH01 which was rarely reported with sequence variations as well. By contrast, for locus with relatively long core repeat regions, such as FGA and D21S11, the range of the flanking regions that can be detected is limited. In our study, no variation was found in the flanking region of FGA, and only one variation (rs1051967683) was observed in D21S11.

The 58 flanking region variations observed in this study could be divided into three categories: The variations with high polymorphism, such as rs9546005 and rs202043589 of D13S317, rs58390469 and rs79534691 of D2S441, rs1728369 and rs11642858 of D16S539. This category of variations has a great impact on MPS-STR typing and can significantly improve the identification ability of STRs, which are reflected in the forensic parameters (Supplementary Table S6). The variations with low frequency (< 1%), such as rs1336239361 of D2S1338 and rs1030964212 of D8S1179. The increased number of alleles with this kind of flanking region variations has a negligible effect on the observed increase in heterozygosity. Although this category of variations is of limited help in improving the discriminative power of MPS-STR typing, it may play an important role in some cases such as separating mixture samples. The variations in linkage disequilibrium, such as rs75219269, rs11063969, and rs11063971 of vWA. The three SNPs exhibit linkage with each other, despite the relatively high frequency of the mutant type (130/700). Furthermore, these SNPs were consistently observed within the same allele, vWA [CE14]-GRCh38-Chr12-5983868-5984169 [TAGA]3[TGGA][TAGA]3[CAGA]4[TAGA][CAGA][TAGA] 5,983,970-G; 5,984,116-T; 5,984,121-T; 5,984,134-C, and the sequence variations were also detected in the core repeat regions (Supplementary Table S4). This category of variations is essentially equivalent to only one variation.

Conclusion

The effectiveness of MPS as a supplementary tool to the conventional CE method in forensic routine work has been consistently demonstrated by numerous studies. However, there are still a few pending tasks that need to be accomplished prior to its complete implementation. In this study, we presented a multiplex MPS system designed to selectively amplify the flanking regions of STRs, aiming to investigate the potential variations in these regions. The concordance and characterization of MPS were conducted on 21 A-STRs and 21 Y-STRs in a Chinese Han population. The compatibility between MPS and CE methods was reaffirmed. A total of 58 variations, including 53 SNPs/SNVs and 5 InDels, were observed in the flanking regions. Comparisons with previous studies primarily focus on the detection range of our developed system in the overlapped STR loci. Furthermore, the analysis and comparison of allele diversity, allele frequencies, and forensic parameters per locus by LB, RSB, and FSB effectively highlighted the advantages of MPS as well as the significance of polymorphisms in flanking regions.

In conclusion, revealing additional sequence variations in the core repeat and flanking regions of STRs can yield numerous advantages beyond a mere enhancement in discrimination power when utilizing these markers for direct matching or relationship calculations. The identification of variations in the flanking regions can play important rule in both MPS and CE method. For CE, it is advantageous to avoid the presence of polymorphic primer binding sites during assay design, to elucidate the underlying reasons for discordance observed among different primer sets amplifying identical STR loci, and to comprehend the origin of alleles exhibiting atypical sizes. In the case of MPS, it proves beneficial in enhancing allelic diversity, facilitating kinship interpretation, and separating mixture samples, as well as refining both amplification primer design and bioinformatic algorithm development. This study delved deeper into the genetic information of forensic commonly used STR and advanced the application of massively parallel sequencing in forensic genetics.

Data availability

Data is provided within the manuscript or supplementary information files.

Abbreviations

MPS:

Massively parallel sequencing

NGS:

Next-generation sequencing

STRs:

Short tandem repeats

CE:

Capillary electrophoresis

A-STRs:

Autosomal STRs

Y-STRs:

Y-chromosomal STR

LB:

Length-based

RSB:

Core repeat regions sequence-based (sequence without flanking region)

FSB:

Core repeat and flanking regions sequence-based (sequence with flanking region)

Hexp:

Expected heterozygosity

Hobs:

Heterozygosity

GD:

Genetic diversity

PIC:

Polymorphism information content

RMP:

Random match probability

DP:

Discrimination power

PE:

Power of exclusion

TPI:

Typical paternity index

HWE:

Hardy–Weinberg equilibrium

HD:

Haplotype diversity

Doc:

Depth of coverage

SBS:

Sequencing by synthesis

References

  1. Børsting C, Morling N. Next generation sequencing and its applications in forensic genetics. Forensic Sci Int Genet. 2015;18:78–89.

    Article  PubMed  Google Scholar 

  2. Aly SM, Sabri DM. Next generation sequencing (NGS): a golden tool in forensic toolkit. Arch Med Sadowej Kryminol. 2015;65(4):260–71.

    PubMed  CAS  Google Scholar 

  3. Ballard D, Winkler-Galicki J, Wesoły J. Massive parallel sequencing in forensics: advantages, issues, technicalities, and prospects. Int J Legal Med. 2020;134(4):1291–303.

    Article  PubMed  PubMed Central  Google Scholar 

  4. Thompson R, Zoppis S, McCord B. An overview of DNA typing methods for human identification: past, present, and future. Methods Mol Biol. 2012;830:3–16.

    Article  PubMed  CAS  Google Scholar 

  5. Butler JM, Forensic DNA, Typing. Biology, Technology, and Genetics of STR markers. second ed. New York: Elsevier Academic; 2015.

    Google Scholar 

  6. Butler JM, Willis S. Interpol review of forensic biology and forensic DNA typing 2016–2019. Forensic Sci Int Synerg. 2020;2:352–67.

    Article  PubMed  PubMed Central  Google Scholar 

  7. Butler JM. Recent advances in forensic biology and forensic DNA typing: INTERPOL review 2019–2022. Forensic Sci Int Synerg. 2022;6:100311.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Li H, Zhao X, Ma K, Cao Y, Zhou H, Ping Y, Shao C, Xie J, Liu W. Applying massively parallel sequencing to paternity testing on the Ion Torrent Personal Genome Machine. Forensic Sci Int Genet. 2017;31:155–9.

    Article  PubMed  CAS  Google Scholar 

  9. Bruijns B, Tiggelaar R, Gardeniers H. Massively parallel sequencing techniques for forensics: a review. Electrophoresis. 2018;39(21):2642–54.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  10. Gettings KB, Aponte RA, Vallone PM, Butler JM. STR allele sequence variation: current knowledge and future issues. Forensic Sci Int Genet. 2015;18:118–30.

    Article  PubMed  CAS  Google Scholar 

  11. Gettings KB, Kiesler KM, Faith SA, Montano E, Baker CH, Young BA, Guerrieri RA, Vallone PM. Sequence variation of 22 autosomal STR loci detected by next generation sequencing. Forensic Sci Int Genet. 2016;21:15–21.

    Article  PubMed  CAS  Google Scholar 

  12. Devesse L, Ballard D, Davenport L, Riethorst I, Mason-Buck G, Syndercombe Court D. Concordance of the ForenSeq™ system and characterisation of sequence-specific autosomal STR alleles across two major population groups. Forensic Sci Int Genet. 2018;34:57–61.

    Article  PubMed  CAS  Google Scholar 

  13. Devesse L, Davenport L, Borsuk L, Gettings K, Mason-Buck G, Vallone PM, Syndercombe Court D, Ballard D. Classification of STR allelic variation using massively parallel sequencing and assessment of flanking region power. Forensic Sci Int Genet. 2020;48:102356.

    Article  PubMed  CAS  Google Scholar 

  14. Davenport L, Devesse L, Syndercombe Court D, Ballard D. Forensic identity SNPs: Characterisation of flanking region variation using massively parallel sequencing. Forensic Sci Int Genet. 2023;64:102847.

    Article  PubMed  CAS  Google Scholar 

  15. Li R, Shen X, Chen H, Peng D, Wu R, Sun H. Developmental validation of the MGIEasy Signature Identification Library Prep Kit, an all-in-one multiplex system for forensic applications. Int J Legal Med. 2021;135(3):739–53.

    Article  PubMed  Google Scholar 

  16. Novroski NMM, King JL, Churchill JD, Seah LH, Budowle B. Characterization of genetic sequence variation of 58 STR loci in four major population groups. Forensic Sci Int Genet. 2016;25:214–26.

    Article  PubMed  CAS  Google Scholar 

  17. Churchill JD, Novroski NMM, King JL, Seah LH, Budowle B. Population and performance analyses of four major populations with Illumina’s FGx Forensic Genomics System. Forensic Sci Int Genet. 2017;30:81–92.

    Article  PubMed  CAS  Google Scholar 

  18. Wendt FR, King JL, Novroski NMM, Churchill JD, Ng J, Oldt RF, McCulloh KL, Weise JA, Smith DG, Kanthaswamy S, Budowle B. Flanking region variation of ForenSeq™ DNA signature Prep Kit STR and SNP loci in Yavapai native americans. Forensic Sci Int Genet. 2017;28:146–54.

    Article  PubMed  CAS  Google Scholar 

  19. Kim EH, Lee HY, Kwon SY, Lee EY, Yang WI, Shin KJ. Sequence-based diversity of 23 autosomal STR loci in koreans investigated using an in-house massively parallel sequencing panel. Forensic Sci Int Genet. 2017;30:134–40.

    Article  PubMed  CAS  Google Scholar 

  20. Phillips C, Devesse L, Ballard D, van Weert L, de la Puente M, Melis S, Álvarez Iglesias V, Freire-Aradas A, Oldroyd N, Holt C, Syndercombe Court D, Carracedo Á, Lareu MV. Global patterns of STR sequence variation: sequencing the CEPH human genome diversity panel for 58 forensic STRs using the Illumina ForenSeq DNA signature Prep Kit. Electrophoresis. 2018;39(21):2708–24.

    Article  PubMed  CAS  Google Scholar 

  21. Silva DSBS, Sawitzki FR, Scheible MKR, Bailey SF, Alho CS, Faith SA. Genetic analysis of Southern Brazil subjects using the PowerSeq™ AUTO/Y system for short tandem repeat sequencing. Forensic Sci Int Genet. 2018;33:129–35.

    Article  PubMed  CAS  Google Scholar 

  22. Barrio PA, Martín P, Alonso A, Müller P, Bodner M, Berger B, Parson W, Budowle B, DNASEQEX Consortium. Massively parallel sequence data of 31 autosomal STR loci from 496 Spanish individuals revealed concordance with CE-STR technology and enhanced discrimination power. Forensic Sci Int Genet. 2019;42:49–55.

    Article  PubMed  CAS  Google Scholar 

  23. Khubrani YM, Hallast P, Jobling MA, Wetton JH. Massively parallel sequencing of autosomal STRs and identity-informative SNPs highlights consanguinity in Saudi Arabia. Forensic Sci Int Genet. 2019;43:102164.

    Article  PubMed  CAS  Google Scholar 

  24. Delest A, Godfrin D, Chantrel Y, Ulus A, Vannier J, Faivre M, Hollard C, Laurent FX. Sequenced-based French population data from 169 unrelated individuals with Verogen’s ForenSeq DNA signature prep kit. Forensic Sci Int Genet. 2020;47:102304.

    Article  PubMed  CAS  Google Scholar 

  25. Wang Z, Wang L, Liu J, Ye J, Hou Y. Characterization of sequence variation at 30 autosomal STRs in Chinese Han and Tibetan populations. Electrophoresis. 2020;41(3–4):194–201.

    Article  PubMed  Google Scholar 

  26. Dash HR, Kaitholia K, Kumawat RK, Singh AK, Shrivastava P, Chaubey G, Das S. Sequence variations, flanking region mutations, and allele frequency at 31 autosomal STRs in the central Indian population by next generation sequencing (NGS). Sci Rep. 2021;11(1):23238.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  27. Gettings KB, Borsuk LA, Ballard D, Bodner M, Budowle B, Devesse L, King J, Parson W, Phillips C, Vallone PM. STRSeq: a catalog of sequence diversity at human identification short Tandem repeat loci. Forensic Sci Int Genet. 2017;31:111–7.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  28. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  29. Parson W, Ballard D, Budowle B, Butler JM, Gettings KB, Gill P, Gusmão L, Hares DR, Irwin JA, King JL, Knijff P, Morling N, Prinz M, Schneider PM, Neste CV, Willuweit S, Phillips C. Massively parallel sequencing of forensic STRs: considerations of the DNA commission of the International Society for Forensic Genetics (ISFG) on minimal nomenclature requirements. Forensic Sci Int Genet. 2016;22:54–63.

    Article  PubMed  CAS  Google Scholar 

  30. Phillips C, Gettings KB, King JL, Ballard D, Bodner M, Borsuk L, Parson W. The devil’s in the detail: release of an expanded, enhanced and dynamically revised forensic STR sequence guide. Forensic Sci Int Genet. 2018;34:162–9.

    Article  PubMed  CAS  Google Scholar 

  31. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–92.

    Article  PubMed  Google Scholar 

  32. Gouy A, Zieger M. STRAF-A convenient online tool for STR data evaluation in forensic genetics. Forensic Sci Int Genet. 2017;30:148–51.

    Article  PubMed  CAS  Google Scholar 

  33. Armstrong RA. When to use the Bonferroni correction. Ophthalmic Physiol Opt. 2017;34(5):502–8.

    Article  Google Scholar 

  34. Yoshinaga Y, Daum C, He G, O’Malley R. Genome sequencing. Methods Mol Biol. 2018;1775:37–52.

    Article  PubMed  CAS  Google Scholar 

  35. Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;39(11):1348–65.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  36. Rhoads A, Au KF. PacBio Sequencing and its applications. Genomics Proteom Bioinf. 2015;13(5):278–89.

    Article  Google Scholar 

  37. Mehta B, Daniel R, Phillips C, McNevin D. Forensically relevant SNaPshot® assays for human DNA SNP analysis: a review. Int J Legal Med. 2017;131(1):21–37.

    Article  PubMed  Google Scholar 

Download references

Acknowledgements

The authors wish to thank all the volunteers who provided the samples used in this study.

Funding

This study was supported by grants from the National Natural Science Foundation of China (81930056 and 82293654), and the Public Interest Research Grant Programs of National Research Institutes (Grant No. GY2024G-3). The funders had no role in study design, data analysis, or manuscript preparation.

Author information

Authors and Affiliations

Authors

Contributions

H.L. and J.X. designed the experiments. H.L. and B.L. were major contributors in conducting experiments. B.L. collected samples. Y.L., F.Y., and Y.C. performed DNA extraction, library preparation, normalized, and sequencing. X.L. performed data analysis. H.L. wrote the manuscript. C.L. and Z.Z. revised the manuscript. All authors reviewed and approved the submitted version of the manuscript.

Corresponding authors

Correspondence to Zhenmin Zhao or Chengtao Li.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the Ethical Committee of Fudan University (2020016). All individuals provided their written informed consent for the collection of blood samples. All the experimental procedures were performed by the standards of the Declaration of Helsinki 1964.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, H., Li, B., Liu, Y. et al. Characterization of sequence variations in the extended flanking regions using massively parallel sequencing in 21 A-STRs and 21 Y-STRs. BMC Genomics 25, 841 (2024). https://doi.org/10.1186/s12864-024-10762-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12864-024-10762-9

Keywords