In silico and in vitro comparative analysis to select, validate and test SNPs for human identification
© Giardina et al; licensee BioMed Central Ltd. 2007
Received: 30 July 2007
Accepted: 12 December 2007
Published: 12 December 2007
Recent advances in human genetics have provided new insights into phenotypic variation and genome variability. Current forensic DNA techniques involve the search for genetic similarities and differences between biological samples. Consequently, the selection of ideal genomic biomarkers for human identification is crucial to ensure the highest stability and reproducibility of results.
In the present study, we selected and validated 24 SNPs useful for human identification in 1,040 unrelated samples originating from three different populations (Italian, Benin Gulf and Mongolian). A rigorous in silico selection of these markers provided a list of SNPs with very constant frequencies across the populations tested, as demonstrated by the Fst values. Furthermore, these SNPs also showed high specificity for the human genome (only 5 SNPs gave positive results when amplified in non-human DNA).
Comparison between in silico and in vitro analysis showed that current SNP databases can efficiently improve and facilitate the selection of markers, because most of the analyses performed (Fst, r², heterozygosity) in more than 1,000 samples confirmed the available population data.
Genetic variation in the human genome takes many forms, ranging from large, microscopically visible chromosome anomalies to single-nucleotide changes. In general, forensic DNA analysis compares biological samples, searching for genetic similarities and differences by typing a small number of genetically variable segments in each sample. Current methods for forensic DNA analysis mainly focus on typing several STR (Short Tandem Repeat) markers. STRs are smaller than VNTR sequences and can easily discriminate between both related and unrelated individuals. At present, different optimized multiplex assays are used to analyze unique STR loci located on different chromosomes, with the advantage of providing an extremely low random match probability (the probability of finding the same DNA profile in a randomly selected, unrelated individual). The only drawback of current forensic DNA typing systems is that the PCR amplicons are sometimes excessively large (ranging from 100 to 400 bases in length). When DNA samples are degraded into smaller fragments, the higher molecular-weight STR loci amplify poorly, producing incomplete DNA profiles with lower discrimination power. The presence of alternative types of polymorphic DNA markers in the human genome, i.e. Single Nucleotide Polymorphisms (SNPs), abundantly spread along the chromosomes, makes it possible to develop DNA profiling techniques able to type smaller fragments of DNA (shorter than 100 bp) than those detectable with STR markers (100–400 bp). In addition, due to the biallelic nature of SNPs, size-based separation is not necessary, which in turn allows a higher level of multiplexing and automation than with STRs. Multiplexing and automation are extremely important in the implementation of large criminal DNA databases [3, 4].
Furthermore, SNPs have a lower mutation rate (about 10⁻⁸) than STRs (ranging from 10⁻³ to 10⁻⁵) [6, 7], and fewer chromosomes need to be typed to assess allele frequencies for SNPs than for STRs because of the smaller number of alleles. Nonetheless, several important factors have to be considered for an accurate selection of the SNPs to be used in forensic analysis. The first issue concerns the correct evaluation of the frequency of each SNP among populations. STR markers have many alleles, each of which shows a low frequency worldwide; consequently, the random match probabilities are usually not strongly dissimilar among populations. Conversely, SNP markers can show very dissimilar frequencies among different populations, so that the match probability depends strongly on the population frequencies used for the calculation. An incorrect evaluation of SNP frequencies may give rise to ambiguity in genetic results under specific conditions or in isolated populations. A second issue originates from the need to ensure the stability and reproducibility of genetic typing, for which forensic SNPs should be found exclusively in the human sequence and mapped to a single locus (single copy SNPs). Another important issue for correct SNP selection follows from the discovery, reported in several recent studies, of abundant submicroscopic copy number variation of DNA segments ranging from kilobases (kb) to megabases (Mb), which includes deletions, insertions, duplications and complex multi-site variants in all humans and other mammals. These copy number variable regions (CNVRs) cover 12% of the genome. It is thus preferable that SNPs used in human identification are located outside the copy number variable regions.
In addition, the SNPs used in forensic analysis should preferentially not be located in coding/regulatory regions, in order to avoid obtaining unnecessary information concerning the health status and/or disease susceptibility or resistance of the individual analyzed, which would give the courtroom the possibility to reject DNA tests. It follows that forensic markers should not have been published as susceptibility/resistance/prognostic factors in scientific publications. The above-mentioned issues highlight the importance of the availability of genetic databases to facilitate SNP selection. In this work, we present a combination of in silico and in vitro genetic analysis to validate and test a panel of 24 SNPs for human identification, selected on the basis of very stringent criteria. A total of 24,960 genotypes of individuals originating from three different continents (Europe, Africa and Asia) were analysed to assess allele frequencies, and the results were compared with currently available SNP databases. The specificity of the selected SNPs for the human genome was tested by combining in silico selection and in vitro testing of non-human DNA. In this study, we provide a list of 24 SNPs specifically designed for human identification and recommend an intensive use of current genetic databases, which can strongly improve the a priori selection of SNPs, because most analyses performed (Fst, r², heterozygosity) in our samples confirmed the in silico analysis.
Chromosome localization, nucleotide position, allele variations, average heterozygosity and Fst values of the selected markers. Minor allele frequencies (MAF) from HapMap data are reported in brackets.
[Table body not recovered. Surviving column headers: HapMap heterozygosity; minor allele frequency (three columns).]
We carried out a study on the selection of a panel of 24 SNPs for forensic analysis. We mainly focused on in silico selection of SNPs to avoid non-specific amplification of non-human DNA and unbalanced allele frequencies among populations. Through in silico searches, we selected SNPs located only in human-specific sequences. In vitro analysis showed that only five SNPs gave positive results when amplified in DNA from the closest living relative of Homo sapiens (Pan troglodytes) and from Macaca fascicularis. We would like to emphasise that this number is lower than the figures reported to date in other studies. This demonstrates how useful it is to select SNPs carefully by comparing the genomic sequences of different species [12–15]. We also excluded interdependence of the selected markers (on both the autosomes and the Y chromosome) by calculating the extent of LD (r² values were not significant).
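The r² statistic mentioned above can be illustrated with a minimal Python sketch. It assumes that the haplotype frequency for the allele pair A-B is already known (with unphased genotypes it would first have to be estimated), and the function name and input values are hypothetical; this is not the LDplotter implementation used in the study.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    # D measures the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either locus is monomorphic")
    return d * d / denom


# Hypothetical values: independent loci, then loci in complete LD.
print(ld_r_squared(0.25, 0.5, 0.5))  # 0.0
print(ld_r_squared(0.5, 0.5, 0.5))   # 1.0
```

Values of r² close to zero, as in the first call, are what one would expect for markers that can be treated as statistically independent.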
When considering the average values (heterozygosity, allele frequencies, Fst values, RMP) of the entire panel of SNPs calculated on the HapMap data, we observed no significant deviation from our own results. Thus, the availability of genetic databases strongly facilitates SNP selection with respect to informative capacity: the a priori selection criteria were confirmed by the genotyping of our populations. In the present study, we report SNPs showing very high heterozygosity values: the mean was 0.473 ± 0.01. Although significant differences between allele frequencies can be observed for specific SNPs, on average the deviation in allele frequencies of the panel between "expected" (HapMap) and "observed" (genotyping) values does not seem to be significant. The differences seen for specific SNPs may be due to divergences between the populations analyzed in this work and those typed in HapMap, or to the limited size of the samples analyzed in the HapMap project. In fact, the mean heterozygosity calculated on the basis of genotype frequencies reported in the HapMap database (0.484 ± 0.04) was quite similar to the one calculated in this work. There is a similar situation regarding differences in Fst values. Accordingly, the differences in random match probabilities calculated on the basis of observed and expected frequencies are not relevant. In particular, the RMP arising from the data (CEU) reported in HapMap was 2.4 × 10⁻¹¹, quite similar to the one calculated in this work, 5.7 × 10⁻¹¹ (Italian). The same situation is evident in the African and Asian populations, where the observed RMPs were similar to those calculated with the HapMap data (in brackets): 7.8 × 10⁻¹⁰ (8.1 × 10⁻¹⁰) in Africans; 3.4 × 10⁻¹¹ (2.4 × 10⁻¹⁰) in Asians. These data suggest that the main effort in developing a SNP panel for forensic purposes should be directed towards the selection of universal forensic SNPs and towards typing admixed populations in order to confirm the worldwide balance of the allele frequencies of these SNPs.
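To make the link between allele frequencies, heterozygosity and match probability concrete, the following Python sketch uses one common textbook definition: expected heterozygosity 2pq, and the probability that two unrelated individuals share a genotype at a biallelic locus under Hardy-Weinberg equilibrium, multiplied across independent loci. The function names and the allele frequencies are hypothetical, and the calculation performed by DNAVIEW in this study may differ in detail.

```python
from math import prod


def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity 2pq of a biallelic SNP with allele frequency p."""
    return 2 * p * (1 - p)


def locus_match_probability(p: float) -> float:
    """Probability that two unrelated individuals share a genotype at one SNP.

    Assumes Hardy-Weinberg genotype frequencies p^2, 2pq and q^2.
    """
    q = 1 - p
    return (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2


def panel_match_probability(freqs) -> float:
    """Product of per-locus match probabilities, assuming independent loci."""
    return prod(locus_match_probability(p) for p in freqs)


# Hypothetical panel: 24 SNPs, each with a perfectly balanced allele frequency.
freqs = [0.5] * 24
print(expected_heterozygosity(0.5))    # 0.5
print(locus_match_probability(0.5))    # 0.375
print(panel_match_probability(freqs))  # about 6e-11
```

Under these assumptions a SNP with p = 0.5 gives the lowest per-locus match probability, which is one reason for preferring markers with a minor allele frequency close to 0.5.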
The efficiency of the programs used for the alignment of genome sequences can strongly improve the process of selection of human-specific SNPs.
In this paper, we selected and validated 24 SNPs useful for human identification. To generate allele frequencies, we used real-time PCR. However, it should be noted that this technique does not allow multiplexing of the PCR and could therefore be of only limited use in forensic practice. We showed that current SNP databases can efficiently improve and facilitate the selection of markers, since most of the reported frequencies have been confirmed in recent work. The most important aspect of the SNP selection criteria is the choice of SNPs with a low Fst among populations and with a single locus location exclusively in the human genome. Rigorous marker selection criteria are fundamental for performing SNP-based human identification. Compared to previously reported SNP selections, here we report a much higher percentage of SNPs (80%) specific for the human genome. We also recommend selecting markers located outside regions containing copy number variations. So far, the availability of several maps of copy number variations in the human genome has shown that we are genetically more diverse than expected and that our genome appears to be even more amorphous and changeable than previously thought. Consequently, the selection of SNPs for human identification needs to consider all these variables in order to ensure the highest stability and reproducibility of results. During the final phases of this work, a CNV map of the human genome was published, based on the study of 270 individuals from four populations typed in the HapMap project. Three new segmental duplications were observed within the RP11-420I5, RP11-530H6 and RP11-469G12 clones, which contain the selected markers rs478347, rs8033863 and rs2317225, respectively.
It is important to underline, however, that forensic markers in current use (vWA and FGA) are also located in CNV regions, and no genotyping errors have been reported to date. Moreover, we do not yet know how many genomic regions may contain CNVs, because the variability of the human genome is much higher than previously supposed and new CNV regions are constantly being reported. Despite their informative capacity and the absence of typing errors observed in this work, the use of the SNPs rs478347, rs8033863 and rs2317225 as markers for forensic purposes should be considered with caution, because typing errors arising from the complexity of the surrounding genomic regions cannot be totally excluded.
Samples and DNA extraction
A total of 1,040 samples from Italy, the Benin Gulf and Mongolia were collected. The QIAamp DNA Blood Mini Kit (QIAGEN Inc., Valencia, CA) was used to extract genomic DNA from whole blood. DNA concentration was determined by real-time PCR using the Quantifiler™ Human DNA Quantification Kit (Applied Biosystems).
A number of selection criteria were used to identify the SNPs considered in this work (Table 1):
(i) SNP location outside coding regions; not published as a resistance/susceptibility allele;
(ii) average minor allele frequency (MAF) of at least 0.4, as reported in population databases;
(iii) SNP location in a single locus in the human genome;
(iv) SNP specificity for the predicted locus, evaluated by comparing the sequences flanking the selected markers with those available in the GenBank database. A heuristic search using the Genome BLAST (blastn) option was performed with default parameters. BLAST (Basic Local Alignment Search Tool) is a sequence similarity search program that finds matches between a tag sequence and the human and non-human sequences available in databases. SNPs whose surrounding sequence was found at sites other than the one predicted, or which showed too much similarity to sequences belonging to the genomes of other species, were not considered candidate markers for human identification;
(v) examination of the flanking regions of the SNPs to ensure that the regions surrounding the polymorphic site contained no additional variations. For two selected SNPs (rs1922807 and rs886528), potentially interfering additional polymorphisms were present at a distance of 46 bp upstream and 40 bp downstream, respectively, but they are not relevant in most genetic typing;
(vi) SNP location outside copy number variation regions, verified by searching the Database of Genomic Variants. In the final phase of this work, novel CNVs were published encompassing regions involving three selected SNPs (rs478347, rs8033863 and rs2317225). The three segmental duplications within the RP11-420I5, RP11-530H6 and RP11-469G12 clones contain SNPs rs478347, rs8033863 and rs2317225, respectively.
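As an illustration only, criteria (i) to (vi) can be expressed as a simple filter over annotated candidates. The record layout, field names and example entries below are hypothetical and are not taken from the study; in practice each field would be filled in from database searches such as those described above. Note that the sketch applies criterion (v) strictly, whereas the authors tolerated nearby polymorphisms for two SNPs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical annotation record for one candidate SNP."""
    snp_id: str
    in_coding_region: bool          # criterion (i)
    published_as_risk_allele: bool  # criterion (i)
    mean_maf: float                 # criterion (ii)
    human_genome_hits: int          # criterion (iii): loci matched in human
    similar_in_other_species: bool  # criterion (iv)
    flanking_variants: int          # criterion (v)
    in_cnv_region: bool             # criterion (vi)


def passes_criteria(c: Candidate, min_maf: float = 0.4) -> bool:
    """Return True only if the candidate satisfies all six criteria."""
    return (
        not c.in_coding_region
        and not c.published_as_risk_allele
        and c.mean_maf >= min_maf
        and c.human_genome_hits == 1
        and not c.similar_in_other_species
        and c.flanking_variants == 0
        and not c.in_cnv_region
    )


# Hypothetical candidates with made-up identifiers and annotations.
candidates = [
    Candidate("snp_ok", False, False, 0.46, 1, False, 0, False),
    Candidate("snp_low_maf", False, False, 0.21, 1, False, 0, False),
    Candidate("snp_in_cnv", False, False, 0.48, 1, False, 0, True),
]
print([c.snp_id for c in candidates if passes_criteria(c)])  # ['snp_ok']
```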
Primer and probe sequences of the SNPs tested.
All the statistical analyses of forensic data were performed as described by Brenner and using DNAVIEW™ 27.19. Statistical independence of the selected markers was assessed by calculating linkage disequilibrium (LD) as r² using the LDplotter software. Inter-population variability in allele frequencies was assessed by calculating Fst for each marker. To test the null hypothesis of no differences in allele frequencies between the HapMap and typed populations, the frequency distribution of alleles was analysed for our samples as well as for the HapMap samples using the χ² test.
This work was supported by financing from the EU FP6 project NACBO (contract no. NMP4-CT-2004-500804).
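Two of the statistics named above can be sketched in a few lines of Python. The Fst estimator shown is the simple heterozygosity-based form (H_T − H_S)/H_T with equally weighted populations, which is an assumption and may differ from the estimator actually used in the study; the χ² statistic is the standard Pearson form for a 2 × 2 table of allele counts. All function names and input values are hypothetical.

```python
def fst_biallelic(freqs) -> float:
    """Fst for one biallelic SNP from per-population allele frequencies.

    Uses (H_T - H_S) / H_T with equal population weights.
    """
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)  # heterozygosity of the pooled population
    if h_t == 0:
        return 0.0  # monomorphic everywhere: no differentiation to measure
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within-pop.
    return (h_t - h_s) / h_t


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom, no correction).

    Rows are the two data sets (e.g. typed sample and database sample),
    columns are the counts of the two alleles: [[a, b], [c, d]].
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical frequencies: identical across populations, then fully divergent.
print(fst_biallelic([0.5, 0.5, 0.5]))  # 0.0
print(fst_biallelic([0.0, 1.0]))       # 1.0

# Hypothetical allele counts: identical data sets, then shifted frequencies.
print(chi_square_2x2(50, 50, 50, 50))  # 0.0
print(chi_square_2x2(60, 40, 40, 60))  # 8.0
```

With one degree of freedom, a statistic above 3.84 corresponds to p < 0.05, so the last example would reject the null hypothesis of equal allele frequencies while the previous one would not.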
- Cotton EA, Allsop RF, Guest JL, Frazier RR, Koumi P, Callow IP, Seager A, Sparkes RL: Validation of the AMPFlSTR SGM plus system for use in forensic casework. Forensic Sci Int. 2000, 112: 151-161. 10.1016/S0379-0738(00)00182-1.
- Golenberg EM, Bickel A, Weihs P: Effect of highly fragmented DNA on PCR. Nucleic Acids Res. 1996, 24: 5026-5033. 10.1093/nar/24.24.5026.
- Schneider PM, Martin PD: Criminal DNA databases: the European situation. Forensic Sci Int. 2001, 119: 232-8. 10.1016/S0379-0738(00)00435-7.
- Martin PD, Schmitter H, Schneider PM: A brief history of the formation of DNA databases in forensic science within Europe. Forensic Sci Int. 2001, 119: 225-31. 10.1016/S0379-0738(00)00436-9.
- Reich DE, Schaffner SF, Daly MJ, McVean G, Mullikin JC, Higgins JM, Richter DJ, Lander ES, Altshuler D: Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet. 2002, 32: 135-140. 10.1038/ng947.
- Huang QY, Xu FH, Shen H, Deng HY, Liu YJ, Liu YZ, Li JL, Recker RR, Deng HW: Mutation patterns at dinucleotide microsatellite loci in humans. Am J Hum Genet. 2002, 70: 625-634. 10.1086/338997.
- Dupuy BM, Stenersen M, Egeland T, Olaisen B: Y-chromosomal microsatellite mutation rates: differences in mutation rate between and within loci. Hum Mutat. 2004, 23: 117-124. 10.1002/humu.10294.
- Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature. 2006, 444: 444-54. 10.1038/nature05329.
- Brenner CH: Forensic Genetics: Mathematics. Encyclopedia of Life Sciences. 2006, John Wiley & Sons, Ltd.
- LDplotter software. [http://www.pharmgat.org/Tools/pbtoldplotform]
- Devlin B, Risch N: A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics. 1995, 29: 311-22. 10.1006/geno.1995.9003.
- Sanchez JJ, Phillips C, Borsting C, Balogh K, Bogus M, Fondevila M, Harrison CD, Musgrave-Brown E, Salas A, Syndercombe-Court D, Schneider PM, Carracedo A, Morling N: A multiplex assay with 52 single nucleotide polymorphisms for human identification. Electrophoresis. 2006, 27: 1713-24. 10.1002/elps.200500671.
- Gusmão L, González-Neira A, Alves C, Lareu M, Costa S, Amorim A, Carracedo A: Chimpanzee homologous of human Y specific STRs. A comparative study and a proposal for nomenclature. Forensic Sci Int. 2002, 126 (2): 129-36. 10.1016/S0379-0738(02)00046-4.
- Lazaruk K, Wallin J, Holt C, Nguyen T, Walsh PS: Sequence variation in humans and other primates at six short tandem repeat loci used in forensic identity testing. Forensic Sci Int. 2001, 119 (1): 1-10. 10.1016/S0379-0738(00)00388-1.
- Mulero JJ, Chang CW, Calandro LM, Green RL, Li Y, Johnson CL, Hennessy LK: Development and validation of the AmpFlSTR Yfiler PCR amplification kit: a male specific, single amplification 17 Y-STR multiplex system. J Forensic Sci. 2006, 51 (1): 64-75. 10.1111/j.1556-4029.2005.00016.x.
- Pakstis AJ, Speed WC, Kidd JR, Kidd KK: Candidate SNPs for a universal individual identification panel. Hum Genet. 2007, 121: 305-17. 10.1007/s00439-007-0342-2.
- Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, Macdonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC: The Diploid Genome Sequence of an Individual Human. PLoS Biol. 2007, 5: e254. 10.1371/journal.pbio.0050254.
- HapMap database. [http://www.hapmap.org]
- Cummings L, Riley L, Black L, Souvorov A, Resenchuk S, Dondoshansky I, Tatusova T: BLAST: custom-defined virtual databases for complete and unfinished genomes. FEMS Microbiol Lett. 2002, 216: 133-8. 10.1111/j.1574-6968.2002.tb11426.x.
- Database of Genomic Variants. [http://projects.tcag.ca/variation/]
- Genepop. [http://genepop.curtin.edu.au/genepop_op6.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.