Silhouette scores for assessment of SNP genotype clusters
© Lovmar et al; licensee BioMed Central Ltd. 2005
Received: 16 November 2004
Accepted: 10 March 2005
Published: 10 March 2005
High-throughput genotyping of single nucleotide polymorphisms (SNPs) generates large amounts of data. In many SNP genotyping assays, the genotype assignment is based on scatter plots of signals corresponding to the two SNP alleles. In a robust assay the three clusters that define the genotypes are well separated and the distances between the data points within a cluster are short. "Silhouettes" is a graphical aid for interpretation and validation of data clusters that provides a measure of how well a data point was classified when it was assigned to a cluster. Thus "Silhouettes" can potentially be used as a quality measure for SNP genotyping results and for objective comparison of the performance of SNP assays at different circumstances.
We created a program (ClusterA) for calculating "Silhouette scores", and applied it to assess the quality of SNP genotype clusters obtained by single nucleotide primer extension ("minisequencing") in the Tag-microarray format. A Silhouette score condenses the quality of the genotype assignment for each SNP assay into a single numeric value, which ranges from 1.0, when the genotype assignment is unequivocal, down to -1.0, when the genotype assignment has been arbitrary. In the present study we applied Silhouette scores to compare the performance of four DNA polymerases in our minisequencing system by analyzing 26 SNPs in both DNA polarities in 16 DNA samples. We found Silhouettes to provide a relevant measure for the quality of SNP assays at different reaction conditions, illustrated by the four DNA polymerases here. According to our result, the genotypes can be unequivocally assigned without manual inspection when the Silhouette score for a SNP assay is > 0.65. All four DNA polymerases performed satisfactorily in our Tag-array minisequencing system.
"Silhouette scores" for assessing the quality of SNP genotyping clusters is convenient for evaluating the quality of SNP genotype assignment, and provides an objective, numeric measure for comparing the performance of SNP assays. The program we created for calculating Silhouette scores is freely available, and can be used for quality assessment of the results from all genotyping systems, where the genotypes are assigned by cluster analysis using scatter plots.
High-throughput single nucleotide polymorphism (SNP) genotyping assays generate large amounts of data, which usually is presented as scatter plots of signals corresponding to the two SNP alleles. A robust SNP genotyping assay is characterized by large distances between the three clusters that define the genotypes and small distances between the data points within each cluster. Numeric quality measures for the scatter plots would allow objective and automatic assessment of the success of a SNP assay.
"Silhouettes" were introduced in 1987 as a general graphical aid for interpretation and validation of cluster analysis . In a Silhouettes calculation, the distance from each data point in a cluster to all other data points within the same cluster and to all data points in the closest cluster are determined. Thus Silhouettes provides a measure of how well a data point was classified when it was assigned to a cluster by according to both the tightness of the clusters and the separation between them. This feature renders Silhouettes potentially well suited for assessing cluster quality in SNP genotyping methods. In high-throughput SNP genotyping, Silhouettes could be used for assessing the quality of automatic genotype assignment by alerting the operator if the quality of the genotype clusters fall below a certain limit. During assay development and optimization, Silhouettes could be used to compare the performance of a genotyping assay at different reaction conditions. It could also be applied for comparing the robustness of different SNP genotyping technologies.
In this study we created a program (ClusterA) to calculate numeric Silhouettes for assessing the quality of genotype clusters obtained in SNP genotyping assays. We show the utility of Silhouettes and the program by applying it to our "in-house" developed four-color fluorescence minisequencing system for SNP genotyping in a microarray format . Single nucleotide primer extension ("minisequencing") is the reaction principle underlying several of the commonly used systems for genotyping single nucleotide polymorphisms (SNPs) [3–8]. In minisequencing a DNA polymerase is employed to specifically extend a detection primer designed to anneal directly adjacent to the SNP position in the complementary DNA strand with a single labelled nucleotide analogue. The DNA polymerase is the most important factor that determines the efficiency and specificity of the primer extension reaction, irrespectively of the assay format. We used Silhouettes to compare the performance of three new commercially available DNA polymerases to the ThermoSequenase DNA polymerase, which is routinely used in minisequencing assays in many laboratories, including our own. We found Silhouettes to provide a relevant measure, in addition to signal-to-noise ratios and genotyping success, for selecting the most favourable enzyme for our assay.
Results and Discussion
The examples in panels E, F and G of Figure 2 illustrate how different clustering patterns can yield similar Silhouette scores. Based on the results from the scatter plots used to assign genotypes in this study, our recommendation is to accept the results from SNP assays with Silhouette scores >0.65 and to fail the whole assays if the Silhouette scores is <0.25. Individual genotype calls for assays where the Silhouette score falls between 0.25–0.65 may be accepted or failed after visual inspection. Excluding some of the outliers will then increase the Silhouette score. Our recommendations is in line with Liu et al., who have included silhouette calculations in the complex algorithm used to interpret the data from the Affymetrix 10K HuSNP hybridization microarray .
Here we exemplify the use of Silhouette scores by comparing the performance of the TERMIPol, Therminator, KlenThermase and ThermoSequenase DNA polymerases in the Tag-array minisequencing system . Twenty-six SNPs were analyzed in both polarities in 16 DNA samples in two independent experiments. As our Tag-array genotyping system utilizes an "array of arrays" format  with 80 subarrays on each microscope slide, we were able to test all four enzymes in all samples on the same slide at exactly the same conditions, to facilitate a fair comparison between the enzymes.
Silhouette scores, signal to noise ratios and genotyping performance for four DNA polymerases in Tag-array minisequencing1
Silhouette score 2
Genotype calls 4
In the comparison between the enzymes, KlenThermase displayed the highest average Silhouette score, ThermoSequenase had the highest median Silhouette score and also obtained the highest Silhouette score most frequently (Table 1). In addition to the Silhouettes scores, that represent a measure of the robustness of a SNP assay, the signal to noise ratios (S/N) and the genotyping success was assessed (Table 1). All four enzymes performed satisfactorily in our minisequencing assay taking into account the non-stringent genotyping criteria used. However, performance varied between the evaluated features with high error rates for Therminator and ThermoSequenase. KlenThermase showed the best results over all and, also taking into account the cost, would be the enzyme of choice based on the results from this study.
We conclude that "Silhouette scores" for assessing the cluster quality is well suited for comparing the performance of SNP assays. Here we used a one-dimensional calculation of the Silhouette scores, by measuring the distances between the data-points along the x-axis only. A two-dimensional Silhouette calculation using vectors should be applied when genotypes are assigned by scatter plots with the fluorescence signals corresponding to the two alleles on the y- and x-axis. Both options are available in the ClusterA program that also calculates mean, variance and F-statistic for the input data set. The program is freely available through our website http://www.medsci.uu.se/molmed/software.htm. We believe that the ClusterA program for calculating Silhouette scores created in the present study is a useful and general tool for any genotyping system, where the genotypes are called by cluster analysis with the aid of scatter plots.
Genomic DNA was extracted from blood samples from 16 volunteer blood donors using the Wizard genomic DNA purification kit (Promega, Madison, WI).
Twenty-six SNPs, selected to be located in unique PCR amplicons, were included in the test panel. For information on the single nucleotide polymorphisms and oligonucleotides used, see the Additional file 1: SNPinformation.pdf. PCR primers were designed and combined in multiplex PCR reactions. Minisequencing primers with 20 bp 5'-Tag sequences were designed for both DNA polarities. The experimental details of the genotyping procedure have been described in detail previously . In short it included the following steps: The regions containing the sequence variations were amplified in six optimized multiplex PCRs. For each sample the PCR products were pooled and divided into four aliquots, one for each enzyme. The remaining dNTPs and primers from the PCR reaction mixture were removed by treatment with Exonuclease I and shrimp alkaline phosphatase. The cyclic minisequencing reactions were performed in solution as described below, and the extended minisequencing primers were hybridized to microarrays carrying immobilized covalently coupled oligonucleotides (cTags) complementary to the Tag-sequences of the minisequencing primers. The cTags had been immobilized to CodeLinkTM Activated Slides (Amersham Biosciences, Uppsala, Sweden) via their 3'-end NH2-groups to form 80 subarrays per slide, each with 60 cTags as duplicate spots. Finally the microarray slides were scanned, and the fluorescent signals were measured.
Cyclic minisequencing reactions were performed in solution with 10 nM of each of the 52 tagged minisequencing primers using 0.1 μM ddATP-Texas Red, ddCTP-Tamra and ddGTP-R110 and 0.15 μM ddUTP-Cy5 (Perkin-Elmer Life Sciences, Boston, MA), and 0.064 U/μl of one of the four DNA polymerases in 15μl of 0.02% Triton-X, 4.1 mM MgCl2 and 33.6 mM Tris-HCl pH 9.5. The cyclic extension reactions were performed on a Thermal Cycler PTC-225 (MJ Research, Watertown, MA) with an initial 96°C for 3 min followed by 55 cycles of 95°C and 55°C for 20 s each. The DNA polymerases were; TERMIPol (Solis BioDyne, Tartu, Estonia), Therminator (New England BioLabs Inc., Beverly, MA, USA), KlenThermase (Gene Craft, Lüdinghausen, Germany), or ThermoSequenase (Amersham Biosciences, Uppsala, Sweden). A custom made reaction rack holding the arrayed slides with a silicon grid to give 80 separate reaction chambers was used during capture of the minisequencing reaction products on the Tag-arrays.
Data analysis and genotype assignment
The fluorescence signals were measured from the microarray slides using a ScanArray Express® instrument (Perkin-Elmer Life Sciences, Boston, MA). The excitation lasers were: Blue Argon 488 nm for R110; Green HeNe 543.8 nm for Tamra; Yellow HeNe 594 nm for Texas Red and Red HeNe 632.8 nm for Cy5. The fluorescence signal intensities were determined using the QuantArray®analysis 3.1 software (Perkin-Elmer Life Sciences, Boston, MA). The QuantArray file was exported to the SNPSnapper v4.0 software http://www.bioinfo.helsinki.fi/SNPSnapper/) for genotype assignment. Raw data as fluorescence signals and signal ratios are provided as supplementary material, see Additional file 2: Rawdata.txt. Genotypes were assigned based on scatter plots with the logarithm of the sum of both fluorescence signals (SignalAllele1+SignalAllele2) plotted on the y-axis, and the fluorescence signal fraction, obtained by dividing the fluorescence signals from one allele by the sum of the fluorescence signal from both SNP alleles (SignalAllele2/ (SignalAllele1+SignalAllele2), on the x-axis . The result file with the assigned genotypes and the corresponding signal ratios were exported as a text file and used to calculate Silhouettes scores using the ClusterA program. ClusterA is implemented in Microsoft Visual Basic 6.0, and can be run on PCs with the Microsoft Windows operating system. The ClusterA program also provides the mean, variance and F-statistic for the input data.
The study was supported by grants from the Swedish Research Council (VR-NT). The array spotter instrument was purchased by funding from the Wallenberg Consortium North (WCN). We thank Raul Figueroa for producing the Tag-arrays
- Rousseeuw PJ: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987, 20: 53-65. 10.1016/0377-0427(87)90125-7.View ArticleGoogle Scholar
- Lindroos K, Sigurdsson S, Johansson K, Rönnblom L, Syvänen AC: Multiplex SNP genotyping in pooled DNA samples by a four-colour microarray system. Nucleic Acids Res. 2002, 30: e70-10.1093/nar/gnf069.PubMedPubMed CentralView ArticleGoogle Scholar
- Syvanen AC: Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat Rev Genet. 2001, 2: 930-942. 10.1038/35103535.PubMedView ArticleGoogle Scholar
- Chen X, Levine L, Kwok PY: Fluorescence polarization in homogeneous nucleic acid analysis. Genome Res. 1999, 9: 492-498.PubMedPubMed CentralGoogle Scholar
- Tully G, Sullivan KM, Nixon P, Stones RE, Gill P: Rapid detection of mitochondrial sequence polymorphisms using multiplex solid-phase fluorescent minisequencing. Genomics. 1996, 34: 107-113. 10.1006/geno.1996.0247.PubMedView ArticleGoogle Scholar
- Fan JB, Chen X, Halushka MK, Berno A, Huang X, Ryder T, Lipshutz RJ, Lockhart DJ, Chakravarti A: Parallel genotyping of human SNPs using generic high-density oligonucleotide tag arrays. Genome Res. 2000, 10: 853-860. 10.1101/gr.10.6.853.PubMedPubMed CentralView ArticleGoogle Scholar
- Bell PA, Chaturvedi S, Gelfand CA, Huang CY, Kochersperger M, Kopla R, Modica F, Pohl M, Varde S, Zhao R, Zhao X, Boyce-Jacino MT, Yassen A: SNPstream UHT: ultra-high throughput SNP genotyping for pharmacogenomics and drug discovery. Biotechniques. 2002, Suppl: 70-2, 74, 76-7.PubMedGoogle Scholar
- Kurg A, Tonisson N, Georgiou I, Shumaker J, Tollett J, Metspalu A: Arrayed primer extension: solid-phase four-color DNA resequencing and mutation detection technology. Genet Test. 2000, 4: 1-7. 10.1089/109065700316408.PubMedView ArticleGoogle Scholar
- Liu WM, Di X, Yang G, Matsuzaki H, Huang J, Mei R, Ryder TB, Webster TA, Dong S, Liu G, Jones KW, Kennedy GC, Kulp D: Algorithms for large-scale genotyping microarrays. Bioinformatics. 2003, 19: 2397-2403. 10.1093/bioinformatics/btg332.PubMedView ArticleGoogle Scholar
- Pastinen T, Raitio M, Lindroos K, Tainola P, Peltonen L, Syvänen AC: A system for specific, high-throughput genotyping by allele-specific primer extension on microarrays. Genome Res. 2000, 10: 1031-1042. 10.1101/gr.10.7.1031.PubMedPubMed CentralView ArticleGoogle Scholar
- Lovmar L, Fredriksson M, Liljedahl U, Sigurdsson S, Syvänen AC: Quantitative evaluation by minisequencing and microarrays reveals accurate multiplexed SNP genotyping of whole genome amplified DNA. Nucleic Acids Res. 2003, 31: e129-10.1093/nar/gng129.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.