Our aim was to find gene changes with fully penetrant effects that would cause pain phenotypes. We designed our approach based on three tenets. Firstly, the SNP alleles that would individually cause fully penetrant (and detectable) effects were most likely to be exonic and protein changing (missense, nonsense, indels, or splicing). Secondly, an exome would detect the majority of these SNPs. Thirdly, smaller cohorts are easier to collect, phenotype and curate.
We wrote the fSNPd program to analyse multiple exomes simultaneously, with the purpose of identifying common SNP alleles that exist in a population of interest at a significantly different frequency from that in the general population. To do this, the program first establishes a catalogue of common SNPs based on an existing cohort; we used the 1000 Genomes Project and the Exome Variant Server (EVS), as both were available for download at the time. Confounding variables can be limited by choosing a subset of such a cohort based on ethnicity/location to match the demographics of the population of interest.
In our experiments, we tested the European subset of the 1000 Genomes Project (the UK subset, at about 100 individuals, was unfortunately too small for this purpose) and the EVS, and considered SNPs that differed from at least one of these at a significant level to be relevant. The 1000 Genomes Project had the advantage of being a better ethnic match for our cohort (predominantly white Caucasian), while the EVS had data from a greater number of individuals (approximately 20,000 individuals). In the final program, fSNPd produces results for both.
The program calculates rare allele frequencies of SNPs in the population of interest. Individuals’ data is handled in vcf format (post variant-calling) rather than as raw sequencing reads, so that each locus is assigned two possible alleles. Population allele frequencies are determined by tallying the counts of each allele at each SNP locus across all of the individuals in the population of interest and dividing by the total number of presumed alleles at each locus (two per individual for loci on autosomal chromosomes and one (male) or two (female) per individual for those on sex chromosomes).
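The tally described above can be illustrated with a minimal Python sketch. It is not the fSNPd source code: the function name `allele_frequency`, its parameters, and the `sex_linked` flag are hypothetical, and the per-individual allele counts are assumed to have already been extracted from the vcf genotypes.

```python
def allele_frequency(alt_allele_counts, n_males, n_females, sex_linked=False):
    """Frequency of one allele at one locus in the population of interest.

    alt_allele_counts: copies of the allele carried by each individual
        (hypothetical input, assumed to come from the vcf genotype calls).
    sex_linked: if True, count one presumed allele per male and two per
        female; otherwise count two per individual (autosomal locus).
    """
    if sex_linked:
        total_alleles = n_males + 2 * n_females
    else:
        total_alleles = 2 * (n_males + n_females)
    if total_alleles == 0:
        raise ValueError("the cohort contains no individuals")
    return sum(alt_allele_counts) / total_alleles


# Four individuals (two male, two female) carrying 0, 1, 2 and 1 copies.
autosomal = allele_frequency([0, 1, 2, 1], n_males=2, n_females=2)
```

Under these assumptions, the only difference between autosomal and sex-linked loci is the denominator, which is why the program needs the number of males and females in the cohort.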
Finally, allele frequencies for each SNP in the starting catalogue are compared between the population of interest and the general population using a two-tailed chi-squared test without Yates’ correction. SNPs with significantly higher or lower allele frequencies in the population of interest are identified. Differences are considered significant at a p-value threshold of 2%, with Bonferroni and false discovery rate (FDR) corrections applied according to the total number of SNPs compared (the number in the initial catalogue from the existing cohort) [17].
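As an illustration of this comparison, the following Python sketch computes an uncorrected chi-squared test on a 2×2 table of allele counts and then applies the two multiple-testing corrections. It is a sketch under stated assumptions rather than the fSNPd implementation: the function names are hypothetical, and the Benjamini-Hochberg procedure is assumed for the FDR step because the text does not name a specific method.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no Yates' correction) on [[a, b], [c, d]].

    Rows: population of interest, general population.
    Columns: count of the SNP allele, count of all other alleles.
    Returns (statistic, p_value) for one degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column: no evidence of a difference
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x/2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def bonferroni(p_values, n_tests):
    """Multiply each p-value by the number of SNPs compared, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR-adjusted p-values (Benjamini-Hochberg, assumed here)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 20 of 100 alleles in the cohort vs 50 of 100 in the reference.
stat, p = chi2_2x2(20, 80, 50, 50)
significant = p <= 0.02  # the 2% threshold, before any correction
```

In this sketch `n_tests` passed to `bonferroni` would be the number of SNPs in the initial catalogue, matching the description above.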
Additionally, for SNPs that do not exist in particular individuals in the population of interest, raw sequencing data can be accessed to confirm coverage of the area. If no coverage exists, fSNPd can calculate upper and lower bounds of allele frequencies and subsequent p-values.
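One way such bounds could be obtained is sketched below in Python. The text does not specify how fSNPd derives them, so this is an assumption: the lower bound treats every uncovered individual as carrying no copies of the allele, and the upper bound treats each as carrying the maximum number of copies. The function name and parameters are hypothetical.

```python
def frequency_bounds(observed_count, n_individuals, n_uncovered, ploidy=2):
    """Lower and upper bounds on an allele frequency at one locus.

    observed_count: copies of the allele seen in individuals with coverage.
    n_uncovered: individuals with no sequencing coverage at the locus.
    Assumption: uncovered individuals carry between 0 and `ploidy` copies.
    """
    if not 0 <= n_uncovered <= n_individuals:
        raise ValueError("n_uncovered must lie between 0 and n_individuals")
    total_alleles = ploidy * n_individuals
    if total_alleles == 0:
        raise ValueError("the cohort contains no individuals")
    lower = observed_count / total_alleles
    upper = (observed_count + ploidy * n_uncovered) / total_alleles
    return lower, upper


# 20 individuals, 2 without coverage, 6 copies observed among the rest.
low, high = frequency_bounds(6, n_individuals=20, n_uncovered=2)
```

Each bound can then be tested against the general population frequency separately, which yields the corresponding range of p-values.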
The inputs to the final fSNPd program are each individual’s vcf file and the number of males and females. Optionally, fSNPd can check the actual read depth and quality of each SNP called; if this option is selected, the bam and bam.bai files of each individual in the cohort are also required. See Additional file 1: Supplementary data for fSNPd setup and URLs for downloading programs.
We wanted to determine how many SNPs were accurately sequenced in an exome, and so analysed 40 exomes by fSNPd to determine base by base and exon by exon coverage. This analysis sought to determine whether particular genes or exons of genes would be predictably included in, or missed from, our results.
We assessed the performance of fSNPd for allele frequency determination by analysis of two patient cohorts (one of 116 individuals, the other of 34). From the results for each cohort we selected a sample of SNPs whose allele frequencies were found to be statistically altered, and checked the results “by hand” using the Integrative Genomics Viewer and by Sanger sequencing of all of the cohort individuals.
After generating results for all SNPs encompassed by an exome, we chose to examine only those SNPs that would unequivocally alter proteins (nonsense mutations, start and stop codon mutations, canonical splice site mutations, and missense mutations predicted to be potentially pathogenic). This was for two reasons: such mutations are easier to assess by bioinformatic analysis, and are more readily amenable to functional testing to determine their pathogenicity. Others, however, may seek to examine all of the SNPs identified by the pipeline.
We simulated the performance of fSNPd under a variety of conditions (see Additional file 1: Table S2).
fSNPd is freely downloadable; instructions are in the Supplement [18].