We began our analysis with the Sanger sequence data generated as part of the original rhesus macaque genome project. A single animal (animal ID 17573) from the Indian-origin rhesus colony at the Southwest National Primate Research Center (San Antonio, TX) was sequenced to approximately 5.2X coverage (18.4 Gb of raw sequence) using Sanger whole-genome shotgun methods. Additional details of that reference genome analysis have been published previously. From these Sanger data, approximately 4.3 million base pair positions were called as potentially heterozygous using SNPdetector (Houston, TX). We obtained additional genomic DNA from animal 17573 and created two libraries for SOLiD sequencing, one for fragment sequencing and one for mate-pair sequencing (Table S1, Additional file 3). Using published methods [41, 42], we generated 5X fragment sequence coverage (50 bp fragment reads; SRA accession: SRX029055) and approximately 8X mate-pair sequence coverage (25 × 25 bp mate-pair reads; SRA accession: SRX029056). The proportion of matching beads to total beads was 49.52% for the fragment library and 55.22% for the mate-pair library (F3+R3). Following the re-sequencing of the original reference animal, we selected two unrelated Indian-origin rhesus macaques (animal IDs r1766 (SRA accession: SRX029057) and r02120 (SRA accession: SRX029058)) from the population at the Wisconsin National Primate Research Center (University of Wisconsin, Madison). Each DNA sample was used to produce a fragment sequencing library, and each library was sequenced to low coverage (50 bp HiDRA; r1766 to 5.8X, r02120 to 3.3X). The proportion of matching beads to total beads was 58.64% and 49.89%, respectively. The library for r02120 yielded a lower proportion of unique beads than those for r1766 and 17573, but still produced a substantial amount of high-quality data.
We also downloaded additional 454 sequence reads (26.2 Mb) from NCBI. These reads were produced during the original rhesus sequencing project as part of a comparison of rhesus macaque sub-species (8 Chinese-origin and 8 Indian-origin unrelated rhesus macaques). SNPs were called from these reads using AtlasSNP2 with default settings and without filtering for coverage. The sub-species comparison read data were provided at 0.5X coverage. A total of 142,781 potential SNPs were identified in the sub-species comparison data set.
SNP calling and validation methods
Bases that are heterozygous in a given individual were identified using a different method for each sequencing technology. The original Sanger read data for the reference animal were searched for SNPs (i.e. heterozygous base pair positions) using SNPdetector. For the SOLiD read data, corona_lite v4.0 r2.0 (Life Technologies, Carlsbad, CA) was used. SNPs were called in the sub-species comparison data using AtlasSNP2 as described above.
We considered all base pair positions called as heterozygous in any one dataset as "potential SNPs." In order to classify a given SNP as "validated," we required that the specific base pair position be called as an identical heterozygote in two independent datasets. This could mean either observing that position as heterozygous in the same individual using two different sequencing methods, or observing that position to be heterozygous, with identical alleles, in two different animals. Two complementary methods were used to identify SNPs in the data sets described above: positional-allele comparison and e-genotyping.
Positional-allele comparison relies upon parameters of sequence quality and read coverage as implemented in various SNP calling programs (in this study we used SNPdetector, corona_lite, and AtlasSNP2). This method compares lists of identified potential SNPs by chromosome, base pair location, and both alleles. It can identify multiple SNPs clustered close together, as well as SNPs near breakpoints in the reference sequence. However, it has a potentially higher false negative rate than e-genotyping, due to the relative stringency of the SNP calling programs used upstream. To reduce false negative calls, we used stringency thresholds somewhat lower than those used in some other studies. Our settings generally retained potential SNPs with read quality scores equivalent to a Phred score of 20 and a minimum of 2 reads covering each variant allele call.
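The thresholds above can be illustrated with a minimal Python sketch. It is not the code used in the study: the `AlleleSupport` record and the interpretation that each allele needs at least 2 supporting reads at Phred 20 or higher are assumptions made for illustration. The Phred conversion itself is standard: a score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 means a 1% chance of a wrong base call.

```python
from dataclasses import dataclass

MIN_PHRED = 20            # quality threshold stated in the text
MIN_READS_PER_ALLELE = 2  # read-count threshold stated in the text


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


@dataclass(frozen=True)
class AlleleSupport:
    """Hypothetical record: one allele observed at one candidate position."""
    base: str
    read_qualities: tuple  # Phred score of this base in each supporting read


def passes_filter(allele_a: AlleleSupport, allele_b: AlleleSupport) -> bool:
    """Keep a candidate heterozygous call only if BOTH alleles have enough
    supporting reads at or above the quality threshold (assumed reading of
    the thresholds described in the text)."""
    for allele in (allele_a, allele_b):
        good_reads = [q for q in allele.read_qualities if q >= MIN_PHRED]
        if len(good_reads) < MIN_READS_PER_ALLELE:
            return False
    return True
```

For example, a candidate with alleles supported by qualities (30, 25) and (22, 20) would be retained, whereas one whose second allele is supported by qualities (35, 12) would not, because only one of its reads reaches Phred 20.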
E-genotyping is a novel approach that takes advantage of an a priori defined set of potential SNPs and tests raw reads (color-space or sequence-space data) directly for exact probe matches to the previously defined SNP location and sequence. A probe region consists of the potential SNP position plus 15 bases on each side, 31 bases in total. These flanking regions must be pre-defined and must match the target reads in the new sequence dataset exactly, with different reads providing exact matches to the two expected alleles at the potential SNP position. This method of validating potential SNPs in a new read set is restricted by the exact-matching requirement, which is also what confers its high specificity. When e-genotype was used to screen human 1000 Genomes read data for known SNPs (from dbSNP) in the same individual, the error call rates were extremely low for miscalled homozygotes (called as the wrong homozygote, 0.002%) and heterozygotes (called as the wrong heterozygote, 1.01%, or as a homozygote, 1.52%) (Figure S3, Additional file 4). dbSNP concordance in human data from the 1000 Genomes pilot study using e-genotype indicates that the total miscall rate is ~2.6% (Figure S3, Additional file 4). When SNPs with extremely high or low coverage (delineated by dotted lines on the figure) are removed from consideration, the miscall rate is calculated to be 1.6%; high-coverage SNPs are typically non-specific, and low-coverage SNPs are often miscalled heterozygotes where coverage was simply insufficient to detect both alleles. Because of the stringency of probe placement, e-genotype has much reduced effective read coverage for any SNP, but for the SNPs that are called by e-genotype, the probability of a false positive call is extremely low. E-genotype has significantly reduced power to detect clustered SNPs, especially those in linkage disequilibrium, when they fall within the 31 base pair probe region. E-genotype is unable to identify SNPs located within 15 base pairs of gap regions in the reference sequence or of chromosome ends.
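The probe-matching idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the e-genotype implementation: all function names are hypothetical, only sequence-space reads are handled (color-space matching is omitted), reverse-strand matching is an assumption, and gaps in the reference are assumed to be represented by `N`.

```python
FLANK = 15  # bases on each side of the SNP, as described in the text

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_probes(reference, pos, alleles):
    """Return {allele: 31-base probe} for a 0-based position, or None when a
    full flank cannot be formed (chromosome end, or a gap 'N' in a flank)."""
    if pos < FLANK or pos + FLANK >= len(reference):
        return None
    left = reference[pos - FLANK:pos]
    right = reference[pos + 1:pos + 1 + FLANK]
    if "N" in left or "N" in right:
        return None
    return {allele: left + allele + right for allele in alleles}


def count_allele_reads(probes, reads):
    """Count reads containing an exact match to each allele's probe.
    Matching on either strand is an assumption of this sketch."""
    counts = {allele: 0 for allele in probes}
    for read in reads:
        for allele, probe in probes.items():
            if probe in read or reverse_complement(probe) in read:
                counts[allele] += 1
    return counts


def call_genotype(counts, min_reads=1):
    """Call a heterozygote only when both alleles have supporting reads."""
    supported = sorted(a for a, n in counts.items() if n >= min_reads)
    if len(supported) >= 2:
        return "het:" + "/".join(supported)
    if len(supported) == 1:
        return "hom:" + supported[0]
    return "no_call"
```

Because a read must contain the whole 31-base probe without a mismatch, a second variant inside the flanks prevents a match. This is consistent with the reduced power for clustered SNPs and the inability to place probes next to gaps or chromosome ends noted above.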
We developed a Perl pipeline to compare chromosome, base pair location, and both alleles for all pair-wise comparisons of SNP lists. At least one read of sufficient quality (equivalent to a minimum Phred score of 20) for each allele in each data set was required for a positive result. All instances with more than 200 reads covering an allele were removed, irrespective of the average read coverage for the data set, due to presumed mis-mapping or non-unique mapping. Once these comparisons were completed, all of the validated SNPs were combined into a single file and duplicates were removed by chromosome and base pair position.
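The comparison logic can be sketched as follows. The study's pipeline was written in Perl; this Python version is only an illustration, and the input tuple layout, function names, and dataset labels are hypothetical. Read counts are assumed to include only reads that already meet the Phred 20 threshold. Deduplication here uses chromosome, position, and allele pair as the key, a slight simplification of deduplicating by chromosome and position alone.

```python
from itertools import combinations

MIN_READS_PER_ALLELE = 1    # at least one qualifying read per allele
MAX_READS_PER_ALLELE = 200  # above this, mis-mapping is presumed


def snp_key(chrom, pos, allele_a, allele_b):
    """Order-independent key: chromosome, position, and both alleles."""
    return (chrom, pos, tuple(sorted((allele_a, allele_b))))


def usable_calls(calls):
    """Filter one dataset's calls by per-allele read count.

    `calls` is an iterable of hypothetical tuples:
    (chrom, pos, allele_a, reads_a, allele_b, reads_b).
    """
    keep = set()
    for chrom, pos, allele_a, reads_a, allele_b, reads_b in calls:
        if all(MIN_READS_PER_ALLELE <= n <= MAX_READS_PER_ALLELE
               for n in (reads_a, reads_b)):
            keep.add(snp_key(chrom, pos, allele_a, allele_b))
    return keep


def validate(datasets):
    """Return SNPs seen with identical alleles in at least two datasets.

    `datasets` maps a dataset label to its list of calls. Collecting the
    results in a set removes duplicates across pair-wise comparisons.
    """
    filtered = {name: usable_calls(calls) for name, calls in datasets.items()}
    validated = set()
    for left, right in combinations(filtered, 2):
        validated |= filtered[left] & filtered[right]
    return validated
```

A position called A/G in one dataset and G/A in another counts as a match because the alleles are sorted in the key, whereas A/C in one and A/T in another does not.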
E-genotyping (http://is04607.com/~drio/egenotype/) was performed on all the SOLiD read sets from animals 17573, r1766 and r02120. The probe sets were created from: a) the Sanger data from the original reference animal (17573), b) the novel SNP calls from corona_lite for r1766 and r02120, or c) the reads produced using Roche 454 methods on unrelated animals in the original rhesus genome paper. All the probe sets were checked for unique mapping locations by mapping to the reference genome assembly prior to running e-genotype on experimental read sets. When comparing data from different chemistries, e-genotyping results with at least one read for each allele were considered positive.
We annotated the SNPs using a Java tool that accessed Ensembl (build 57, March 2010). Annotation of SNPs included the genomic placement (intronic, intergenic, synonymous coding, nonsynonymous coding, 5'UTR, 3'UTR, or splice site) for all applicable transcripts. If a SNP was determined to fall in a genic region (coding and non-coding), the following annotations were added as applicable: gene name, gene function, codon, reference amino acid, variant amino acid, protein position.
All of the validated SNPs have been submitted to dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/) (ss numbers available in Table S2, Additional files 5, 6, 7, 8, 9 and 10). Unvalidated potential SNPs will be available at our lab Genboree site (http://genboree.org/java-bin/project.jsp?projectName=Rhesus%20SNPs%20using%20Next-Gen%20Sequencing&isPublic=Yes).