High throughput sequencing in mice: a platform comparison identifies a preponderance of cryptic SNPs

Background Allelic variation is the cornerstone of genetically determined differences in gene expression, gene product structure, physiology, and behavior. However, allelic variation, particularly cryptic (unknown or not annotated) variation, is problematic for follow up analyses. Polymorphisms result in a high incidence of false positive and false negative results in hybridization based analyses and hinder the identification of the true variation underlying genetically determined differences in physiology and behavior. Given the proliferation of mouse genetic models (e.g., knockout models, selectively bred lines, heterogeneous stocks derived from standard inbred strains and wild mice) and the wealth of gene expression microarray and phenotypic studies using genetic models, the impact of naturally-occurring polymorphisms on these data is critical. With the advent of next-generation, high-throughput sequencing, we are now in a position to determine to what extent polymorphisms are currently cryptic in such models and their impact on downstream analyses. Results We sequenced the two most commonly used inbred mouse strains, DBA/2J and C57BL/6J, across a region of chromosome 1 (171.6 – 174.6 megabases) using two next generation high-throughput sequencing platforms: Applied Biosystems (SOLiD) and Illumina (Genome Analyzer). Using the same templates on both platforms, we compared realignments and single nucleotide polymorphism (SNP) detection with an 80 fold average read depth across platforms and samples. While public datasets currently annotate 4,527 SNPs between the two strains in this interval, thorough high-throughput sequencing identified a total of 11,824 SNPs in the interval, including 7,663 new SNPs. Furthermore, we confirmed 40 missense SNPs and discovered 36 new missense SNPs. Conclusion Comparisons utilizing even two of the best characterized mouse genetic models, DBA/2J and C57BL/6J, indicate that more than half of naturally-occurring SNPs remain cryptic. The magnitude of this problem is compounded when using more divergent or poorly annotated genetic models. This warrants full genomic sequencing of the mouse strains used as genetic models.


Conclusion:
Comparisons utilizing even two of the best characterized mouse genetic models, DBA/2J and C57BL/6J, indicate that more than half of naturally-occurring SNPs remain cryptic. The magnitude of this problem is compounded when using more divergent or poorly annotated genetic models. This warrants full genomic sequencing of the mouse strains used as genetic models.

Background
With the recent completion of the Perlegen/NIEHS mouse resequencing [1], over ten million mouse single nucleotide polymorphisms (SNPs) are now annotated in the public databases, resulting in a dramatic increase in the genome-wide knowledge of variation among 16 of the most widely used mouse strains. Importantly, this is a lower bound estimate because the C57BL/6J (B6) strain, used for the mouse genome reference sequence (NCBI m37, Apr 2007), is the only mouse strain sequenced in its entirety. Given the proliferation of mouse genetic models (e.g., knockout models, selectively bred lines, heterogeneous stocks derived from standard inbred strains and wild mice) and the recent insurgence of gene expression microarray and phenotypic studies using these mouse models, the impact of naturally-occurring polymorphisms on these data is critical. With the advent of next-generation, high-throughput sequencing (HTS), we are now in a position to determine to what extent polymorphisms remain cryptic (either undiscovered or not previously annotated as a SNP in a specific strain comparison) in various mouse models and assess their impact on downstream analyses. As an example, we demonstrate that cryptic SNPs are prevalent, even between two of the most commonly used and well-annotated inbred mouse strains, B6 and DBA/2J (D2).
The onset of next-generation HTS enabled us to obtain full sequence coverage of a region of chromosome 1 in the D2 and B6 mouse strains. There are several platforms for massively parallel DNA sequencing currently on the market [2], and we took this opportunity to directly compare the same dataset on two of the three most widely used platforms: Illumina (Genome Analyzer) and Applied Biosystems (SOLiD). The Genome Analyzer implements a version of cyclic reversible termination chemistry [3], and similarly, the SOLiD platform uses a self-checking ligation chemistry that maps into color space [4]. Both methods generated short reads that were then realigned to a reference sequence.
The present analyses were limited to a region of chromosome 1 from 171.6 -174.6 megabases (Mb). This interval was selected for four reasons. First, it is representative of the genome in that it spans discrete regions of high and low SNP densities. Second, it is a gene dense region con-taining 79 protein coding genes, 2 retrotransposed genes, and 6 noncoding RNA genes. Third, it harbors numerous quantitative trait loci (QTLs) affecting a wide variety of physiological and behavioral phenotypes [for summary, see [5]]. And finally, the low incidence of annotated polymorphisms in a SNP-sparse block has hindered high-resolution mapping in this region [6].
We report that comparisons utilizing two well-annotated mouse genetic models, D2 and B6, predict that more than half of naturally-occurring SNPs remain unknown or not annotated. These cryptic SNPs lead to a high incidence of gene expression microarray false-positive and false-negative results and lead to failures in identifying allelicallyvariant genes that can underlie QTL phenotypic effects.

D2 BAC contig and sequence
Using 32 PCR probes spanning the 171.6 -174.6 Mb interval of chromosome 1, we screened a commercially available D2 strain bacterial artificial chromosome (BAC) library, MM_DBa. End-sequencing of the BACs identified by these PCR probes allowed us to assemble a minimal overlapping contig of 27 D2 BACs ( Figure 1). The resulting D2 BAC contig spanned a total of 3.1 Mb from 171,509,721 -174,625,201 bp on chromosome 1 (Build 37). Contig sequence data from the SOLiD and Genome Anlayzer platforms were assembled via realignment to the Ensembl reference sequence [7] and covered the region surveyed without any large gaps.

B6 BAC contig and sequence
To evaluate our realignment strategy, we sequenced the corresponding region from the B6 strain, for which full reference sequence is available [7]. We prepared a B6 contig based on public end-sequence data for BAC clones from the RPCI-23 library [8]. The resulting B6 BAC contig spanned 170,806,384 -174,768,169 bp on chromosome 1 ( Figure 1). Using both the SOLiD and Genome Analyzer datasets we attained realignment coverage with the exception of two gaps. The first gap (11 kb) was expected since the RPCI-23 BAC library map lacks annotated coverage in this region. The second gap (100 kb) could be due to an error in the mapping of one or two of the RPCI-23 BACs, since we relied upon reported locations of the B6 BAC library clones, or, alternatively, could be due to one or two clones missing from our pools. These gaps were present in both assemblies indicating a template problem rather than a sequencing discrepancy.

Comparison of Applied Biosystems and Illumina sequence realignments
Applied Biosytems and Illumina each performed realignments of the D2 and B6 datasets to the public B6 reference sequence [7]. The SOLiD platform produces short reads (35 bp) encoded in a color-space format, and Applied Biosystems used their realignment pipeline specially designed to take advantage of this format for color-space read mapping and downstream analysis including SNP detection. The Genome Analyzer platform produces short read (33 bp) datasets with bases encoded in the standard letter representations, so Illumina carried out a different realignment approach and ran their dataset through the Maq [9] pipeline which includes both realignment and SNP calling procedures. Because both datasets included only short reads, we did not assess potential insertion/ deletions.
Comparison of our SOLiD and Genome Analyzer sequence data to the Ensembl reference sequence confirmed that the sequence from both platforms is complete. SOLiD generated 31.8 million reads for the pooled D2 BACs, with 12.4 million (39%) of those mapping to the chromosome 1 target interval. SOLiD generated 32.7 million reads of pooled B6 BACs with 10.5 million (32%) mapping to the chromosome 1 interval. Genome Analyzer generated 14.5 million D2 reads with 9.6 million (66%) mapping to the chromosome 1 interval, and 14.0 million B6 reads with 6.2 million (44%) mapping to the chromosome 1 region. Reads that did not align to the chromosome 1 interval were mostly BAC vector sequence (which was not removed when preparing the DNA), adapters from sequencing chemistry, and bacterial contamination. These are template specific problems and do not reflect upon differences between the sequencing platforms. Additional purification steps would have resulted in fewer unaligned sequences; however, the aligned sequences provided sufficient high read depth coverage, so the exclusion of the unaligned sequences did not adversely affect the present analyses.
Deep sequencing was achieved, with average read depths using the Genome Analyzer of 75 and 51 for D2 and B6, respectively, and 108 and 85, respectively, using SOLiD, for reads aligning to the chromosome 1 interval ( Figure  2). Figure 1 illustrates the BAC contig composition. Each BAC was pooled equimolarly and, as expected, we saw higher coverage (i.e., two or more times as many reads) in the areas where two or more BACs overlap because more template was included for those regions ( Figure 2). We observed differences in the masking of repetitive sequences in Applied Biosystems vs. Illumina realignments, suggesting that repetitive regions are problematic for both platforms with short read sequencing. Paired-end sequencing was not performed in the present analyses, but could decipher potential ambiguous, repetitive regions in future analyses.

B6 reference sequence quality
Although comparison of our custom HTS B6 data to the B6 reference sequence confirmed the high accuracy of the reference sequence (≥ 99.998%), we nonetheless identified a small number of discrepancies in the realignments to the reference sequence. Upon assembly, the Applied Biosystems and Illumina realignments of the B6 sequence reads differed from the B6 reference sequence for 41 and 60 residues (Table 1), respectively; 29 of these differences were consistent in both datasets when compared to the reference sequence ( Figure 3). Subsequently, based upon realignment discrepancies called by one or the other platforms, we detected 43 provisional realignment discrepancies between our HTS and the reference sequence. Sequencing on each the SOLiD and Genome Analyzer platforms was completed independently, with assembly algorithms each requiring a minimal depth of 3 reads in order to confirm an allele call. Quality of the nucleotide in question, as well as the quality of the surrounding base calls, was taken into account. Together, these restraints offer high confidence in the B6 allele calls identified by both platforms that differ from the references sequence. Overall, this indicates an exceptionally low discrepancy rate (only 0.001 -0.002%) between HTS data and the reference sequence and also demonstrates the high quality of the sequence data generated in our analyses. Importantly, our custom HTS sequence data and the B6 reference strain sequence were both from RPCI-23 B6 BACs, which were BAC sequencing coverage  generated from a pool of five female B6 mice from the Jackson Laboratory. Thus, it is extremely unlikely that strain or template inconsistency contributed to the discrepancies in the realignment of our sequence to the B6 reference sequence.

SNP identification and confirmation
In order to compare the SOLiD and Genome Analyzer platforms under optimal conditions for each platform and to compare in a manner the majority of end-users are likely to employ, the manufacturers used parameters determined to be optimal for SNP calling on their particular platform. The Illumina SNP calling method (Maq) relies upon quality scores and applies a filter based on these qualities, read depth, and neighboring SNPs in order to discriminate between a SNP and a base-calling error. Applied Biosystems calls SNPs based on two-base encoding in color-space allowing for more sensitive discrimination between SNPs and base-calling errors. Illumina's method produced fewer no calls (Ns) than Applied Biosystems' method based upon realignments performed by each vendor (Table 1). No nucleotide bias was apparent in the frequency of the calls using the Genome Analyzer and SOLiD platforms. We conclude that because of Maq's probabilistic approach using qualities, more SNPs are called by Illumina/Maq than by Applied Biosystems; however, it remains to be confirmed if Illumina has more false positive calls. It is important to keep in mind that because the vendors used independent mapping and SNP calling approaches, differential results due to the platform cannot be distinguished from those due to the analysis pipeline.
Currently, the Mouse Phenome Database (MPD), which includes dbSNP and Perlegen data among other resources, offers the most inclusive SNP queries for mouse strains, including comparisons of the D2 and B6 strains (see Methods for further explanation of SNP Databases). While, MPD currently annotates 4,527 D2 vs. B6 SNPs in the chromosome 1 interval (171.6-174.6 Mb), custom HTS identified 11,824 SNPs ( Figure 3) for the same interval (referred to as PARC SNPs because they were sequenced by one or both platforms in work supported in part by the Portland Alcohol Research Center, PARC). 9,152 (77%) of the 11,824 PARC SNPs identified by custom HTS were identified using both the Applied Biosystems and Illumina realignments (i.e., were identified by two independent experiments) and are therefore of very high quality ( Figure 3). and a more distal SNP dense region (172.9 -174.6 Mb) ( Figure 4). The SNP sparse region contains only 16 PARC SNPs based on our custom sequencing, and the SNP dense region harbors 11,808 D2 vs. B6 PARC SNPs. This data further defines the haplotype break between D2 and B6 and, importantly, provides additional genetic markers for fine mapping within the SNP-sparse region which previously was not possible [6]. Additionally, full SNP annotation will inform future SNP array chips allowing for more precise genotyping.

SNP impact on protein function or expression
We assessed non-synonymous SNPs that result in amino acid changes in proteins because of their clear impact on protein activity. Based on our custom HTS, we found 76 missense (non-synonymous coding) PARC SNPs that changed an amino acid residue between the D2 and B6 strains. 36 of these were new missense SNPs (Table 2). Confirmation experiments (e.g., transcriptome sequencing and PCR directed sequencing) and interrogation of additional datasets [6,10] already have confirmed 30 of the 36 new missense SNPs (Table 2), with the remaining six lacking additional data. In an effort to determine whether the 36 new missense SNPs affect protein function, we used SIFT (Sorting Intolerant From Tolerant) [11], which uses sequence homology to predict whether an amino acid substitution affects protein function, and PolyPhen (Polymorphism Phenotyping) [12], which uses multiple sequence alignments and protein 3D structure to predict protein sequence effects on function. Based on these two prediction methods, eight of the new missense SNPs were predicted to affect protein function, although only one of these amino acid changes was predicted to be damaging by both methods (Table 2). Our data also confirmed 40 missense SNPs annotated in MPD. Finally, five missense SNPs in MPD were not confirmed by our data, including three that had full sequence coverage in our data and were determined to have identical D2 and B6 alleles. Our data indicate that these three are false positive missense SNPs in the public dataset: Fcgr2b (K203Q), The other two had ambiguous coverage in our alignments and could not be evaluated: Fcgr3 (G72W) and Itln1 (N162K).
In addition to the new missense SNPs, we identified 7,627 new PARC SNPs in non-coding regions. These SNPs were not previously known or annotated and may regulate gene and/or protein expression. For example, in the Kcnj9 gene, which demonstrates differential transcript expression between the D2 and B6 strains [13], we identified 18 new SNPs within 2 kb upstream of the transcriptional start site as well as nine new SNPs in the 3' untranslated region ( Figure 5). We also identified one false-positive SNP annotated in MPD in the 3' untranslated region.
Thus, for Kcnj9 and other differentially expressed genes, elucidation of cryptic SNPs could identify nucleotide variation that underlies QTL phenotypic effects.

SNP impact on gene expression microarrays
Naturally occurring genetic polymorphisms dramatically impact hybridization based techniques, including gene expression microarray analyses [14]. With the ability to assess alternative transcript expression, exon microarrays have ten times as many probes as previous gene expression microarrays, so eliminating hybridization bias using SNP masks is increasingly critical. We have developed and applied a complete SNP mask using all of the PARC SNPs found by our custom HTS, which allowed us to rigorously assess differential expression between the D2 and B6 strains. We assessed the impact of 124 SNPs that lie within core probesets on the detection of differential (genotypedependent) exon expression for genes in the chromosome 1 region of interest using

Conclusion
We report that comparisons utilizing even two of the most commonly used mouse genetic models, D2 and B6, predict that more than half of naturally-occurring SNPs remain unknown or not annotated. This is particularly striking given that the present comparison is between the B6 strain, upon which the mouse reference is based [7], and the D2 strain, which is one of the best annotated mouse strains with sequence from Celera and extensive SNP detection data primarily from Perlegen [1]. There are approximately 1.8 million SNPs currently annotated between the D2 and B6 strains in MPD [14]. Thus, cryptic SNPs would have been even more prevalent had we used more divergent or poorly annotated genetic models.
We compare two methods for next-generation HTS. By sequencing the same templates on both Applied Biosystems' SOLiD and Illumina's Genome Analyzer sequencing platforms, we determined that the platforms offer comparable results at a high read depth. More SNPs were called by Illumina/Maq than by Applied Biosystems, but because the vendors used independent mapping and SNP calling approaches, differential results are likely due to Read depth for Applied Biosystems' SOLiD (red) and Illu-mina's Genome Analyzer (blue) realignments Mouse models are an invaluable tool for identifying allelic variation that contributes to genetically determined differences in physiology and behavior. However, allelic variation is problematic for follow-up analyses, and removing technical bias resulting from naturally-occurring sequence variation is critical. Previously, we illustrated the impact of SNPs on gene expression microarray analyses [14] and argued that complete SNP masks for gene expression microarray and other hybridization techniques are essential to appropriately interpret these data.
Here, we have taken the next step and sequenced a region of two of the most widely used mouse strains in order to determine the comprehensiveness of SNP data. What we have found is that the mouse SNP data currently available is incomplete. In fact, for the D2 vs. B6 strains, we predict that less than half of the true SNPs are currently annotated. As more divergent mouse strains, harboring even more cryptic SNPs, are used in studies, the impact of SNPs on interpreting results will become increasingly problematic. This glimpse at complete sequence data for two strains demonstrates that full genomic sequencing of the mouse strains used as research models is warranted.

D2 BAC library screen
Using 32 chemiluminescent labeled PCR probes designed across a 3 Mb region of chromosome 1 (see Additional File 1 for PCR probe primer pairs), we systematically probed a D2 BAC library consisting of 215,040 clones spotted on 12 nylon filters. The MM_DBa BAC library was generated at Clemson University Genome Institute [15] in 2002 using a single male D2 strain mouse from the Jackson Laboratory where their Genetic Stability Program uses a cryopreservation approach to effectively limit genetic drift and ensure strain stability. 90 BACs were identified in the library screen using the standard Roche DIG chemiluminescent protocol for probing DNA library filters.

D2 BAC end sequencing
BAC ends were sequenced in order to determine if we had overlap. This was done in several rounds, in which we   designed new probes as needed to fill in gaps in our 3 Mb contig. In total, we sequenced the ends of the 99 BAC clones and aligned these BAC end-sequences to the reference mouse sequence. Given that the stringency of the filter hybridizations was variable, that interpretation of positive filter spots was not always straightforward, and that the quality of BAC end-sequencing was not always consistent, we proceeded to confirm each clone we identified. We confirmed a BAC as positive for our region if the end-sequence mapped uniquely to the region of chromosome 1, and the PCR probe mapped in between the end sequences. We confirmed 57 positive BACs for the chromosome 1 region and aligned these to create a minimal contig of 27 overlapping MM_DBa BACs: clones 75C22,  329K2, 163C23, 109M5, 438J2, 138A14, 372P6, 264L1,  69J19, 220P1, 196G24, 250H1, 17F16, 374A8, 271K8,  467L22, 405D20, 431P3, 359G4, 259B4, 246D12,  248E11, 270C10, 440I3, [24]. Ensembl protein IDs and respective amino acid differences are shown with the B6 amino acid residue indicated first and the D2 residue given last. Each amino acid change was assessed using two programs to determine whether or not the amino acid change is predicted to damage protein function. SIFT uses sequence conservation for predictions and PolyPhen predicts damaging changes using both multiple sequence alignments (A) and/or protein 3D structures (S).  [17]. Briefly, the sequencing process relies on the amplification of fragmented DNA to form clusters. The sequence content of these clusters is then queried using cycles of fluorescently labeled dNTP addition, detection, and fluorescent groupremoval. The data was optimally assembled by Illumina using Maq (Mapping and Assembly with Quality) [18] with the default parameters in the easyrun script [9]. We used the cns.final.snp output file with minimum read depth of 3 and minimum consensus quality of 30 required to call a SNP for these analyses.

Applied Biosystems (SOLiD) sequence and assembly
Equimolarly pooled BAC DNA samples were sent to Applied Biosystems as per their requirements: 49.7 ug of B6 BACs and 69.3 ug of D2 BACs. Applied Biosystems prepared two fragment libraries for sequencing from each pool of genomic DNA via standard methods on their SOLiD platform [19]. In short, the SOLiD process attaches clonally-amplified template DNA to a bead and adapter substrate and queries cyclically by adding fluorescently labeled probes specific to two bases, detecting these dibase-calls, removing the fluorescent group, and repeating. This is followed by a primer-reset which shifts the starting primer, allowing different bases to be queried using the above base-calling cycle.SOLiD data was analyzed by Applied Biosystems using their freely available software package. The SOLiD Analysis Tools (SAT) process the array image, perform data filtering, calculate quality values, align to a reference genome, and generate base calls using default parameters [20]. SNPs were called with a minimum read depth of 3 as a requirement.

Determining non-synonymous SNPs
Non-synonymous SNPs were computed using the Ensembl CDS sequence from UCSC for each transcript. Any SNP falling within coding region and the exon boundaries were used to convert the B6 allele into the D2 allele predicted by the realignments. The coding sequences were then reassembled and translated in silico using CLC Sequence Viewer 5 [21]. The amino acid sequences were compared and any differences were noted.

Transcriptome sequencing
Striatal tissue from D2 (n = 8) and B6 (n = 8) males and females was dissected and total RNA isolated using standard Trizol (Invitrogen) protocol. 3 μg of total RNA was pooled (n = 8) for each strain and sent to Illumina for library construction and standard transcriptome sequencing. The realignment was performed using ELAND with 2 errors allowed in the first 32 bases of a read. Because we had incomplete read coverage for lowly expressed genes, strain-specific sequence results were analyzed for highly expressed genes only. So this dataset offers confirmation of some of the SNPs detected in our DNA sequencing, but did not achieve complete transcriptome coverage. Additionally, individual transcript queries in Ensembl include strain variation data that contain realignments of the original strain specific raw reads from Celera (including D2) to the reference B6 sequence. These SNPs provide independent confirmation of some of the SNPs identified by custom HTS and are annotated as confirmation by "Realignment of Celera raw reads" in Table 2.

Data availability
All of the PARC SNPs discovered from both Applied Biosystems' SOLiD and Illumina's Genome Analyzer sequencing pipelines have been deposited in dbSNP at NCBI under the Handle PARC_SEQ as Computational SNPs (ss#119994841-120015816). D2 BAC end sequence has been submitted to GenBank as GSS (genome survey sequences). Raw sequencing data for SOLiD sequencing has been submitted to GenBank in the SRA (short read archive) without intensity data. Raw sequencing data for the Illumina sequencing is available upon request; however this data was generated before SRA standards were established, and the specific raw files needed for SRF format are no longer available.