This study was approved by the Institutional Animal Care and Use Committee at Lawrence Livermore National Laboratory (Protocol Number 2009–207). The rabies positive brain samples included in this study were not covered by IACUC 2009–207 as they were taken from residual archived diagnostic samples offered by the California Department of Public Health Laboratory, Richmond, CA and were originally collected and tested to inform public health decisions on administering anti-rabies vaccination.
The data described in this paper is available via anonymous ftp as reads, BAM format alignments, consensus sequence, nucleotide frequency profiles at: ftp://gdo144.ucllnl.org/pub/orpdat/.
Rabies: Two brain tissue samples obtained from grey foxes (Urocyon cinereoargenteus) displaying symptoms of rabies were collected in Humboldt County, CA in March 2009 and December 2009 and tested for rabies virus via RT-PCR  using a modified protocol that amplifies a portion of the N gene. Approximately 1 gram of tissue from each brain was sent to LLNL for analysis. RNA was extracted from the tissue sample using TRIzol® LS Reagent (Invitrogen, Carlsbad, CA) following the manufacturer’s protocol.
Bovine coronavirus (BCV): Nasal samples were collected from approximately 100 asymptomatic calves. Samples were collected using sterile polyester swabs, placed in 2-3 mL of Eagle’s Minimum Essential Medium (Gibco) and transported on ice back to the laboratory. Collected nasal swabs were vortexed in Eagle’s Minimum Essential Medium supplemented with 1% antibiotic-antimycotic solution (Gibco). The sample suspensions were clarified by centrifugation at 2000 x g for 30 minutes, filtered through a 0.22 μm filter and aliquots of about 500–1000 μl were stored at −80°C. RNA was extracted using TRIzol® LS Reagent following the manufacturer’s protocol. Samples containing BCV RNA were identified as described in Cho et al. .
Approximately 11 kb of the rabies virus genome and 12 kb of the BCV genome were amplified using reverse transcriptase (RT) PCR. PCR primer candidates were selected based on the combined results of the multiple sequence alignment and sequence searches. This technique is a modified version of the approach outlined in Slezak et al. . Three sets of degenerate PCR primers were tested for the amplification of each overlapping region of the rabies virus and BCV genomes using increments of 1.5-2.5 kb. For each region, the two primer sets that performed best were used to amplify cDNA obtained from the two fox rabies samples and one BCV sample. All primers used for amplification are given in Additional file 1.
RT-PCR and cloning
The rabies and BCV genomes were amplified using two-step RT-PCR using Superscript III RT reverse transcriptase kit and Platinum Pfx polymerase (Invitrogen), following manufacturer's instructions. Reverse transcription was performed using random hexamers the PCR conditions consisted of 94°C for 5 min, followed by 35 cycles of 94°C for 15 s, 54-60°C for 30 s, and 68°C for 2.5 min.
A 1 kb region of the rabies virus and BCV genome were amplified and each cloned into a plasmid vector. The inserts were generated by RT-PCR as described above using rabies and BCV polymerase primers: RVpolyF1 5’ CCCCTGACTCCTTATATCAAAACC, RVpolyR1 5’ GCGAGGTTGACTATTTGGTC, BCVpolyF2 5’ TTTGCAGACAAATTGGTGGA, and BCVpolyR2 5’GGCGTAAATTTCATCCTGCT. Poly 3’ A overhangs were added to the PCR products by incubating the products with Taq polymerase at 72°C for 10 min. TOPO TA Cloning Kit for Sequencing (Illumina) was used to clone the PCR products into One Shot® TOP10 cells (Invitrogen) as per manufacturer’s instructions. Sanger sequencing of the cloned controls was carried out by ELIM Biopharmaceuticals, Inc., Hayward CA. PCR products were prepared for Illumina sequencing using the QIAquick PCR Purification kit (Qiagen).
Sequencing of the three natural samples and the two control plasmids was carried out by Eureka Genomics (Hercules, CA) using an Illumina Genome Analyzer IIx. Each natural viral sample was sequenced in a separate lane of a single flow cell using paired-end reads on short genomic fragment inserts using read lengths of 112 bases. The clonal controls were mixed in a single sample with an approximate concentration ratio of 10:1 (rabies:BCV) and sequenced on a separate lane. Since the PCR primers could potentially introduce false mutations into the amplicon pool due to non-specific binding, primer regions were masked out for the downstream analysis. Table 1 summarizes the output generated in the sequencing runs.
To compare error rates between different sequencers, rabies control plasmid was sequenced a second time at Elim Biopharm (Hayward, CA) using Illumina Genome Analyzer IIx. A shorter fragment length was chosen with more complete overlap of the read pairs. The same amplicon pool was used for both sequencing runs.
Read mapping to reference
The open source software SHRiMP2 was used for read mapping. The tool was chosen for its high read mapping sensitivity  and its ability to map as many reads as possible in the face of individual errors within each read .
A consensus sequence was generated for each sample following an iterative comparative assembly procedure suggested by Willerth et al. . In this approach, an initial reference sequence was chosen, reads were mapped to the reference, then a new consensus sequence was generated and the reads were mapped to the new consensus again. The procedure continued until the consensus converged on a single sequence.
All rabies reads were initially mapped to GenBank rabies reference sequence GI:260063801. This reference sequence was used as the common coordinate system for comparing samples and identifying coding frames. Similarly, GenBank bovine coronavirus GI:15081544 was used as the reference sequence for the BCV samples.
Based on a later observation that our sequenced rabies virus genome differed by approximately 9% relative to the pre-selected reference fox rabies genome, we checked to see if observed error rate (defined below) would increase by introducing random mutations at 9% of the plasmid control reference sequences generated from Sanger sequencing. Increased divergence between the sample and the randomly mutated reference sequence could confound the read mapping program and introduce additional alignment errors, however, no noticeable increase in error rates were observed, suggesting that the read mapping parameters were able to tolerate this rate of divergence.
Estimation of error rate from control plasmids
The two plasmid controls were used to empirically model combined PCR and sequencing errors as well as to evaluate our algorithm for making genetic variant calls. The clone control samples were amplified using the same PCR amplification protocol as was used for sequencing natural samples. A control reference sequence was generated from a separate Sanger sequencing run. Any polymorphisms that deviated from the consensus sequence were taken to be examples of error introduced either through PCR amplification or sequencing.
At every base, any nucleotide called by a read is referred to as a “candidate base call”. Error rates were calculated as a ratio between the total number of candidate base calls differing from the consensus nucleotides summed across the genome and the total number of base calls made across the genome.
Quality control of the sequencing data
The following rules were implemented to maximize the quality of the reads. Quality scores (Q-score) from the sequencer were used to compile nucleotide frequency distribution for every base sequenced. These nucleotide frequency distributions were generated at four quality thresholds for comparative analyses: raw (all reads), Q≥10, Q≥20 and Q≥30. At a given base, a read covering the base contributes to a candidate base call only when the minimum Q-score over an 11-nucleotide window (±5bp) centered on the query base surpasses the quality-score threshold being considered (e.g. the Q30 nucleotide frequency profile). In addition to Q-score filtering, a misalignment filter required that the 11-nt window contained no indels and that the query position must be at least five bases away from the end of the read to avoid misalignment for single reads. To avoid the potential of higher error rates at the 3-prime ends of single reads, only the first 80 bases were used in single read analyses. Furthermore, when calculating error rates and making variant calls, only those bases where greater than 10% of the reads that survive the quality filters are considered for analysis at that quality score. This is to avoid inclusion of non-representative features for select bases in the genome.
Sequencing error analysis
At every base, all overlapping read pairs were separated into two categories: matching and non-matching base pairs. Matching base pairs have two complementary nucleotides and non-matching base pairs have two incongruent nucleotides. Mismatched read pairs were only used to calculate mismatch rates and examine the relationship between mismatch rates and quality scores. Other than that, mismatched read pairs were excluded from all error and variant analyses.
Mismatch rates were calculated two ways, ‘per position’ and ‘per base pair’. Per-position mismatch rate is the fraction of overlapping read pairs that are mismatched at any given location in the genome. Per base pair mismatch rate is the total number of mismatched read pairs summed across all bases in the genome divided by the total number of read pairs.
To generate quality score distributions for the matching and mismatched read pairs, Q-scores for every base pair were compiled as follows. For a matching base pair, the average quality score was used. For a non-matching base pair, the minimum quality score was used. The resulting Q-score distributions were compared and used to generate Q-score receiver-operator characteristic (ROC) and ‘false discovery rate’ curves.
Sequencing errors can occur in two forms in overlapping read pairs. Non-complementarity between the forward and reverse strands at a given base indicates that at least one of the two nucleotides is erroneously incorporated. This type of error is straightforward to exclude -- all non-matching read pairs are excluded from analysis except in the analyses for quality control. A second, more rare but ‘hidden’ form of error is where two complementary errors occur on both the forward and the reverse strands such that the resulting read pair remains complementary.
Making variant calls
Variant calls were made based on the binomial error model used by Eriksson et al. [40
], with modifications to address the non-uniformity in sequencing error rate measured with ORP mismatch rates. Specifically, the probability of observing x or more mutations in N matching read pairs covering the base is given by the survival function of the binomial distribution B (N, p
The error rate, p, is the combined PCR and sequencing error, ε, adjusted by a function of the ORP mismatch rate, δ, at the base in question.
The read-pair mismatch rate, δ
, is the position-dependent
rate at which a nucleotide is mis-incorporated into a single strand. For simplicity, we modeled the probability where two complementary nucleotides are mis-incorporated simultaneously on forward and reverse strands at the same base as δ
. We used the maximum of the clonal control derived error and the square of the ORP mismatch rate at a particular base as the adjusted position-dependent error rate
for binomial error model
P-value = 0.01 with Bonferroni correction was used as the significance threshold for each hypothesis test.
Scripts for the variant detection model were written in Python and R and are available upon request.