High-throughput sequencing: a failure mode analysis

Background Basic manufacturing principles are becoming increasingly important in high-throughput sequencing facilities where there is a constant drive to increase quality, increase efficiency, and decrease operating costs. While high-throughput centres report failure rates typically on the order of 10%, the causes of sporadic sequencing failures are seldom analyzed in detail and have not, in the past, been formally reported. Results Here we report the results of a failure mode analysis of our production sequencing facility based on detailed evaluation of 9,216 ESTs generated from two cDNA libraries. Two categories of failures are described; process-related failures (failures due to equipment or sample handling) and template-related failures (failures that are revealed by close inspection of electropherograms and are likely due to properties of the template DNA sequence itself). Conclusions Preventative action based on a detailed understanding of failure modes is likely to improve the performance of other production sequencing pipelines.


Background
In the past decade, the demand for DNA sequence data has driven the transformation of sequencing from a research activity into a manufacturing process. Highthroughput sequencing facilities are focused on establishing automated procedures that maintain long read length and high overall success rates. It is neither practical nor economical to test each and every DNA template before sequencing [1]. Sequencing centres, therefore, monitor sequencing success on a larger scale referencing overall pass rates and average read lengths, typically in terms of Phred 20 bases [2]. The percentage of "sporadic sequence dropouts" or failed reads that inevitably occur within a pool of high quality data is often overlooked and rarely examined. Failed reads can be a result of numerous variables ranging from pipeline methodology employed to the nature of samples being sequenced. A Failure Mode Analysis (FMA) strategy was developed to determine the likely causes of sporadic unsuccessful sequence reads. We systematically examine these failed reads in the context of a high-throughput sequencing pipeline to establish the mode and frequency of each type of failure. The standard production pipeline at Canada's Michael Smith Genome Sciences Centre (BCCRC, British Columbia Cancer Agency, Vancouver, Canada) has a capacity to generate over 3.6 million reads per year. As of December 8, 2004, we have generated 1,263,904,347 Q 20 bases using our 384-well culturing, DNA preparation, and cycle sequencing procedures. The average Q 20 read length of data generated in the past 12 months (December 2003 to December 2004) from various library types and vector systems is 751 bases. The present study was undertaken to provide insight into the causes of sequencing failures and possible corrective actions. Although our pipeline uses exclusively ABI 3700 and 3730XL automated sequencers, these results should be applicable, in principle, to the improvement of other high throughput sequencing platforms.

Results
We generated 9,216 reads from 2,304 clones selected randomly from two cDNA libraries. For each of the two libraries, 1,152 bacterial colonies containing cDNA inserts were picked and arrayed into 384-well microtiter plates ( Figure 1). To verify loss of DNA due to handling or equipment mishaps (i.e. clogged capillary or tip), each microtiter plate was cultured in duplicate and replicates were processed using the same instrument model but on different physical units where available. A resulting 4,608 reads were generated for the 5' end using the M13Reverse (5'-CAGGAAACAGCTATGAC-3') primer and 4,608 reads were generated from the 3' end using the M13 Forward (5'-TGTAAAACGACGGCCAGT-3') primer. The average Q 20 read length for the entire data set was 771 bases, average pass rate was 87% which was calculated as a percentage of sequencing reactions yielding a minimum of 600 Phred 20 bases. Figure 2 illustrates a break down of Q 20 read lengths from the full data set. The analysis methodology employed to determine the failure mode of each trace is outlined in Figure 3. 1,172 reads (13%) represent the failed portion of the data set (Q 20 < 600) for further analysis to determine failure mode. The electropherograms from the 1,172 failed reads were evaluated and subsequently categorized into failure mode categories. 64 of these reads were yielded from sequencer capillaries that were clogged and therefore were removed from further analysis and categorized into the "Blocked capillary" failure mode. The remaining 1,108 traces were further classified into nine additional failure mode categories including Low signal strength, Mixed clone with vector sequence, Mixed clone-no vector sequence, Low signal to noise ratio, Excess dye peaks, Hardstop, Repetitive sequence, Homopolymer stretch, and Poly A tail. Results and trace characteristics used to classify each read are as described in Table 1. Eight of the classifications described in Table 1, except "Low signal strength", are final failure mode categories and contain 74% of the total reads.
We identified various failure mode trends by examining process versus template-related failures. Figure 4 shows 62.4% of reads with Q 20 <100 bp fail as a result of processrelated problems while 75.5% of reads in the Q 20 : 500-599 bp bin fail due to template-related characteristics. The proportion of failed reads due to process-related problems are more abundant in traces with lower average read lengths while the opposite trend showing template-   Figure 5B.
The most common mode of failure due to process was "low signal, no DNA", there were 146 reads in this category, where no template DNA was present in the sequencing reactions. This was confirmed by agarose gel electrophoresis. 145 of the reads resulted in read lengths less than 100 bp. From these 146 reads, 128 original clones failed to grow in the source glycerol stock microtiter plates. The remaining 18 clones were lost during the DNA preparation/purification procedure. Mixed clone reads containing vector sequence was the second most prevalent process-related failure mode. The 137 reads in this category were distributed broadly across the Q 20 : 100-599 bp range. The more successful reads yielded electropherograms with significantly stronger signal strength from one reaction product compared to the other.
The most common failure mode resulting from templaterelated characteristics was "Poly A tail" or failed reads with attenuated sequence due to poly A tails, there were 484 reads in this category. The distribution of these 484 reads is skewed towards the Q 20 >300 bp bins. Failures due to "Low signal, 5', 3' template characteristics" also contribute to a significant number of template-related failures. These 116 reads were Q 20 <100 bp and make up the majority failure mode within that Q 20 bin.
Our sample set was made up of cDNA clones and therefore contain some 3' template biases inherent in the sequencing of cDNAs. In a best effort to obtain a distribution representing randomly generated end sequence from various library types, we remove all failed reads resulting from 3' template attributes. The two failure modes targeted for removal are "Poly A tail" and the 3' reads within the "Low signal, 5', 3' template characteristics" category. 484 and 94 reads were removed from each respective category representing 49.3% of all failed reads. The resulting template-related failure mode distribution is represented in Figure 6. The process-related distribution of failed reads ( Figure 5A) remains the same. Failed reads due to homopolymer stretches other than those resulting from poly A tails, make up the prominent failure mode within Average read length breakdown Figure 2 Average read length breakdown Distribution of read length (Q 20 bases) for full data set of 9,216 reads. Results were divided into 100 bp bins, failed reads (Q 20 <600 bp) make up 13% (1,172 reads) of overall reads. Reads with Q 20 <100 bp make up the largest proportion of failed reads but contributes to only 4% of overall data set. this new template-related failure distribution with 99 reads. The distribution of these reads is weighted towards Q 20 >300.
We further analyzed the 310 reads (26.5% of failed) in the preliminary "Low signal strength" category in Table 1 to more finely determine each failure mode ( Figure 2, Table  2). We removed the 146 reads and classified them as process-related, "low signal, no DNA" failures as described above. For the remaining 164 reads, DNA was found to be present but there was no evidence from gel images for excess DNA. It is therefore unlikely that the low signal failures were due to overloading of capillaries with excess carryover template DNA.
The 164 reads that appear to have sufficient template DNA are grouped by source clone for further analysis. Four possible reads may have failed for each unique clone: 5' replicate #1 (F1), 3' replicate #1 (R1), 5' replicate #2 (F2), and 3' replicate #2 (R2). There were 49 occurrences where two reads failed from one source clone, 2 instances where three reads from the same clone failed and 4 occurrences where all four reads failed from the same source clone. These results are shown in Table 2 and Table 3 summarizes the overall breakdown of 310 reads in the Low Signal Strength category. A binomial test of our data set indicates that the likelihood of 2, 3, or 4 reads failing from the same source clone is 1.9 × 10 -3 , 2.2 × 10 -5 , and 9.7 × 10 -8 , respectively. Thus, it is unlikely that more than one read failed per clone by chance.
Groupings with two failed reads from the same end of the template, "F1/F2" and "R1/R2" showed trace data with noisy background or no signal trace therefore there was insufficient information to determine actual template characteristic leading to the problematic results. With the low probability of seeing two failed traces from one unique clone, we conclude that these failures are most  likely a result of template DNA with 5' or 3' ends that are problematic for sequencing. Possibilities include secondary structure, mutated priming sites, long poly A tail, or other homopolymer stretches near the priming sites on the template.

Analysis pipeline
The "F1/R1" grouping contained failed reads from both ends of the same replicate but successful end reads from the second replicate. This would indicate that there were no clone related attributes on either end of the template to prevent good sequence thus the failed reads were assessed a process-related failure due to loss of DNA during the precipitation process.
The "F1/F2/R1/R2" grouping indicates that no passing sequence data were obtained from 5' or 3' ends of both replicates from the original source clone, all 16 reads showed presence of DNA at each observational check in the process pipeline. The probability that 4 reaction products from a single clone were lost during precipitation is extremely improbable and it is more likely that the original clones had attributes that prevented successful sequencing from either ends. We therefore assign 5' and 3' template-related failure modes for the 16 reads in this grouping.
The final groupings of "F1/F2/R1" and "F2/R1/R2" contain 6 reads in total. Each include failed reads from both ends of one replicate and a single failed read from the second replicate. We maintain the same logic for our failure mode evaluation by assessing both a template-related failure to the 2 failed reads from the same replicate and a Distribution of overall process vs template failed reads There were 44 reads in the singleton category where only a single read (F1, F2, R1, or R2) failed from each clone. These failures cannot be attributed to the template as three other reads were successfully sequenced from the same source clone. Failure mode was therefore likely to be process related and due to loss of DNA during the precipitation wash steps.

Discussion
Having established a stringent pass criteria of 600 Phred 20 bases or greater, we observed a broad range of failure modes distributed among the failed reads. Reads less than 100 bp did not yield useable data and were usually a result of process-related failure modes. Process-related mishaps in the pipeline would include missed colony picks, blocked capillaries on sequencers, faulty tips on automated liquid handling devices, or lost DNA template during precipitation. The sequences generated have low average read lengths, or in cases where DNA is totally lost, zero read length. 88% of reads within the largest "process" related failure mode category, "low signal, no DNA", were a result of failed cultures at the beginning of the process pipeline. The absence of DNA was verified early on by failed growth in the source microtiter plates and further confirmed by failed growth in the stamped replicate agar plate. Failures are perhaps due to missed bacterial colony picks and adjustments such as recalibration or refining the morphology criteria of the colony picker should help. The second most prevalent failure mode in the "process related" category, "Mixed clone with vector sequence" (30% of reads in this classification), is likely a result of two side by side colonies being picked into the same well or through contamination originating from the use of automated liquid handlers. These problems might be alleviated by increasing the stringency of the proximity criteria on the colony picking device to ensure sufficient distance exists between two bacterial colonies before they are chosen for picking. As we use new tips on the instruments daily, increasing the length or number of wash cycles on the automated liquid handlers between volume transfers may help in reducing carry over contamination from one plate to another. This will need further investigation. Cross contamination between adjacent wells can also arise with the mishandling of plates, or when tips are plunged too deeply into the wells, displacing the volume into adjacent wells. Adjusting the liquid handler by reducing the depth and speed of tips entering each well may help prevent the occurrences of cross contamination within a plate. Despite the presence of mixed clones within one well, traces often yielded discernable sequence with read lengths broadly distributed up to the Q 20 : 500-599 bp bin. This is likely the result of a presence in greater amount of one reaction product over the second leading to proportionately stronger signal strength.
Template-related failures often still yielded reads of several hundred base pairs. Template characteristics that might contribute to failed or truncated reads could be poly A tails, homopolymer stretches, or other highly repetitive regions. Problems in sequencing these types of regions are well cited in the literature and are dependent on chemistry and methodology used in sequencing. These problems are best avoided by using an anchored primer for first strand synthesis of cDNA as the first step in library construction. "Poly A tail" was the single largest cause of failed sequence in both the template-related failure category as well as the entire data set, yet the reads make up Process failure mode distribution Reads with a "Low signal 5', 3' template characteristic" failure mode contribute to the largest proportion of data with read lengths Q 20 < 100.
the majority in the higher failed read length bins. In our regular production pipeline, we often circumvent this problem by using a combination of a 3' primer to resolve sequence immediately following the poly A coupled with an oligo(dT) 23 N (N = A, G, C) anchored primer to resolve the 3' ends of cDNA inserts.

Conclusions
The FMA pipeline described here was tailored to our highthroughput 384-well automated sequencing pipeline but many of the components in this platform are shared within the high-throughput sequencing community such as the alkaline lysis procedure to prepare and purify DNA, DNA sequencing using Big Dye chemistry, ethanol precip-itation to clean up reaction products and equipment such as the Genetix QPIX, Beckman Coulter Biomek FX, and Applied Biosystems 3730 xl DNA analyzer. For this reason our FMA methods can be readily adapted to analyze other similar sequencing platforms. Extending our present study to other library types commonly sequenced in the high-throughput community, such as shotgun or Serial Analysis of Gene Expression (SAGE) should offer further information regarding template specific failures. This failure mode analysis provides information on distinguishing between process and template-related attributes that may lead to downstream failed sequence. It can therefore be a useful tool used to audit overall sequencing procedures and identify key problematic steps in a process pipe-Template failure mode distribution Figure 6 Template failure mode distribution Template-related failure mode distribution excluding reads failing due to poly A tail and 3' template-related attributes. A majority of failed reads are a result of "Homopolymer" or single nucleotide repeat sequence leading to slippage in sequence data, these failed reads are skewed towards read lengths Q20>300.

Methods
Samples are processed in duplicate, with quality assurance and observational checks to account for status of each read after every procedure leading to the final sequence read. These recorded observations facilitate downstream systematic analysis of failed sequence data. The regular core production pipeline was not altered for this study, as the purpose is to assess sequence failure modes under ordinary conditions in a high-throughput sequencing environment. An overview of the observational checkpoints within the FMA pipeline is outlined in Figure 1. Template DNA is considered present in all wells up to the completion of DNA sequencing unless absence is indicated by no growth of bacterial culture or blank lane on gel. Once cycle sequencing has completed, the reaction products are precipitated, pelleted, then washed. As it is very difficult to qualitatively asses the presence of DNA accurately after the precipitation procedures other than by sequencing, we draw conclusions regarding loss of sequencing reaction products during precipitation by a process of elimination. Any failed reads that result from a process-related loss of DNA and have no prior observations indicating absence of DNA are attributed to the precipitation procedure.

Transformation and colony picking
One microliter of ligation mix from each of two Populus trichocarpa cDNA libraries were transformed by electroporation into 40 µl of E. coli DH10B T1 resistant cells (Invitrogen). Transformed cells were recovered using 1 mL of SOC medium (prepared in house) and plated onto 22 cm × 22 cm agar plates (Genetix) containing 2xYT agar and 100 µg/µl Ampicillin. Agar plates were incubated overnight at 37°C for 14 hours. Bacterial colonies were picked from the agar plates and arrayed into 384-well microtiter plates (Genetix) containing 60 µl of 2xYT medium + 7.5% glycerol (made in house) using the Genetix QPIX automated colony picker (Genetix). A total of six 384-well microtiter plates, three plates for each of the two cDNA libraries, were picked. The plates were incubated overnight at 37°C for 16 hours then each microtiter plate was inspected for wells that contained no growth. The positions of all failed cultures were recorded. To further verify growth in each well after the incubation period, a disposable 384-well replicator was used to stamp bacterial culture from each 384-well microtiter plate onto a new 22 cm × 22 cm agar plate containing 2xYT agar and 100 µg/ul ampicillin. Agar plates were incubated overnight at 37°C for 14 hours then inspected the next day for colonies from every well. The positions of all failed growths were recorded.

384-well culturing and DNA purification of plasmid clones
Two microliters of bacterial culture was transferred from the 384-well microtiter plate into a 240 µl 384-well deep well diamond plate (Axygen) containing 60 µl of 2xYT medium and 100 µg/ml ampicillin using a 384-well slot-  DNA sequencing DNA Sequencing reactions were assembled in 384-well clear optical reaction plates (Applied Biosystems) using a Biomek FX workstation (Beckman-Coulter). In each 5 µl reaction (total volume) the following were added: 3 µl of purified plasmid DNA (~45 ng/µl), 0.26 µl of sequencing primer (5 pmol/µl, Invitrogen), 0.43 µl of 5X reaction buffer (Applied Biosystems Big Dye Terminator 5X Sequencing Buffer), 0.77 µl of Ultrapure water (Gibco), and 0.54 µl of BigDye v.3.1 ready reaction mix (Applied Biosystems). Each well of the reaction plate was visually inspected for appropriate volumes after both reaction mix and DNA addition. Volumes were manually adjusted using a single channel pipet (Gilson) where required and all observations were recorded. Sequence data were obtained using universal M13 Forward (5'-TGTAAAAC-GACGGCCAGT-3') and M13 Reverse (5'-CAG-GAAACAGCTATGAC-3') primers on each set of replicate plates. Thermal cycling was performed on PTC-225 thermal cyclers (MJ Research) with parameters of 35 cycles at 96°C for 10 seconds, 52°C for 5 seconds using M13 Forward primer or 43°C for 5 seconds using M13 Reverse, 60°C for 3 minutes, followed by incubation at 4°C. Reaction products were precipitated by adding 2 µl of 125 mM EDTA (pH8) and 18 µl of 95% ethanol per well followed by centrifugationat 2750 × g for 30 minutes in an Eppendorf 5810R centrifuge. The EDTA/ethanol was immediately decanted and reaction products washed with 70% ethanol. The 384-well cycle plates were allowed to dry inverted for 15 minutes. Samples were resuspended in 10 µl of Ultrapure water and analyzed using a 3730XL DNA analyzer (Applied Biosystems). The performance of each capillary on the four DNA analyzers used in this experiment were validated using one 384-well control plate for each instrument. Each 384-well plate contained our inhouse control standard, a full-length human cDNA clone obtained from the I.M.A.G.E. Consortium (I.M.A.G.E. ID #3609158, Lawrence Livermore National Laboratories). Blocked capillaries were recorded into the FMA database and traces originating from these capillaries flagged. Sequence data were evaluated using PHRED software [2] (v.0.020425.c) and the chromatograms were viewed using a java applet based on 'ted' [4] -a publicly available trace file viewer. A relational database "FMAdb" was created using MySQL for flexible querying of results. The FMAdb was populated with sequence data, plus process observations such as absence of bacterial colonies and no-grows, trace evaluations and information, as well as equipment and sequence run details.