Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance
© Bashir et al; licensee BioMed Central Ltd. 2010
Received: 23 December 2009
Accepted: 18 June 2010
Published: 18 June 2010
Massively parallel DNA sequencing technologies have enabled the sequencing of several individual human genomes. These technologies are also being used in novel ways for mRNA expression profiling, genome-wide discovery of transcription-factor binding sites, small RNA discovery, etc. The multitude of sequencing platforms, each with their unique characteristics, pose a number of design challenges, regarding the technology to be used and the depth of sequencing required for a particular sequencing application. Here we describe a number of analytical and empirical results to address design questions for two applications: detection of structural variations from paired-end sequencing and estimating mRNA transcript abundance.
For structural variation, our results provide explicit trade-offs between the detection and resolution of rearrangement breakpoints, and the optimal mix of paired-read insert lengths. Specifically, we prove that optimal detection and resolution of breakpoints is achieved using a mix of exactly two insert library lengths. Furthermore, we derive explicit formulae to determine these insert length combinations, enabling a 15% improvement in breakpoint detection at the same experimental cost. On empirical short read data, these predictions show good concordance with Illumina 200 bp and 2 Kbp insert length libraries. For transcriptome sequencing, we determine the sequencing depth needed to detect rare transcripts from a small pilot study. With only 1 Million reads, we derive corrections that enable almost perfect prediction of the underlying expression probability distribution, and use this to predict the sequencing depth required to detect low expressed genes with greater than 95% probability.
Together, our results form a generic framework for many design considerations related to high-throughput sequencing. We provide software tools http://bix.ucsd.edu/projects/NGS-DesignTools to derive platform independent guidelines for designing sequencing experiments (amount of sequencing, choice of insert length, mix of libraries) for novel applications of next generation sequencing.
Massively parallel sequencing technologies provide precise digital readouts of both static (genomic) and dynamic (expression) cellular information. In genetic variation, whole genome sequencing uncovers a complete catalog of all types of variants including SNPs  and structural variations . Transcript sequencing [3, 4], small RNA sequencing and CHip-Seq  allow a measurement of dynamic cellular processes. These technologies provide unprecedented opportunities for genomics research but also pose significant new challenges in terms of making the optimal use of the sequencing throughput. The individual laboratory might not be equipped to provide correct, and cost-effective designs for the new experiments. By 'design', we refer to questions such as "How much sequencing needs to be done in order to reliably detect all structural variations in the sample to a resolution of 400 bp?" Confounding this further is the proliferation of a large number of sequencing technologies, including three widely used platforms, Roche/454 , Illumina  and ABI SOLiD , and others such as Pacific BioSciences  and Helicos [9, 10]. These technologies offer the end-user a bewildering array of design-parameters, including cost per base, read-length, sequencing error rates, clone/insert lengths, etc. It is not straightforward to make a reasoned choice of technology and design-parameters in conducting a particular experiment. Likewise, the technology developers are faced with difficult choices on which parameters to improve in future development.
For any particular application, the goal of the researcher is to achieve the desired objective in a cost-effective manner. For example, in genome resequencing, the primary objective is the sensitive and accurate identification of various forms of sequence variants. Accurate SNP detection can be achieved even using short 36 bp Illumina reads . However, for other applications such as de novo assembly of genomes, longer reads are significantly better than short reads . RNA-seq is a novel application of sequencing to determine the expression levels of different mRNA transcripts in the cell . However, the exponential variability in transcript expression levels poses new design questions regarding the required depth of sequencing to sample low abundance transcripts. Resolving such design questions can allow one to expand the scope of next-generation sequencing in novel directions. In this paper, we address and resolve some of the common design questions relating to structural variation and transcript profiling.
Modeling SV detection
As detailed below, and in Figures 1a-d, SVs often involve the creation of breakpoints: a pair of coordinates (a, b) in the reference genome, that are brought together to a single location ζ in the query. Consider the deletion event in Figure 1a. A reference segment of length l = b - a + 1 is absent in the query, relative to the reference. For the breakpoint (a, b) to be detected a paired-end insert must span ζ. Note that the insert-size is not fixed, but distributed tightly around a mean (L ± σ). Deletion is confirmed if the breakpoint is spanned and l >> σ. Typically, σ <<L so we simply require that l > L, which is sufficient but not strictly necessary.
This approximation illustrates an important difference between 'algorithm design' for SV detection, and experiment design. Using a clever algorithm based on higher coverage, and variation in insert length (σ), it may be possible to detect smaller deletions (σ <l <L) as well. however, in deciding how much sequencing is done, we simply focus only on l > L. This simplification allows us to handle many different types of SV using identical design criteria. The similarity to other cases is described below.
The case of inversion is shown in Figure 1b. Here, two breakpoints, (a1, b1), and (a2, b2) are fused together in the query. Denote the length of the inversion SV as l = b1 - a1 = b2 - a2. The inversion is detected when both breakpoints are detected. As in the case of deletions, either breakpoint is detected when a shotgun insert of the query spans the corresponding fusion point, and has exactly one end-point inside the inversion. We enforce this by requiring that L <l, even though the condition is sufficient but not strictly necessary.
One or more of the SV breakpoints must be detected.
- 2.For each breakpoint:
A shotgun insert must span the corresponding fusion point.
The reads at the ends of the insert must map unambiguously to the ends.
The insert-size must be dominated by the SV length (l > L).
This abstraction clarifies the design questions considerably. While the algorithmic questions must still deal with each SV separately, the design questions focus on breakpoint detection. We consider 2(b, c) first. For any choice of technology, and insert length, the distribution can be empirically computed by looking at concordantly mapped reads. Using this distribution, we can compute the probability of a randomly picked insert having a specific size.
The choice of a specific read-length is somewhat less important, but the reads must be long and accurate enough to map unambiguously. We model both points by introducing a parameter f, referring to the fraction of reads that map unambiguously. Therefore if N inserts must map unambiguously to satisfy design constraints, then N/f inserts need to be sequenced, on the average. In the remainder, we limit the discussion to detecting breakpoints, considering only the technologies and insert sizes that satisfy the size-constraint (1); and, we assume a mapping parameter f to scale the answers. The issue now is to choose from available insert-sizes, and second, to determine the amount of sequencing. In this paper, we formulate, and resolve design issue 2(a) as:
Given a choice of insert-sizes, and parameter ε , compute the amount of sequencing needed to detect 1 - ε of all breakpoints in the query genome.
We address the questions of breakpoint detection conjunction with the related notion of breakpoint resolution. With most technologies, a breakpoint detected as a pair of regions ([a1, a2], [b1, b2]), such that a ∈ [a1, a2], and b ∈ [b1, b2]. The resolution, defined by |a2 - a1| + |b2 - b1| refers to the uncertainty in determining (a, b). Good resolution is critical elucidating the phenotypic impact of the variation In an earlier work, we described the use of tightly re solved breakpoints in detecting gene fusion events cancer . This framework was extended to form general geometric approach for detecting structural variants . We reformulate and resolve the question
Given a choice of insert-sizes, and parameters ε, s, compute the amount of sequencing needed to detect 1 - ε of all breakpoints in the query genome to a resolution of ≤ s bp.
Intuitively, the likelihood of detection would be maximized by choosing the largest available insert-size. However, the longer insert-sizes increase the uncertainty in resolving the breakpoint. One result of our paper is an explicit trade-off between detection and resolution. We also derive a formula that computes the probability of resolving a breakpoint to within 's' base-pairs, given a fixed number of shotgun reads from a specific paired-end sequencing technology. Another result of our paper is that it is advantageous to use a mix of insert-sizes. For example, we can show that only 1.5× mapped sequence coverage of the human genome using Illumina (Solexa) can help resolve almost 90% of the breakpoints to within 200 bp using a mix of inserts. All other parameters being equal, we show that the best resolution of a structural variant comes from using exactly two possible insert-lengths: one that is as close as possible to the desired resolution, and one that is as long as technologically possible (with reasonable quality).
In summary, the researcher can use our formulae in designing his experiment to (a) select appropriate insert-sizes; (b), the optimum amount of sequencing for each insert library. A web-based tool based on the above is available.
Transcript sequencing is a direct approach for measuring abundance, and variations involving splicing, and SV mediated gene disruptions, and fusions . In most transcript sequencing methods, RNA is fragmented, and converted into cDNA, which is subsequently sequenced and mapped back to a reference . This protocol has shown great promise in detecting aberrant splice forms and SVs that lead to gene disruptions, and fusions .
Often, transcript sequencing is used for gene expression profiling. See Figure 1e. The significant difference in sampled reads (5 to 1) between Samples 1 and 2 suggests that gene A's expression level has changed between the two samples. In measuring relative abundance, RNAseq mimics older technologies like microarrays. However, sequencing stands alone in being able to compute relative abundance between two distinct transcripts. In sample 1, the difference in read coverage between genes A and B suggest that A is more than twice as abundant as B (assuming A and B are approximately the same length).
A typical design question for transcript sequencing is to determine the amount of sequencing required to sample a given fraction (Say, 90%) of the expressed transcripts. The question is particularly difficult to answer because different transcripts have vastly different normalized-expression values. Using empirical and analytical observations, we show that the p.d.f of the normalized-expression can be computed using a small sample. Therefore, a researcher can start with an initial sequencing run (< 500 K reads), and use the mapping data to compute the additional amount of sequencing needed. Formally, we resolve the following:
Given transcript mappings from a small sample of sequences, and parameter ε, compute the amount of additional sequencing needed to detect 1 - ε of all expressed transcripts.
Our results are based on novel extrapolation for the low abundance genes that are not accurately represented in the sample. They allow the researchers to efficiently allocate resources for large RNA sequencing studies. This is particularly relevant when many related samples are being sequenced and one needs to assess the trade-offs between sequencing depth and sample coverage.
Results and Discussion
As discussed in the introduction, we can limit the question of SV detection to detection of SV breakpoints. Let breakpoint (a, b) in the reference genome fuse to a single point ζ in the query genome. Let P ζ , denote the probability that an arbitrary breakpoint is detected. Our goal is to derive an expression for P ζ , given a certain amount of sequencing.
Direct application of breakpoint formulae requires that one selects from insert-sizes that are smaller than the desired SV length. In the following, we work with available inserts, where the mean insert-size ranges from L = 200 bp to L = 10 Kbp. Therefore, a result that says P ζ = 0.9 can be interpreted to mean that 90% of all breakpoints from SVs of length significantly larger than L Kbp can be detected. These specific values are chosen for illustration purposes only. Identical results apply for smaller or larger SVs, except that we would be limited to choosing from appropriate insert-sizes. All analytical results are derived assuming a fixed value for L. However, all results on real data use the natural variation in insert-size, and show excellent concordance with the analytical results.
Equations 2, and 3 provide an SV detection versus resolution trade-off. For a fixed number of sequences N, increasing L increases the probability of detection, but also increases the resolution-ambiguity. The effect decreases for large N. To validate this using experimental data, we used the publicly available Illumina generated human reference sequence from NA18507, a Yoruban male . Using the complete data, we computed a set of "true breakpoints" from SVs of length ≥ 2000 (see METHODS).
Nevertheless, current sequencing capability allows us to detect and resolve a large fraction of breakpoints. For example, with an Illumina run with 2 Kbp inserts and 25 × 106 mappable reads one could detect nearly 100% of breakpoints with an average resolution-ambiguity of less than 500 bp.
Mixing insert lengths
Note that the resolution-ambiguity |Θ| ≤ L1, or |Θ| = L2 can be obtained using single insert libraries, but the likelihood of resolving between L1 and L2 is optimized by using an appropriate mix of the two libraries. Analogous equations can be derived when two overlapping inserts or more are required to detect a breakpoint.
Next, we collected all inserts with mean insert-size either 200 bp, or 2000 bp. For a fixed amount of sequencing, we confirmed the theoretically predicted boost in probability of detecting a breakpoint to within a resolution-ambiguity of 200 bp. The results are in Figure 3. The probability is doubled from 0.15 to over 0.29 using a mix of insert libraries. Similar results are obtained for other sequencing studies, such as an ABI SOLiD sequencing with 600 and 2700 length libraries (data not shown). In a further extension of the analysis, we show that to maximize the likelihood of resolving breakpoints to s bp, we need only two libraries-one with insert-length s, and the other as large as possible (see METHODS). A restatement of these results can be found in Additional file 1. We note that only 1.5× mapped sequence coverage of the human genome using Illumina (Solexa) can help resolve almost 90% of the breakpoints to within 200 bp using a mix of inserts. Similar results were obtained when applied to runs from the ABI SOLiD system .
While our analytical results treat the insert sizes as fixed, empirical data very closely approximates the theoretical curve (Figure 3, dotted lines). Though the theoretical model performs better (mostly due to mapping variation resulting from repeat-like genomic regions), the magnitude of the 'boost' at 200 bp is maintained. The concordance between theoretical and experimental results shows the limited effect of insert-length variation.
It is useful to revisit the case of SVs with very small lengths. Mechanisms such as non-homologous end-joining (NHEJ), often gives rise to small insertions and deletions , that are valuable as genetic markers. If the event size is smaller than the variance in available insert-size, the event will not be detected by paired end mapping (in the case of deletions and insertions). In these situations, detection is improved by longer reads (such as those available in Roche-454). If single reads are used to detect the fusion point, then there is no ambiguity in resolution. In that case, the design question becomes simple, and the desired number of reads can be computed using the Clark-Carbon formula, and scaled using the mapping parameter f.
As transcripts have variable expression, the amount of sequencing needed to detect a transcript is variable. A key design issue is to determine if sufficient sequencing has been performed to sample all transcripts at a certain expression level. For example, in large patient surveys one needs to identify the number of samples that can be sequenced at minimal cost, while ensuring detection of genes at a desired expression level. Similarly, when evaluating a given sample it is important to know whether the required sequencing depth has been reached, or if more sequencing is necessary to detect a given transcript, isoform, or fusion gene. We show here that a relatively low level of transcriptomic sequencing has sufficient information regarding the variability of expression that it can be used to compute the likelihood of a specific transcript being sampled.
We tested the predictive accuracy of Eq. 5 using data from Marioni et al. . An empirical p.d.f was derived (see METHODS) from the total sequence used in each of two tissue studies (kidney and liver, ~35 × 106 reads each). Additional file 2a shows the similarity between the empirical distribution of normalized-expression values between the two studies.
Previous work has indicated that gene expression distributions typically follow a power-law [22, 23]. Nacher et al. extended this idea, accounting for stochastic noise to provide better fits for low expressed genes . We created a novel regression based strategy (METHODS) to correct for the bias, by fitting a power-law to high-expressed genes and using the simplified variant of models proposed by Nacher et al, to accurately approximate genes with low expression levels. The corrected curves (blue, red dotted lines) track the true estimates closely, even when using a sparse set of 100 K reads. With 1 million reads, > 90% of the total observed transcripts were sampled. In this data f is well-conserved across samples (as seen in kidney and liver, Additional file 2a). For example, the expression p.d.f. for kidney can be used to roughly predict the probability of detection for liver (Additional file 2b). This implies that f may not need to be re-estimated independently for related samples.
We present a number of analytic and empirical results on the design of sequencing experiments for uncovering genetic variation. Our study provides a systematic explanation for empirical observations relating to the amount of sequencing, and the choice of technologies. The theoretical analysis is not without caveats, which are discussed below. Nevertheless, the concordance with empirical data illustrates the applicability of our methods. Some of the results, while not counter-intuitive, provide additional insight. For example, we show that the best design for detecting SV to within 's' bp demands the choice of exactly two insert-lengths, one close to s, and the other as large as possible. We explicate the trade-offs between detection and resolution, and provide a method for computing the probability of SV detection as well as the expected resolution-ambiguity for a variety of technology and parameter choices.
Many additional confounding design issues that can be modeled in the context of structural variation. Different technologies have different error rates. This is corrected by introducing a mapping-rate parameter f, defined as the fraction of reads that are mapped unambiguously to the reference. Replacing the number of reads N by fN helps correct somewhat for sequencing errors. New methods have been suggested for dealing with complex scenarios in which it is difficult or impossible to map reads uniquely, such as within recent segmental duplications, using hill climbing  or parsimony  based approaches which try to minimize the number of observed structural variants. Chimerisms in insert-lengths can be controlled by demanding the use of multiple overlapping inserts. We have extended most analyses to requiring two or more inserts (see METHODS).
An important simplification in our analysis is to treat insert-length as constant. However, choosing a distribution on the insert-length does not influence the expected resolution-ambiguity, only its variance. The variance is important for measuring smaller structural variations. Therefore, experiments that aim to detect small structural variations are constrained to using technologies in which the insert-length variation is significantly smaller than the size of the SV itself. The available technologies are constantly reducing the variance in insert-lengths through better library preparation strategies, which might allow the use of larger insert-lengths in the future.
For transcript sequencing, we address the important question of depth of sequencing, given the large variation in transcript abundance. Our results suggest that estimating the distribution of normalized expression values with modest amounts of sequencing can help address design questions for transcript sequencing, even when the transcript abundance varies over many orders of magnitude. This approach has a number of caveats, for example, it assumes unbiased sampling of transcripts. Current library preparations have been shown to have biases (such as 3' and 5' depletion)  as well as biases towards specific RNAs (specifically small RNAs) within a platform . Additionally, though our results indicate a very good empirical fit on human samples, the assumption of a power-law, or other distribution, may not fit all samples. A number of outstanding questions remain, such as the detection of splicing events, and the resolution of breakpoints. While transcript sequencing is a quick way to detect breakpoints, the location of the breakpoint is confounded by trans-splicing. The issues relating to design can be better resolved only after methods are discovered to resolve breakpoints and predict splicing events based on transcriptome sequencing.
We do not address some important applications of next generation sequencing technologies: the detection of rare (and common) sequence variants in re-sequencing studies. Given the relatively high error rates for some of these technologies, reliable and accurate detection of sequence variants (SNPs) is a challenging problem, and general design principles that would be applicable to all technologies will be addressed in future study. The design of sequencing for 'dark-region' identification (i.e. DNA inserts on the sampled genome that are not in the reference) is not addressed. Lastly, there are practical sample preparation issues which demand consideration. Longer insert-lengths consume more sample for equivalent amount of sequencing. Therefore, if the sample is limited (as in tumors), the best design should also seek to optimize a 'sample-cost' versus detection trade-off.
Technological developments all point to the rapid deployment of personalized genomic sequencing. As large populations of individuals are sequenced, and the sequence is analyzed for a variety of applications, design issues relating to the amount of sequencing, the choice of technology, and the choice of technological parameters become paramount. Our paper helps resolve some of these questions. As current technologies mature and new technologies arise it will be critical to further develop a framework to maximize study efficacy.
A set of "true" breakpoints were chosen by mapping Illumina reads for individual NA18507 (obtained from the NCBI short read trace archive) to build 36.1 of the human genome. ELAND alignment tool, where each end mapped separately to detect SVs. Insert libraries were mapped until > 100× insert coverage was reached, in order to obtain a candidate set. To avoid systematic errors within a library (and over-fitting of the test data) at least three distinct libraries were required to span a breakpoint for it to be considered a "true breakpoint". All SV events greater than 2 Kbp were selected to be the final set is considered to be the TRUE BREAKPOINT SET.
To test the theoretical predictions, a 200 bp and a 2 Kbp library were selected at random. For parameter N, we randomly picked N paired-reads in which both ends mapped uniquely to the genome. A true breakpoint was considered to be detected if at least two inserts spanned it. Thus, the fraction of true breakpoints detected was empirically computed. These numbers were compared against theoretical predictions, obtained using Eq. 2,3 respectively. The resolution |Θ ζ | for each detected break-point was computed as follows: for each paired-read that spanned a breakpoint, let x l denote the distance of its left endpoint from the left end of the right-most clone; let x r denote the distance of its right end-point from the left most clone. Then, the resolution is given by L - (x l + x r ). |Θ ζ | was obtained by taking the mean (Θ ζ ) of all overlapping paired-reads. The fraction of "true" breakpoints detected (at least 2 inserts spanning the event) and resolved by these libraries is shown in Figure 3, as a function of N.
Mixing insert lengths
Proof of Optimality of Two Insert Design
We show that it is sufficient to consider exactly two insert lengths for resolving a breakpoint to within 's' bp. We show first that for a given s and N, and a collection of insert-lengths, Pr(|Θ ζ | = s), is maximized using a mixture of ≤ 2 insert lengths.
Assume to the contrary that an optimal mix requires ≥ 3 distinct insert-lengths. This implies that for some insert length L', L ≠ s, and L ≠ L M , where L M is the maximum available insert-length. In other words, either a) L' <s, or, b) s <L' <L M . We consider each case in turn.
L' <s: From earlier discussion, the contribution of the inserts with length L' to Pr(|Θ| ≤ s) is proportional to coverage (c1). Replacing inserts of length L' with inserts of length s will increase coverage without changing N, contradicting optimality.
s <L' <L M : Once again, for inserts larger than the desired resolution-ambiguity s, their contribution to Pr(|Θ| ≤ s) is completely dependent on coverage. Replacing by a insert of length L M improves the resolution probability, a contradiction.
We compute the optimal mix empirically by iterating over N1 ∈ [0, N].
Simulation for mix of inserts
The set of breakpoints, and method for computing mean size of Θ ζ , followed that of the previous simulation. A single 2 Kbp and 200 Kbp library were analyzed, using 4 lanes from each corresponding flow cell. Clusters of invalid pairs were generated by combining the two reduced libraries.
Mapped RNA-seq data, generated by Marioni, et al. , was obtained from http://giladlab.uchicago.edu/data.html. The genomic mappings were converted to a list of overlapping exons in Refseq. For each transcript, a count of the number of reads sampling it was generated. This enabled the estimation of ν t which was calculated as described earlier. To obtain smaller data sets, random sampling of the reads was performed and ν t was re-calculated.
Figure 4a shows a plot of log f(ν) vs. log(ν). Performing a regression analysis on the line reveals the slope α, and the intercept log(β).
Where δ D is a noise parameter (relating to decay of RNA molecules) and N is a normalization constant . Note, that this equation is approximated by the power law at high values of ν.
Generating the fit requires two important steps: fitting a power law at high gene expression and identification of a "reliable point". Note that "high gene expression" can be maintained for all samples (in our simulations we used of overall expression). Performing a regression on gene expression values above this threshold provides δ D . Intuitively, the reliable point can be identified independently for each distribution by determining the point of inflection of the graph log(f(ν)) vs log(ν); the set of points immediately downstream of the inflection are used to fit Equation 6. One can accurately determine a "reliable point", ν r , by computing the gene expression value at which there is a 95% probability of detecting a transcript, , where r is the number of reads. The corrected p.d.f. utilizes the empirically generated p.d.f after this reliable point, and the theoretical p.d.f before then. It is important to note that the empirical p.d.f. derived using all reads implies that there is a drop off in abundance for very low abundance genes, which the fitting procedure would over-predict. However, this could be an artifact of incomplete sampling and a regression of the full data may provide a better estimate.
The research was supported by a grant from the NIH RO1-HG004962-01.
- Bentley DR: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456 (7218): 53-59. 10.1038/nature07517.PubMed CentralPubMedView ArticleGoogle Scholar
- Korbel J, Urban A, Affourtit J, Godwin B, Grubert F, Simons J, Kim P, Palejev D, Carriero N, Du L: Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007, 318 (5849): 420-10.1126/science.1149504.PubMed CentralPubMedView ArticleGoogle Scholar
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 1509-1517. 10.1101/gr.079558.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM: Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009, 458: 97-101. 10.1038/nature07638.PubMed CentralPubMedView ArticleGoogle Scholar
- Johnson D, Mortazavi A, Myers R, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007, 316 (5830): 1497-10.1126/science.1141319.PubMedView ArticleGoogle Scholar
- Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452 (7189): 872-876. 10.1038/nature06884.PubMedView ArticleGoogle Scholar
- McKernan K, Blanchard A, Kotler L, Costa G: Reagents, Methods, and Libraries for Bead-based Sequencing. 2006, [U.S. Patent 084132]Google Scholar
- Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, Bibillo A, Bjorn-son K, Chaudhuri B, Christians F, Cicero R, Clark S, Dalal R, Dewinter A, Dixon J, Foquet M, Gaertner A, Hardenbol P, Heiner C, Hester K, Holden D, Kearns G, Kong X, Kuse R, Lacroix Y, Lin S, Lundquist P, Ma C, Marks P, Maxham M, Murphy D, Park I, Pham T, Phillips M, Roy J, Sebra R, Shen G, Sorenson J, Tomaney A, Travers K, Trulson M, Vieceli J, Wegener J, Wu D, Yang A, Zaccarin D, Zhao P, Zhong F, Korlach J, Turner S: Real-time DNA sequencing from single polymerase molecules. Science. 2009, 323 (5910): 133-138. 10.1126/science.1162986.PubMedView ArticleGoogle Scholar
- Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, Di Ventra M, Garaj S, Hibbs A, Huang X, Jovanovich SB, Krstic PS, Lindsay S, Ling XS, Mastrangelo CH, Meller A, Oliver JS, Pershin YV, Ramsey JM, Riehn R, Soni GV, Tabard-Cossa V, Wanunu M, Wiggin M, Schloss JA: The potential and challenges of nanopore sequencing. Nat Biotechnol. 2008, 26 (10): 1146-1153. 10.1038/nbt.1495.PubMed CentralPubMedView ArticleGoogle Scholar
- Pihlak A, Baurén G, Hersoug E, Lönnerberg P, Metsis A, Linnarsson S: Rapid genome sequencing with short universal tiling probes. Nat Biotechnol. 2008, 26 (6): 676-684. 10.1038/nbt1405.PubMedView ArticleGoogle Scholar
- Chaisson MJ, Brinza D, Pevzner PA: De novo fragment assembly with short mate-paired reads: Does the read length matter?. Genome Res. 2009, 19: 336-346. 10.1101/gr.079053.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.PubMed CentralPubMedView ArticleGoogle Scholar
- Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D, Eichler EE: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet. 2005, 77: 78-88. 10.1086/431652.PubMed CentralPubMedView ArticleGoogle Scholar
- Volik S, Zhao S, Chin K, Brebner J, Herndon D, Tao Q, Kowbel D, Huang G, Lapuk A, Kuo W: End-sequence profiling: Sequence-based analysis of aberrant genomes. Proc Natl Acad Sci USA. 2003, 100 (13): 7696-7701. 10.1073/pnas.1232418100.PubMed CentralPubMedView ArticleGoogle Scholar
- Tuzun E, Sharp A, Bailey J, Kaul R, Morrison V, Pertz L, Haugen E, Hayden H, Albertson D, Pinkel D, Olson M, Eichler E: Fine-scale structural variation of the human genome. Nat Genet. 2005, 37 (7): 727-732. 10.1038/ng1562.PubMedView ArticleGoogle Scholar
- Sindi S, Helman E, Bashir A, Raphael B: A geometric approach for classification and comparison of structural variants. Bioinformatics. 2009, 25 (12): i222-10.1093/bioinformatics/btp208.PubMed CentralPubMedView ArticleGoogle Scholar
- Korbel J, Abyzov A, Mu X, Carriero N, Cayting P, Zhang Z, Snyder M, Gerstein M: PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biology. 2009, 10 (2): R23-10.1186/gb-2009-10-2-r23.PubMed CentralPubMedView ArticleGoogle Scholar
- Bashir A, Volik S, Collins C, Bafna V, Raphael B: Evaluation of Paired-End Sequencing Strategies for Detection of Genome Rearrangements in Cancer. PLoS Computational Biology. 2008, 4 (4): 10.1371/journal.pcbi.1000051.Google Scholar
- Clarke L, Carbon J: A colony bank containing synthetic Col El hybrid plasmids representative of the entire E. coli genome. Cell. 1976, 9: 91-99. 10.1016/0092-8674(76)90055-6.PubMedView ArticleGoogle Scholar
- Lander ES, Waterman MS: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988, 2 (3): 231-239. 10.1016/0888-7543(88)90007-9.PubMedView ArticleGoogle Scholar
- McKernan KJ: Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 2009, 19: 1527-1541. 10.1101/gr.091868.109.PubMed CentralPubMedView ArticleGoogle Scholar
- Furusawa C, Kaneko K: Zipf's law in gene expression. Cell Phys Rev Lett. 2003, 90 (243): 088102-10.1103/PhysRevLett.90.088102.View ArticleGoogle Scholar
- Ueda H, Hayashi S, Matsuyama S, Yomo T, Hashimoto S, Kay S, Hogenesch J, Iino M: Universality and flexibility in gene expression from bacteria to human. Proceedings of the National Academy of Sciences. 2004, 101 (11): 3765-3769. 10.1073/pnas.0306244101.View ArticleGoogle Scholar
- Nacher J, Akutsu T: Sensitivity of the power-law exponent in gene expression distribution to mRNA decay rate. Physics Letters A. 2006, 360: 174-178. 10.1016/j.physleta.2006.07.076.View ArticleGoogle Scholar
- Lee S, Cheran E, Brudno M: A robust framework for detecting structural variations in a genome. Bioinformatics. 2008, 24 (13): i59-10.1093/bioinformatics/btn176.PubMed CentralPubMedView ArticleGoogle Scholar
- Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC: Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 2009, 19: 1270-1278. 10.1101/gr.088633.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Linsen S, de Wit E, Janssens G, Heater S, Chapman L, Parkin R, Fritz B, Wyman S, de Bruijn E, Voest E: Limitations and possibilities of small RNA digital gene expression profiling. Nature Methods. 2009, 6 (7): 474-476. 10.1038/nmeth0709-474.PubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.