Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance
 Ali Bashir^{1},
 Vikas Bansal^{2} and
 Vineet Bafna^{1}
DOI: 10.1186/1471-2164-11-385
© Bashir et al; licensee BioMed Central Ltd. 2010
Received: 23 December 2009
Accepted: 18 June 2010
Published: 18 June 2010
Abstract
Background
Massively parallel DNA sequencing technologies have enabled the sequencing of several individual human genomes. These technologies are also being used in novel ways for mRNA expression profiling, genome-wide discovery of transcription-factor binding sites, small RNA discovery, etc. The multitude of sequencing platforms, each with its unique characteristics, poses a number of design challenges regarding the technology to be used and the depth of sequencing required for a particular sequencing application. Here we describe a number of analytical and empirical results to address design questions for two applications: detection of structural variations from paired-end sequencing and estimation of mRNA transcript abundance.
Results
For structural variation, our results provide explicit trade-offs between the detection and resolution of rearrangement breakpoints, and the optimal mix of paired-read insert lengths. Specifically, we prove that optimal detection and resolution of breakpoints is achieved using a mix of exactly two insert library lengths. Furthermore, we derive explicit formulae to determine these insert length combinations, enabling a 15% improvement in breakpoint detection at the same experimental cost. On empirical short read data, these predictions show good concordance with Illumina 200 bp and 2 Kbp insert length libraries. For transcriptome sequencing, we determine the sequencing depth needed to detect rare transcripts from a small pilot study. With only 1 million reads, we derive corrections that enable almost perfect prediction of the underlying expression probability distribution, and use this to predict the sequencing depth required to detect low expressed genes with greater than 95% probability.
Conclusions
Together, our results form a generic framework for many design considerations related to high-throughput sequencing. We provide software tools (http://bix.ucsd.edu/projects/NGSDesignTools) to derive platform-independent guidelines for designing sequencing experiments (amount of sequencing, choice of insert length, mix of libraries) for novel applications of next generation sequencing.
Background
Massively parallel sequencing technologies provide precise digital readouts of both static (genomic) and dynamic (expression) cellular information. In genetic variation, whole genome sequencing uncovers a complete catalog of all types of variants including SNPs [1] and structural variations [2]. Transcript sequencing [3, 4], small RNA sequencing and ChIP-Seq [5] allow a measurement of dynamic cellular processes. These technologies provide unprecedented opportunities for genomics research but also pose significant new challenges in terms of making the optimal use of the sequencing throughput. The individual laboratory might not be equipped to provide correct and cost-effective designs for the new experiments. By 'design', we refer to questions such as "How much sequencing needs to be done in order to reliably detect all structural variations in the sample to a resolution of 400 bp?" Confounding this further is the proliferation of a large number of sequencing technologies, including three widely used platforms, Roche/454 [6], Illumina [1] and ABI SOLiD [7], and others such as Pacific BioSciences [8] and Helicos [9, 10]. These technologies offer the end-user a bewildering array of design-parameters, including cost per base, read-length, sequencing error rates, clone/insert lengths, etc. It is not straightforward to make a reasoned choice of technology and design-parameters in conducting a particular experiment. Likewise, the technology developers are faced with difficult choices on which parameters to improve in future development.
For any particular application, the goal of the researcher is to achieve the desired objective in a cost-effective manner. For example, in genome resequencing, the primary objective is the sensitive and accurate identification of various forms of sequence variants. Accurate SNP detection can be achieved even using short 36 bp Illumina reads [1]. However, for other applications such as de novo assembly of genomes, longer reads are significantly better than short reads [11]. RNA-seq is a novel application of sequencing to determine the expression levels of different mRNA transcripts in the cell [12]. However, the exponential variability in transcript expression levels poses new design questions regarding the required depth of sequencing to sample low abundance transcripts. Resolving such design questions can allow one to expand the scope of next-generation sequencing in novel directions. In this paper, we address and resolve some of the common design questions relating to structural variation and transcript profiling.
Structural variation
Modeling SV detection
As detailed below, and in Figures 1a-d, SVs often involve the creation of breakpoints: a pair of coordinates (a, b) in the reference genome that are brought together to a single location ζ in the query. Consider the deletion event in Figure 1a. A reference segment of length l = b - a + 1 is absent in the query, relative to the reference. For the breakpoint (a, b) to be detected, a paired-end insert must span ζ. Note that the insert-size is not fixed, but distributed tightly around a mean (L ± σ). The deletion is confirmed if the breakpoint is spanned and l >> σ. Typically, σ << L, so we simply require that l > L, which is sufficient but not strictly necessary.
This approximation illustrates an important difference between 'algorithm design' for SV detection and experiment design. Using a clever algorithm based on higher coverage and variation in insert length (σ), it may be possible to detect smaller deletions (σ < l < L) as well. However, in deciding how much sequencing to do, we focus only on l > L. This simplification allows us to handle many different types of SV using identical design criteria. The similarity to other cases is described below.
The case of inversion is shown in Figure 1b. Here, two breakpoints, (a_{1}, b_{1}) and (a_{2}, b_{2}), are fused together in the query. Denote the length of the inversion SV as l = b_{1} - a_{1} = b_{2} - a_{2}. The inversion is detected when both breakpoints are detected. As in the case of deletions, either breakpoint is detected when a shotgun insert of the query spans the corresponding fusion point and has exactly one endpoint inside the inversion. We enforce this by requiring that L < l, even though the condition is sufficient but not strictly necessary.
 1. One or more of the SV breakpoints must be detected.
 2. For each breakpoint:
 (a) A shotgun insert must span the corresponding fusion point.
 (b) The reads at the ends of the insert must map unambiguously to the ends.
 (c) The insert-size must be dominated by the SV length (l > L).
This abstraction clarifies the design questions considerably. While the algorithmic questions must still deal with each SV separately, the design questions focus on breakpoint detection. We consider 2(b, c) first. For any choice of technology, and insert length, the distribution can be empirically computed by looking at concordantly mapped reads. Using this distribution, we can compute the probability of a randomly picked insert having a specific size.
The choice of a specific read-length is somewhat less important, but the reads must be long and accurate enough to map unambiguously. We model both points by introducing a parameter f, referring to the fraction of reads that map unambiguously. Therefore, if N inserts must map unambiguously to satisfy design constraints, then N/f inserts need to be sequenced, on average. In the remainder, we limit the discussion to detecting breakpoints, considering only the technologies and insert-sizes that satisfy the size-constraint (1), and we assume a mapping parameter f to scale the answers. The issue now is, first, to choose from the available insert-sizes and, second, to determine the amount of sequencing. In this paper, we formulate and resolve design issue 2(a) as:

Given a choice of insert-sizes, and parameter ε, compute the amount of sequencing needed to detect 1 - ε of all breakpoints in the query genome.
We address the question of breakpoint detection in conjunction with the related notion of breakpoint resolution. With most technologies, a breakpoint is detected as a pair of regions ([a_{1}, a_{2}], [b_{1}, b_{2}]), such that a ∈ [a_{1}, a_{2}] and b ∈ [b_{1}, b_{2}]. The resolution, defined by a_{2} - a_{1} + b_{2} - b_{1}, refers to the uncertainty in determining (a, b). Good resolution is critical to elucidating the phenotypic impact of the variation. In an earlier work, we described the use of tightly resolved breakpoints in detecting gene fusion events in cancer [18]. This framework was extended to form a general geometric approach for detecting structural variants [16]. We reformulate and resolve the question:

Given a choice of insert-sizes, and parameters ε, s, compute the amount of sequencing needed to detect 1 - ε of all breakpoints in the query genome to a resolution of ≤ s bp.
Intuitively, the likelihood of detection would be maximized by choosing the largest available insert-size. However, longer insert-sizes increase the uncertainty in resolving the breakpoint. One result of our paper is an explicit trade-off between detection and resolution. We also derive a formula that computes the probability of resolving a breakpoint to within 's' base-pairs, given a fixed number of shotgun reads from a specific paired-end sequencing technology. Another result of our paper is that it is advantageous to use a mix of insert-sizes. For example, we can show that only 1.5× mapped sequence coverage of the human genome using Illumina (Solexa) can help resolve almost 90% of the breakpoints to within 200 bp using a mix of inserts. All other parameters being equal, we show that the best resolution of a structural variant comes from using exactly two possible insert-lengths: one that is as close as possible to the desired resolution, and one that is as long as technologically possible (with reasonable quality).
In summary, the researcher can use our formulae in designing an experiment to (a) select appropriate insert-sizes and (b) determine the optimum amount of sequencing for each insert library. A web-based tool based on the above is available.
Transcript sequencing
Transcript sequencing is a direct approach for measuring abundance, as well as variations involving splicing and SV-mediated gene disruptions and fusions [3]. In most transcript sequencing methods, RNA is fragmented and converted into cDNA, which is subsequently sequenced and mapped back to a reference [12]. This protocol has shown great promise in detecting aberrant splice forms and SVs that lead to gene disruptions and fusions [4].
Often, transcript sequencing is used for gene expression profiling (see Figure 1e). The significant difference in sampled reads (5 to 1) between Samples 1 and 2 suggests that gene A's expression level has changed between the two samples. In measuring relative abundance, RNA-seq mimics older technologies like microarrays. However, sequencing stands alone in being able to compute relative abundance between two distinct transcripts. In Sample 1, the difference in read coverage between genes A and B suggests that A is more than twice as abundant as B (assuming A and B are approximately the same length).
A typical design question for transcript sequencing is to determine the amount of sequencing required to sample a given fraction (say, 90%) of the expressed transcripts. The question is particularly difficult to answer because different transcripts have vastly different normalized-expression values. Using empirical and analytical observations, we show that the p.d.f. of the normalized-expression can be computed using a small sample. Therefore, a researcher can start with an initial sequencing run (< 500 K reads), and use the mapping data to compute the additional amount of sequencing needed. Formally, we resolve the following:

Given transcript mappings from a small sample of sequences, and parameter ε, compute the amount of additional sequencing needed to detect 1 - ε of all expressed transcripts.
Our results are based on a novel extrapolation for the low abundance genes that are not accurately represented in the sample. They allow researchers to efficiently allocate resources for large RNA sequencing studies. This is particularly relevant when many related samples are being sequenced and one needs to assess the trade-offs between sequencing depth and sample coverage.
Results and Discussion
Structural Variation
As discussed in the introduction, we can limit the question of SV detection to detection of SV breakpoints. Let breakpoint (a, b) in the reference genome fuse to a single point ζ in the query genome. Let P_{ζ} denote the probability that an arbitrary breakpoint is detected. Our goal is to derive an expression for P_{ζ}, given a certain amount of sequencing.
Direct application of breakpoint formulae requires that one selects from insert-sizes that are smaller than the desired SV length. In the following, we work with available inserts, where the mean insert-size ranges from L = 200 bp to L = 10 Kbp. Therefore, a result that says P_{ζ} = 0.9 can be interpreted to mean that 90% of all breakpoints from SVs of length significantly larger than L can be detected. These specific values are chosen for illustration purposes only. Identical results apply for smaller or larger SVs, except that we would be limited to choosing from appropriate insert-sizes. All analytical results are derived assuming a fixed value for L. However, all results on real data use the natural variation in insert-size, and show excellent concordance with the analytical results.
Detection-Resolution trade-off
Equations 2 and 3 provide an SV detection versus resolution trade-off. For a fixed number of sequences N, increasing L increases the probability of detection, but also increases the resolution-ambiguity. The effect decreases for large N. To validate this using experimental data, we used the publicly available Illumina-generated human reference sequence from NA18507, a Yoruban male [1]. Using the complete data, we computed a set of "true breakpoints" from SVs of length ≥ 2000 (see METHODS).
Nevertheless, current sequencing capability allows us to detect and resolve a large fraction of breakpoints. For example, with an Illumina run with 2 Kbp inserts and 25 × 10^{6} mappable reads one could detect nearly 100% of breakpoints with an average resolution-ambiguity of less than 500 bp.
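As a rough illustration of how the probability of detection grows with N and L, the sketch below assumes that insert start positions are uniform along the genome, so that the number of inserts spanning a fixed breakpoint is approximately Poisson with mean c = NL/G (the classical Clarke-Carbon setting). The function name, the genome length G = 3 × 10^{9} bp, and the reading of N as the number of unambiguously mapped inserts are assumptions made for illustration; the code is not a restatement of Equations 2 and 3.

```python
import math

HUMAN_GENOME_BP = 3.0e9  # assumed genome length G, for illustration only


def detection_probability(n_inserts, insert_len, genome_len=HUMAN_GENOME_BP):
    """Probability that at least one insert spans a given breakpoint.

    Assumes uniform insert placement, so the number of inserts spanning a
    fixed point is roughly Poisson with mean c = N * L / G.
    """
    coverage = n_inserts * insert_len / genome_len
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    # Compare a short and a long library at several sequencing depths.
    for n in (1e6, 5e6, 25e6):
        p_short = detection_probability(n, 200)
        p_long = detection_probability(n, 2000)
        print(f"N = {n:.0e}: 200 bp -> {p_short:.4f}, 2 Kbp -> {p_long:.4f}")
```

Under these assumptions, the longer library always has the higher probability of detection at a fixed N, which is the detection side of the trade-off; the cost in resolution-ambiguity is not modeled here.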
Mixing insert lengths
Note that the resolution-ambiguity Θ ≤ L_{1}, or Θ = L_{2}, can be obtained using single insert libraries, but the likelihood of resolving between L_{1} and L_{2} is optimized by using an appropriate mix of the two libraries. Analogous equations can be derived when two or more overlapping inserts are required to detect a breakpoint.
Next, we collected all inserts with mean insert-size either 200 bp or 2000 bp. For a fixed amount of sequencing, we confirmed the theoretically predicted boost in probability of detecting a breakpoint to within a resolution-ambiguity of 200 bp. The results are in Figure 3. The probability roughly doubles, from 0.15 to over 0.29, using a mix of insert libraries. Similar results are obtained for other sequencing studies, such as an ABI SOLiD sequencing with 600 and 2700 length libraries (data not shown). In a further extension of the analysis, we show that to maximize the likelihood of resolving breakpoints to s bp, we need only two libraries: one with insert-length s, and the other as large as possible (see METHODS). A restatement of these results can be found in Additional file 1. We note that only 1.5× mapped sequence coverage of the human genome using Illumina (Solexa) can help resolve almost 90% of the breakpoints to within 200 bp using a mix of inserts. Similar results were obtained when applied to runs from the ABI SOLiD system [21].
While our analytical results treat the insert sizes as fixed, empirical data very closely approximates the theoretical curve (Figure 3, dotted lines). Though the theoretical model performs better (mostly due to mapping variation resulting from repeat-like genomic regions), the magnitude of the 'boost' at 200 bp is maintained. The concordance between theoretical and experimental results shows the limited effect of insert-length variation.
It is useful to revisit the case of SVs with very small lengths. Mechanisms such as non-homologous end-joining (NHEJ) often give rise to small insertions and deletions [2] that are valuable as genetic markers. If the event size is smaller than the variance in available insert-size, the event will not be detected by paired-end mapping (in the case of deletions and insertions). In these situations, detection is improved by longer reads (such as those available from Roche/454). If single reads are used to detect the fusion point, then there is no ambiguity in resolution. In that case, the design question becomes simple, and the desired number of reads can be computed using the Clarke-Carbon formula, and scaled using the mapping parameter f.
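The single-read calculation can be sketched as follows, using the Clarke-Carbon relation N = ln(ε) / ln(1 - L/G) for the number of mapped reads needed to cover a fixed point with probability 1 - ε, and then dividing by f. The function name, the default genome length, and the use of the full read length as the effective covering length are assumptions for illustration; in practice a read would need some minimum overlap on each side of the fusion point, which shortens the effective length.

```python
import math


def reads_needed(epsilon, effective_len, genome_len=3.0e9, f=1.0):
    """Reads to sequence so that a fixed point is covered with prob. 1 - epsilon.

    epsilon       : tolerated miss probability, 0 < epsilon < 1
    effective_len : length (bp) over which a read can cover the point (assumed)
    f             : fraction of reads that map unambiguously, 0 < f <= 1
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    if not 0.0 < f <= 1.0:
        raise ValueError("f must lie in (0, 1]")
    # Clarke-Carbon: (1 - L/G)^N = epsilon  =>  N = ln(epsilon) / ln(1 - L/G)
    mapped = math.log(epsilon) / math.log1p(-effective_len / genome_len)
    # Only a fraction f of sequenced reads map, so scale up by 1 / f.
    return math.ceil(mapped / f)


if __name__ == "__main__":
    for f in (1.0, 0.8, 0.5):
        print(f"f = {f}: {reads_needed(0.05, 400, f=f):,} reads")
```

The scaling by 1/f is the only place the mapping parameter enters, so halving f simply doubles the amount of sequencing required.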
Transcript sequencing
As transcripts have variable expression, the amount of sequencing needed to detect a transcript is variable. A key design issue is to determine if sufficient sequencing has been performed to sample all transcripts at a certain expression level. For example, in large patient surveys one needs to identify the number of samples that can be sequenced at minimal cost, while ensuring detection of genes at a desired expression level. Similarly, when evaluating a given sample it is important to know whether the required sequencing depth has been reached, or if more sequencing is necessary to detect a given transcript, isoform, or fusion gene. We show here that a relatively low level of transcriptomic sequencing has sufficient information regarding the variability of expression that it can be used to compute the likelihood of a specific transcript being sampled.
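To make the notion of "likelihood of a specific transcript being sampled" concrete, the sketch below assumes that each read independently originates from a transcript with probability equal to its normalized expression ν, so that a transcript is seen at least once in r reads with probability 1 - (1 - ν)^r. This independence assumption, the function names, and the toy expression values are illustrative and are not taken from the paper's equations.

```python
import math


def detection_prob(nu, n_reads):
    """P(at least one of n_reads hits a transcript with normalized expression nu)."""
    return -math.expm1(n_reads * math.log1p(-nu))  # 1 - (1 - nu)^r, computed stably


def min_expression_detectable(n_reads, p=0.95):
    """Smallest nu detected with probability >= p using n_reads reads."""
    return -math.expm1(math.log1p(-p) / n_reads)


def expected_fraction_detected(nus, n_reads):
    """Expected fraction of transcripts (expression values nus) seen at least once."""
    return sum(detection_prob(nu, n_reads) for nu in nus) / len(nus)


def reads_for_fraction(nus, epsilon):
    """Smallest read count whose expected detected fraction is >= 1 - epsilon.

    Requires every nu to be strictly positive, otherwise the target may be
    unreachable.
    """
    target = 1.0 - epsilon
    lo, hi = 0, 1000
    while expected_fraction_detected(nus, hi) < target:  # grow an upper bound
        lo, hi = hi, hi * 2
    while lo + 1 < hi:  # binary search: lo misses the target, hi meets it
        mid = (lo + hi) // 2
        if expected_fraction_detected(nus, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    toy_nus = [1e-3, 1e-4, 1e-5, 1e-6]  # hypothetical normalized expression values
    print("reads for 95% of toy transcripts:", reads_for_fraction(toy_nus, 0.05))
    print("nu detectable at 95% with 1e6 reads:", min_expression_detectable(10**6))
```

In a real study, the list of ν values would come from the estimated expression distribution rather than from a fixed toy list, and the rarest transcripts dominate the required depth.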
We tested the predictive accuracy of Eq. 5 using data from Marioni et al. [3]. An empirical p.d.f. was derived (see METHODS) from the total sequence used in each of two tissue studies (kidney and liver, ~35 × 10^{6} reads each). Additional file 2a shows the similarity between the empirical distributions of normalized-expression values in the two studies.
Previous work has indicated that gene expression distributions typically follow a power-law [22, 23]. Nacher et al. extended this idea, accounting for stochastic noise to provide better fits for low expressed genes [24]. We created a novel regression-based strategy (METHODS) to correct for the bias, by fitting a power-law to highly expressed genes and using a simplified variant of the models proposed by Nacher et al. to accurately approximate genes with low expression levels. The corrected curves (blue, red dotted lines) track the true estimates closely, even when using a sparse set of 100 K reads. With 1 million reads, > 90% of the total observed transcripts were sampled. In these data, f is well-conserved across samples (as seen in kidney and liver, Additional file 2a). For example, the expression p.d.f. for kidney can be used to roughly predict the probability of detection for liver (Additional file 2b). This implies that f may not need to be re-estimated independently for related samples.
Conclusions
We present a number of analytic and empirical results on the design of sequencing experiments for uncovering genetic variation. Our study provides a systematic explanation for empirical observations relating to the amount of sequencing, and the choice of technologies. The theoretical analysis is not without caveats, which are discussed below. Nevertheless, the concordance with empirical data illustrates the applicability of our methods. Some of the results, while not counter-intuitive, provide additional insight. For example, we show that the best design for detecting SV to within 's' bp demands the choice of exactly two insert-lengths, one close to s, and the other as large as possible. We explicate the trade-offs between detection and resolution, and provide a method for computing the probability of SV detection as well as the expected resolution-ambiguity for a variety of technology and parameter choices.
Many additional confounding design issues can be modeled in the context of structural variation. Different technologies have different error rates. This is corrected by introducing a mapping-rate parameter f, defined as the fraction of reads that are mapped unambiguously to the reference. Replacing the number of reads N by fN helps correct somewhat for sequencing errors. New methods have been suggested for dealing with complex scenarios in which it is difficult or impossible to map reads uniquely, such as within recent segmental duplications, using hill climbing [25] or parsimony [26] based approaches which try to minimize the number of observed structural variants. Chimerisms in insert-lengths can be controlled by demanding the use of multiple overlapping inserts. We have extended most analyses to requiring two or more inserts (see METHODS).
An important simplification in our analysis is to treat insert-length as constant. However, choosing a distribution on the insert-length does not influence the expected resolution-ambiguity, only its variance. The variance is important for measuring smaller structural variations. Therefore, experiments that aim to detect small structural variations are constrained to using technologies in which the insert-length variation is significantly smaller than the size of the SV itself. The available technologies are constantly reducing the variance in insert-lengths through better library preparation strategies, which might allow the use of larger insert-lengths in the future.
For transcript sequencing, we address the important question of depth of sequencing, given the large variation in transcript abundance. Our results suggest that estimating the distribution of normalized expression values with modest amounts of sequencing can help address design questions for transcript sequencing, even when the transcript abundance varies over many orders of magnitude. This approach has a number of caveats; for example, it assumes unbiased sampling of transcripts. Current library preparations have been shown to have biases (such as 3' and 5' depletion) [12] as well as biases towards specific RNAs (specifically small RNAs) within a platform [27]. Additionally, though our results indicate a very good empirical fit on human samples, the assumption of a power-law, or other distribution, may not fit all samples. A number of outstanding questions remain, such as the detection of splicing events, and the resolution of breakpoints. While transcript sequencing is a quick way to detect breakpoints, the location of the breakpoint is confounded by trans-splicing. The issues relating to design can be better resolved only after methods are discovered to resolve breakpoints and predict splicing events based on transcriptome sequencing.
We do not address some important applications of next generation sequencing technologies, such as the detection of rare (and common) sequence variants in resequencing studies. Given the relatively high error rates for some of these technologies, reliable and accurate detection of sequence variants (SNPs) is a challenging problem, and general design principles that would be applicable to all technologies will be addressed in a future study. The design of sequencing for 'dark-region' identification (i.e. DNA inserts on the sampled genome that are not in the reference) is not addressed. Lastly, there are practical sample preparation issues which demand consideration. Longer insert-lengths consume more sample for an equivalent amount of sequencing. Therefore, if the sample is limited (as in tumors), the best design should also seek to optimize a 'sample-cost' versus detection trade-off.
Technological developments all point to the rapid deployment of personalized genomic sequencing. As large populations of individuals are sequenced, and the sequence is analyzed for a variety of applications, design issues relating to the amount of sequencing, the choice of technology, and the choice of technological parameters become paramount. Our paper helps resolve some of these questions. As current technologies mature and new technologies arise it will be critical to further develop a framework to maximize study efficacy.
Methods
Breakpoint Resolution
Simulation
A set of "true" breakpoints was chosen by mapping Illumina reads for individual NA18507 (obtained from the NCBI short read trace archive) to build 36.1 of the human genome using the ELAND alignment tool, with each end mapped separately to detect SVs. Insert libraries were mapped until > 100× insert coverage was reached, in order to obtain a candidate set. To avoid systematic errors within a library (and overfitting of the test data), at least three distinct libraries were required to span a breakpoint for it to be considered a "true breakpoint". All SV events greater than 2 Kbp were selected, and this final set is considered to be the TRUE BREAKPOINT SET.
To test the theoretical predictions, a 200 bp and a 2 Kbp library were selected at random. For parameter N, we randomly picked N paired-reads in which both ends mapped uniquely to the genome. A true breakpoint was considered to be detected if at least two inserts spanned it. Thus, the fraction of true breakpoints detected was empirically computed. These numbers were compared against theoretical predictions, obtained using Eqs. 2 and 3, respectively. The resolution Θ_{ζ} for each detected breakpoint was computed as follows: for each paired-read that spanned a breakpoint, let x_{l} denote the distance of its left endpoint from the left end of the rightmost clone; let x_{r} denote the distance of its right endpoint from the right end of the leftmost clone. Then, the resolution is given by L - (x_{l} + x_{r}). Θ_{ζ} was obtained by taking the mean of this quantity over all overlapping paired-reads. The fraction of "true" breakpoints detected (at least 2 inserts spanning the event) and resolved by these libraries is shown in Figure 3, as a function of N.
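The resolution computation can be illustrated with a small sketch. It works in idealized query coordinates, where every spanning insert is an interval containing the fusion point and the breakpoint must lie in the region shared by all of them; for a fixed insert length L, the width of that shared region equals L - (x_{l} + x_{r}). The function name and the example coordinates are hypothetical.

```python
def resolution_ambiguity(inserts, breakpoint):
    """Width of the region shared by all inserts that span `breakpoint`.

    inserts    : iterable of (start, end) half-open intervals, query coordinates
    breakpoint : position of the fusion point in the same coordinates
    Returns None if no insert spans the breakpoint (breakpoint not detected).
    """
    spanning = [(s, e) for s, e in inserts if s <= breakpoint < e]
    if not spanning:
        return None
    left = max(s for s, _ in spanning)   # left end of the rightmost clone
    right = min(e for _, e in spanning)  # right end of the leftmost clone
    return right - left


if __name__ == "__main__":
    # Two 2 Kbp inserts and one 200 bp insert span position 2950; one does not.
    inserts = [(1000, 3000), (2500, 4500), (2900, 3100), (5000, 7000)]
    print(resolution_ambiguity(inserts, 2950))
```

In this toy example the two long inserts alone would localize the breakpoint to 500 bp, and adding the short insert narrows it to 100 bp.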
Mixing insert lengths
Proof of Optimality of Two Insert Design
We show that it is sufficient to consider exactly two insert lengths for resolving a breakpoint to within 's' bp. We show first that for a given s and N, and a collection of insert-lengths, Pr(Θ_{ζ} ≤ s) is maximized using a mixture of ≤ 2 insert lengths.
Assume to the contrary that an optimal mix requires ≥ 3 distinct insert-lengths. This implies that some insert length L' in the mix satisfies L' ≠ s and L' ≠ L_{M}, where L_{M} is the maximum available insert-length. In other words, either a) L' < s, or b) s < L' < L_{M}. We consider each case in turn.
L' < s: From earlier discussion, the contribution of the inserts with length L' to Pr(Θ ≤ s) is proportional to coverage (c_{1}). Replacing inserts of length L' with inserts of length s will increase coverage without changing N, contradicting optimality.
s < L' < L_{M}: Once again, for inserts larger than the desired resolution-ambiguity s, their contribution to Pr(Θ ≤ s) is completely dependent on coverage. Replacing them by inserts of length L_{M} improves the resolution probability, a contradiction.
We compute the optimal mix empirically by iterating over N_{1} ∈ [0, N].
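The iteration over N_{1} can be sketched with a small Monte Carlo estimate standing in for the analytical expressions. The sketch assumes uniform insert placement (so the number of inserts of each library spanning the breakpoint is Poisson), fixed insert lengths s and L_{M}, and that a breakpoint counts as detected only when at least two inserts span it, as in the simulation described above. The function names, the genome length, and all parameter values are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def poisson(rng, mean):
    """Knuth's multiplication sampler; adequate for the small means used here."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def prob_resolved(n_short, n_long, s, long_len, genome_len, trials, rng,
                  min_spanning=2):
    """Estimate Pr(breakpoint detected and resolution-ambiguity <= s)."""
    libraries = ((n_short, s), (n_long, long_len))
    hits = 0
    for _ in range(trials):
        # Breakpoint sits at 0; track the region shared by all spanning inserts.
        left, right, spanning = -math.inf, math.inf, 0
        for n_inserts, length in libraries:
            for _insert in range(poisson(rng, n_inserts * length / genome_len)):
                offset = rng.uniform(0.0, length)  # insert start to breakpoint
                left = max(left, -offset)
                right = min(right, length - offset)
                spanning += 1
        if spanning >= min_spanning and right - left <= s:
            hits += 1
    return hits / trials


def best_mix(n_total, s, long_len, genome_len=3.0e9, steps=10, trials=5000,
             seed=0):
    """Scan N_1 over [0, N] on a grid; return (best N_1, estimated probability)."""
    rng = random.Random(seed)
    results = []
    for i in range(steps + 1):
        n_short = n_total * i / steps
        p = prob_resolved(n_short, n_total - n_short, s, long_len,
                          genome_len, trials, rng)
        results.append((n_short, p))
    return max(results, key=lambda item: item[1])


if __name__ == "__main__":
    n_short, p = best_mix(n_total=5e6, s=200, long_len=2000)
    print(f"best N_1 ~ {n_short:.0f} of 5e6 inserts, Pr(resolved to 200 bp) ~ {p:.3f}")
```

Because the estimate is stochastic, the grid resolution and number of trials limit how precisely the optimum is located; under these assumptions the maximum falls strictly between the two single-library designs.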
Simulation for mix of inserts
The set of breakpoints, and the method for computing the mean size of Θ_{ζ}, followed that of the previous simulation. A single 2 Kbp and a single 200 bp library were analyzed, using 4 lanes from each corresponding flow cell. Clusters of invalid pairs were generated by combining the two reduced libraries.
Transcript Sequencing
Mapped RNA-seq data, generated by Marioni et al. [3], was obtained from http://giladlab.uchicago.edu/data.html. The genomic mappings were converted to a list of overlapping exons in RefSeq. For each transcript, a count of the number of reads sampling it was generated. This enabled the estimation of ν_{t}, which was calculated as described earlier. To obtain smaller data sets, random sampling of the reads was performed and ν_{t} was recalculated.
Figure 4a shows a plot of log f(ν) vs. log(ν). Performing a regression analysis on the line reveals the slope α, and the intercept log(β).
Where δ_{D} is a noise parameter (relating to decay of RNA molecules) and N is a normalization constant [24]. Note that this equation is approximated by the power law at high values of ν.
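The regression of log f(ν) on log(ν) can be sketched as an ordinary least-squares line fit, assuming a power law of the form f(ν) = β ν^{α} so that the slope is α and the intercept is log(β). The function name and the synthetic data are hypothetical, natural logarithms are assumed, and the low-expression correction of Nacher et al. is not modeled.

```python
import math


def fit_power_law(nus, densities):
    """Least-squares fit of log f(nu) = alpha * log(nu) + log(beta).

    nus, densities : equal-length sequences of positive values
    Returns (alpha, beta).
    """
    xs = [math.log(v) for v in nus]
    ys = [math.log(d) for d in densities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                    # slope of the log-log line
    log_beta = mean_y - alpha * mean_x   # intercept of the log-log line
    return alpha, math.exp(log_beta)


if __name__ == "__main__":
    # Synthetic, noise-free power law used only to exercise the fit.
    true_alpha, true_beta = -1.8, 2.5
    nus = [1e-6 * 10 ** (i / 4) for i in range(12)]
    densities = [true_beta * v ** true_alpha for v in nus]
    print(fit_power_law(nus, densities))
```

With real data, only expression values above the chosen high-expression threshold would be passed to the fit, since the power law is expected to hold only in that range.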
Generating the fit requires two important steps: fitting a power law at high gene expression and identification of a "reliable point". Note that "high gene expression" can be maintained for all samples (in our simulations we used of overall expression). Performing a regression on gene expression values above this threshold provides δ_{ D }. Intuitively, the reliable point can be identified independently for each distribution by determining the point of inflection of the graph log(f(ν)) vs log(ν); the set of points immediately downstream of the inflection are used to fit Equation 6. One can accurately determine a "reliable point", ν_{ r }, by computing the gene expression value at which there is a 95% probability of detecting a transcript, , where r is the number of reads. The corrected p.d.f. utilizes the empirically generated p.d.f after this reliable point, and the theoretical p.d.f before then. It is important to note that the empirical p.d.f. derived using all reads implies that there is a drop off in abundance for very low abundance genes, which the fitting procedure would overpredict. However, this could be an artifact of incomplete sampling and a regression of the full data may provide a better estimate.
Declarations
Acknowledgements
The research was supported by a grant from the NIH RO1HG00496201.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.