Direct observation of genomic heterogeneity through local haplotyping analysis
© Gulukota et al.; licensee BioMed Central Ltd. 2014
Received: 25 November 2013
Accepted: 8 May 2014
Published: 2 June 2014
It has been an abiding belief among geneticists that multicellular organisms’ genomes can be analyzed under the assumption that a single individual has a uniform genome in all its cells. Despite some evidence to the contrary, this belief has been used as an axiomatic assumption in most genome analysis software packages. In this paper we present observations in human whole genome data, human whole exome data and in mouse whole genome data to challenge this assumption. We show that heterogeneity is in fact ubiquitous and readily observable in ordinary Next Generation Sequencing (NGS) data.
Starting with the assumption that a single NGS read (or read pair) must come from one haplotype, we built a procedure for directly observing haplotypes at a local level by examining 2 or 3 adjacent single nucleotide polymorphisms (SNPs) which are close enough on the genome to be spanned by individual reads. We applied this procedure to NGS data from three different sources: whole genome data of a Central European trio from the 1000 Genomes Project, whole genome data from laboratory-bred strains of mouse, and whole exome data from a set of patients with head and neck tumors. Thousands of loci were found in each genome where reads spanning 2 or 3 SNPs displayed more than two haplotypes, indicating that the locus is heterogeneous. We show that such loci are ubiquitous in the genome and cannot be explained by segmental duplications. We explain them on the basis of cellular heterogeneity at the genomic level. Such heterogeneous loci were found in all normal and tumor genomes examined.
Our results highlight the need for new methods to analyze genomic variation because existing ones do not systematically consider local haplotypes. Identification of cancer somatic mutations is complicated because of tumor heterogeneity. It is further complicated if, as we show, normal tissues are also heterogeneous. Methods for biomarker discovery must consider contextual haplotype information rather than just whether a variant “is present”.
In cancer biology, it is well established that histological, ploidy and genomic heterogeneity can occur within different regions of a single tumor [1, 2]. Such cellular diversity is generally assumed to be characteristic of (or caused by) tumor pathology. However, recent reports of genome mosaicism in humans have raised the possibility that such heterogeneity is physiological and can occur without any pathology. Here we report that such cellular heterogeneity at the genomic level is ubiquitous. We introduce the technique of Local Haplotyping Analysis (LHA), which shows that evidence for heterogeneity is strong and directly observable in Next Generation Sequencing (NGS) data.
Single nucleotide polymorphisms (SNPs) are typically deduced from NGS data using a statistical framework which examines the genome site by site. For example, if half of the NGS reads mapped to a particular position show a C and the other half show a T, a SNP may be “called” at this position. Software packages that implement such SNP-calling procedures, like SAMtools and the GATK, generally assume a uniform diploid genome. Therefore, in this example, a C/T heterozygous SNP would be called.
Mathematically speaking, an alternative explanation is also consistent with the data. Instead of carrying a C/T heterozygous SNP uniformly, the sequenced tissue might be heterogeneous and consist of two different cell lineages, one homozygous for C and the other homozygous for T. Direct evidence to support such an alternative hypothesis cannot be found when examining a single genomic site. Instead, combinations of sites must be examined and haplotypes must be deduced. However, published methods for haplotype assembly also assume a uniform diploid genome and simply attempt to identify the two most likely haplotypes. In this paper, we break the uniformity assumption. Instead, we examine all possible haplotypes with the explicit aim of evaluating evidence for heterogeneity in the tissue.
When two heterozygous SNPs are far apart from each other, as is usually the case, no NGS reads span both and hence haplotypes are not observable. However, when they are close enough, they may be spanned by reads (or read pairs in the case of paired-end sequencing) and haplotypes may be directly observable. A single read (pair) can, by definition, be derived only from a single haplotype out of a possibly heterogeneous mix. Therefore, listing what each read (pair) shows where it spans neighboring SNPs is a way to enumerate haplotypes that are directly observed. A region of the genome where two or more SNPs are close enough that a single read (pair) might span them is called a Block. Our Local Haplotyping Analysis (LHA) pipeline lists reads mapped to blocks to see if there is evidence for more than two haplotypes, i.e. for the proposition that the tissue sequenced is a heterogeneous mix of genetically diverse cells.
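The single-site ambiguity in the C/T example can be illustrated with a small calculation. The sketch below is only an illustration: the lineage proportions, the genotype strings and the function name `allele_fraction` are assumptions introduced here, not part of the LHA pipeline.

```python
from fractions import Fraction

def allele_fraction(lineages, allele):
    """Expected fraction of reads showing `allele` at one genomic site.

    `lineages` is a list of (proportion, genotype) pairs; a genotype is a
    two-character string such as "CT" (a toy representation assumed here).
    """
    total = Fraction(0)
    for proportion, genotype in lineages:
        share = Fraction(genotype.count(allele), len(genotype))
        total += Fraction(proportion) * share
    return total

# Explanation 1: a uniform tissue, heterozygous C/T in every cell.
uniform = [(1, "CT")]
# Explanation 2: an equal mix of two lineages, one C/C and one T/T.
mixture = [(Fraction(1, 2), "CC"), (Fraction(1, 2), "TT")]

# Both explanations predict the same allele fraction at this single site.
print(allele_fraction(uniform, "C"), allele_fraction(mixture, "C"))
```

Both calls give one half, which is why evidence that separates the two explanations has to come from combinations of sites rather than from any single site.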
Overall strategy of LHA
The starting point of the LHA pipeline is the list of SNPs called from any sequenced genome or exome. SNP calling procedures map the sequenced genome to a reference and examine positions of variation from the reference. Then they routinely apply filters to minimize calling a SNP based on variations observed due to poor base quality, poor mapping quality, nearness to a gap, strand bias in the observed variant and other bioinformatics artifacts. We use the final list of SNPs produced by such a procedure to identify blocks in the genome i.e. regions where 2 or more heterozygous SNPs fall within a 500 base region. For each block, we list all read pairs that overlap it and enumerate the local haplotype exhibited by each read pair (Figure 1). Thus starting from a set of filtered SNPs, this procedure examines the underlying read sequences to list a set of observed read-based haplotypes.
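As a rough sketch of the block-finding step, the function below groups heterozygous SNP positions on one chromosome into blocks. The greedy grouping rule (a block is anchored at its first SNP and extended while later SNPs stay within the window) and the name `find_blocks` are assumptions made for illustration; the text does not specify how overlapping candidate regions are resolved.

```python
def find_blocks(positions, window=500, min_snps=2):
    """Group heterozygous SNP positions from one chromosome into blocks.

    Assumed rule: a block starts at a SNP and is extended while each further
    SNP lies fewer than `window` bases from the block's first SNP. Groups
    with fewer than `min_snps` SNPs are not reported as blocks.
    """
    blocks = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[0] >= window:
            # The new SNP is too far from the block start: close the block.
            if len(current) >= min_snps:
                blocks.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_snps:
        blocks.append(current)
    return blocks

# Hypothetical positions: two blocks and several isolated SNPs.
print(find_blocks([100, 550, 650, 5000, 9000, 9100, 9200]))
```

With this greedy rule, the SNP at 650 is left out of any block even though it is close to 550; a sliding-window variant would treat such cases differently.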
Next, we apply several data filters to minimize calling artifactual haplotypes. We ignore reads with mapping quality less than 30 and we ignore bases whose quality score is less than 30; quality score of 30 represents 1/1000 probability of error. We also ignore any read-based haplotypes that are supported by fewer than three reads.
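The relation between a Phred-scaled quality score and error probability used above is p = 10^(-Q/10). A minimal sketch follows; the function names are hypothetical and the thresholds simply restate the filters described in the text.

```python
def phred_to_error_probability(quality):
    """Convert a Phred-scaled quality score into an error probability."""
    return 10 ** (-quality / 10)

def passes_quality_filters(mapping_quality, base_quality, threshold=30):
    """Keep an observation only if both the read's mapping quality and the
    base's quality reach the threshold (values below it are ignored)."""
    return mapping_quality >= threshold and base_quality >= threshold

# A quality of 30 corresponds to roughly a 1 in 1000 chance of error.
print(phred_to_error_probability(30))
print(passes_quality_filters(mapping_quality=60, base_quality=29))
```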
We used the LHA procedure to identify the haplotypes in three different public data sets. All are from diploid organisms and hence, observation of more than two haplotypes is prima facie evidence that the tissue sequenced is heterogeneous. The three data sets are:
CEU_TRIO: Three whole genomes recommended for benchmarking purposes, belonging to a Central European (CEU) trio (NA12878, NA12891 and NA12892) from the Thousand Genomes project, were downloaded by FTP as aligned BAM files (approximately 71× coverage, mapped to the HG19 version of the human genome) from the European Bioinformatics Institute. Without re-mapping, we called variants in each sample as explained below.
HNC_62: SRA files relating to 31 tumor and 31 matched normal tissues (62 samples in total) from patients with head and neck tumors were downloaded from the Sequence Read Archive (SRA). We extracted fastq sequences from the SRA files in this exome sequencing data, and then used BWA to map them to the HG19 reference human genome and to create a SAM alignment file. Next we used SAMtools on each sample to generate a Binary Alignment and Mapping (BAM) file, to sort it and to remove Polymerase Chain Reaction (PCR) duplicates. We then used the RealignerTargetCreator and IndelRealigner modules of the GATK version 2.1.9 to refine alignments near all indels. Finally, we called SNPs for all 62 samples as explained below.
MUR_12: We downloaded the whole genome data for pure-bred laboratory strains of mouse. We used the published BAM files (which mapped the reads to the GRCm38_68 version of the mouse genome) and called SNPs in the 12 strains.
All SNP calling was done with the UnifiedGenotyper module of the GATK version 2.1.9 using a minimum base quality threshold of 30 (‘-mbq 30’). The GATK caps the quality score of a base at its mapping quality and hence this also forces GATK to ignore any reads mapped with a quality less than 30. All samples in each data set were analyzed together but each data set was called separately. Thus, there were three separate runs of the UnifiedGenotyper: (a) the three samples in CEU_TRIO, (b) the 62 samples in HNC_62 and (c) the 12 samples in MUR_12.
Our program scanned the resulting Variant Call Format (VCF) files from each SNP-calling run to identify all blocks with 2 or 3 heterozygous SNPs within 500 bases of each other. Then, using the SAMtools application programming interface, our program read the BAM files to determine the base sequence at each SNP position for all reads overlapping any portion of the block. Reads that mapped with a quality score less than 30 were ignored. Likewise, if a read had a base with quality of less than 30 at a position, that read was considered to have skipped that position. Thus, we recorded in a file the high quality bases observed at each SNP position for every read mapping with high quality to any portion of the block.
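The per-read bookkeeping described above can be sketched on toy data. This is not the authors' program: the read records are plain dictionaries rather than BAM alignments, the helper names are hypothetical, and only reads showing a high quality base at every SNP of the block are counted as full haplotypes.

```python
from collections import Counter

MIN_QUALITY = 30  # threshold for both mapping quality and base quality
MIN_READS = 3     # minimum support for a read-based haplotype

def read_haplotype(read, snp_positions):
    """Bases a read shows at the block's SNPs, or None if the read is ignored.

    `read` is an assumed toy record: {"mapq": int, "bases": {pos: (base, qual)}}.
    A position that is uncovered or of low base quality is recorded as None.
    """
    if read["mapq"] < MIN_QUALITY:
        return None
    observed = []
    for pos in snp_positions:
        base, qual = read["bases"].get(pos, (None, 0))
        observed.append(base if base is not None and qual >= MIN_QUALITY else None)
    return tuple(observed)

def supported_haplotypes(reads, snp_positions):
    """Count full haplotypes and keep those seen in at least MIN_READS reads."""
    counts = Counter()
    for read in reads:
        hap = read_haplotype(read, snp_positions)
        if hap is not None and None not in hap:
            counts[hap] += 1
    return {hap: n for hap, n in counts.items() if n >= MIN_READS}

def make_read(mapq, base1, base2, qual2=40):
    """Build a toy read covering hypothetical SNP positions 10 and 20."""
    return {"mapq": mapq, "bases": {10: (base1, 40), 20: (base2, qual2)}}

example_reads = (
    [make_read(60, "A", "G")] * 5
    + [make_read(60, "G", "G")] * 4
    + [make_read(60, "A", "T")] * 3
    + [make_read(60, "G", "T")] * 2            # too few reads to be kept
    + [make_read(10, "G", "T")] * 5            # low mapping quality, ignored
    + [make_read(60, "G", "T", qual2=5)] * 5   # low base quality at 2nd SNP
)
example_result = supported_haplotypes(example_reads, [10, 20])
print(example_result)
```

In this toy block, three haplotypes pass all filters while G..T does not, because only two of its reads survive the quality thresholds.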
Next we clustered together the read-based haplotypes for each block using the parsimony assumption i.e. if two read-based haplotypes overlapped without contradicting each other, they were combined into a possibly longer haplotype.
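One way to read the parsimony assumption is sketched below. Partial haplotypes are tuples with None at skipped SNP positions; two of them are combined when they share at least one observed position and agree at every shared position. The greedy, order-dependent merging and all function names are assumptions for illustration, not a description of the actual clustering code.

```python
def compatible(h1, h2):
    """True if two partial haplotypes overlap and never contradict each other."""
    shared = [(a, b) for a, b in zip(h1, h2) if a is not None and b is not None]
    return bool(shared) and all(a == b for a, b in shared)

def merge(h1, h2):
    """Combine two compatible haplotypes, filling gaps in one from the other."""
    return tuple(a if a is not None else b for a, b in zip(h1, h2))

def cluster(haplotypes):
    """Greedily merge each haplotype into the first compatible cluster."""
    clusters = []
    for hap in haplotypes:
        for i, existing in enumerate(clusters):
            if compatible(existing, hap):
                clusters[i] = merge(existing, hap)
                break
        else:
            clusters.append(hap)
    return clusters

# Three partial observations over a hypothetical 3-SNP block.
print(cluster([("A", "G", None), (None, "G", "T"), ("C", "T", None)]))
```

A partial haplotype that is compatible with more than one cluster is assigned here to the first one found, which is one reason blocks with many SNPs call for a more careful statistical treatment.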
All called SNPs were annotated using ANNOVAR to determine their overlaps with genes, exons and segmental duplications.
An illustrative example
Figure 1 shows a block of 2 SNPs from CEU_TRIO member NA12878. This block is on chromosome 3, overlapping the gene EPHA6. Given that the first SNP is heterozygous A/G and the second is G/T, there are four possible haplotypes i.e. A..G, A..T, G..G and G..T. (Theoretically, other haplotypes are also possible if a read has a base other than A or G at first SNP and/or other than G or T at the second. However, such instances are negligibly rare). If NA12878 were a uniform diploid genome, the data is expected to show two of these four haplotypes. However, examining the reads that span both these SNPs, we find evidence for three of the four haplotypes indicating that multiple cell types are present in the NA12878 sample.
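The set of possible haplotypes for a block is the Cartesian product of the alleles at its SNPs. A minimal sketch, using the two SNPs of the Figure 1 example and a hypothetical helper name:

```python
from itertools import product

def possible_haplotypes(snp_alleles):
    """List every combination of alleles across the SNPs of a block."""
    return ["..".join(combo) for combo in product(*snp_alleles)]

# First SNP is A/G, second SNP is G/T, as in the example block.
print(possible_haplotypes([("A", "G"), ("G", "T")]))
# ['A..G', 'A..T', 'G..G', 'G..T']
```

A uniform diploid genome can contribute at most two of these combinations, while a block of three biallelic SNPs has eight possible combinations.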
It might be tempting to ignore the least populous haplotype i.e. to dismiss all the A..T reads as artifacts of erroneous mapping or sequencing. Note that, if this is done, the second SNP would not be called since only G would be seen mapped to that position. And the first SNP would also have lowered significance (and might not even be called) because of the removal of 7 out of 20 A’s from this position.
2-SNP blocks of CEU_TRIO
Table 1. Frequencies of haplotypes directly observed by LHA in 2-SNP blocks from the CEU_TRIO (whole genome) and HNC_62 (whole exome; aggregated into normal and tumor tissues), by number of haplotypes in the block
The largest proportion of the blocks (about 40%) show two haplotypes, complying with the expectation from a uniform diploid genomic sample. For about a third of the blocks no haplotypes can be deduced, and in another 20% of the blocks reads spanning both SNPs show only one haplotype. Since both SNPs in the block were heterozygous, we should always expect to see two haplotypes. Therefore these 0- and 1-haplotype blocks illustrate the fact that, for nearly half the blocks, read depth is not sufficient to reveal all underlying haplotypes using the conservative LHA procedure. So, with depth higher than 71×, these blocks should show 2 haplotypes and some of them may show more.
About 2% of the blocks in each sample show 3 or more haplotypes. Since this data is from a normal human tissue with a diploid genome (at most 2 haplotypes expected), this observation of more than 2 haplotypes is prima facie evidence that the underlying genome is non-uniform or heterogeneous at these loci. Though it is only 2% of the blocks, this evidence cannot be ignored because (i) this still amounts to more than four thousand blocks (genomic loci) in each sample and (ii) this is a conservative estimate or lower bound of the number of loci showing heterogeneity.
2-SNP blocks for HNC_62 exome data
The last two rows of Table 1 show haplotype frequencies for HNC_62 whole exome data, aggregated for normal and tumor samples. Comparable to the number in CEU_TRIO, here also about 1.5% of the blocks display heterogeneity (have more than 2 haplotypes). These observations are also valid at a per-sample level (Additional file 1: Table S1).
2-SNP blocks for MUR_12
Presuming that inbred laboratory strains are homozygous, Keane et al. analyzed MUR_12 genomes with a pipeline that assumed homozygosity. Specifically, they set the prior probability of heterozygosity to be a hundredfold lower than the default, with the result that they called very few heterozygous variants; their pipeline called about 6 million variants in each strain, of which only about 6 thousand were heterozygous (Additional file 2: Table S2). Using the authors’ variant list, no blocks were found, i.e. heterozygous SNPs were rare enough that no two of them were within 500 bases of each other.
We independently re-analyzed Keane et al.’s BAM files and called SNPs without requiring that all SNPs be homozygous, i.e. we used the default value for the prior probability of heterozygosity. This resulted in up to 15% of the called SNPs being heterozygous (Additional file 2: Table S2).
2-SNP blocks from 12 inbred mouse strains, divided into blocks with 0, 1, 2, 3 or 4 different haplotypes directly observed using LHA
Blocks with 3 SNPs each
Frequencies of haplotypes directly observed by LHA in 3-SNP blocks from the CEU_TRIO and HNC_62 (aggregated into normal and tumor tissues), by number of haplotypes in the block
In each sample, more than 4% of the 3-SNP blocks display 3 or more haplotypes, indicating directly observable heterogeneity. This pattern is preserved at a per sample level for HNC_62 (Additional file 3: Table S3).
3-SNP blocks in MUR_12 genomes of inbred mouse strains, divided into blocks with 0, 1, …, 8 different haplotypes directly observed using LHA
Blocks with 4 or more SNPs each
In all three data sets, we also found blocks with 4 or more SNPs. However, analysis of such blocks is complicated by the presence of partial haplotypes (Figure 2) and by the generally lower mapping scores assigned to reads with multiple mismatches. We are formulating a statistical framework more robust than parsimonious clustering for properly analyzing such blocks.
Where do the blocks occur?
Blocks can be classified into two categories: (i) Homogeneous blocks display 2 (or fewer) haplotypes and are not inconsistent with genomic homogeneity. (ii) Heterogeneous blocks display 3 (or more) haplotypes and are inconsistent with genomic homogeneity, i.e. they cannot be explained without resorting to genomic heterogeneity.
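The two categories can be written as a simple rule on the number of read-supported haplotypes in a block. The function names and the example counts below are hypothetical.

```python
def classify_block(n_haplotypes, ploidy=2):
    """Label a block by the number of read-supported haplotypes it displays."""
    return "heterogeneous" if n_haplotypes > ploidy else "homogeneous"

def heterogeneous_fraction(haplotype_counts):
    """Fraction of blocks that display more haplotypes than the ploidy allows."""
    labels = [classify_block(n) for n in haplotype_counts]
    return labels.count("heterogeneous") / len(labels)

# Hypothetical haplotype counts for five blocks.
print(heterogeneous_fraction([0, 1, 2, 2, 3]))
```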
We used local haplotyping analysis to examine sequencing reads that span 2 or 3 adjacent heterozygous SNPs. If the sequenced tissue has a uniform genome, sequencing reads in a block should only display two haplotypes in a diploid organism. Instead we found thousands of blocks where mapped sequencing read sets support three or more haplotypes. Evidence for heterogeneity of the underlying genome was directly observable in ordinary NGS data obtained from normal tissue of healthy individuals, as well as from normal and tumor tissues in patients with head and neck tumors.
Different blocks show different numbers of haplotypes. This is to be expected if different regions of the genome have different propensities for heterogeneity. Thus, LHA could provide a way to map heterogeneity hot spots in the genome. The nature and location of such hot spots might have important implications for predilection to disease. Also, it is noteworthy that this analysis was superimposed on SNPs that were already called in the traditional way. The bioinformatics procedures used for calling SNPs have their own assumptions (including that the underlying genome is diploid) which may unduly constrain regions marked as potentially heterogeneous. What is needed is a way to call haplotypes from raw NGS data without depending on called SNPs (manuscript in preparation).
LHA observed haplotypes are all local to within a block. We could consider global haplotypes at the whole genome level and ask: how many different genome-wide haplotypes exist in a sample? Mathematically, the block with the largest number of haplotypes provides a lower bound for whole genomic heterogeneity. Biologically, we must mitigate this estimate because of the possibility of sequencing or mapping errors and because some genes might be highly diverse i.e. not representative of overall genomic diversity.
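The lower bound follows because each genome-wide haplotype contributes exactly one local haplotype to any given block, so a block showing k local haplotypes requires at least k genome-wide haplotypes. A minimal sketch with hypothetical block identifiers and counts:

```python
def genome_haplotype_lower_bound(block_haplotype_counts):
    """Smallest number of genome-wide haplotypes consistent with the blocks.

    `block_haplotype_counts` maps a block identifier to the number of local
    haplotypes observed in that block (identifiers here are made up).
    """
    return max(block_haplotype_counts.values(), default=0)

counts = {"chr3:block1": 3, "chr7:block9": 2, "chr12:block4": 4}
print(genome_haplotype_lower_bound(counts))
```

This is a purely mathematical bound; as noted above, it should be tempered by the possibility of sequencing or mapping errors in the block that sets the maximum.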
It is important to consider at least two alternative explanations for observing 3 or more local haplotypes before we conclude that heterogeneity is real.
Sequencing or mapping error?
The first alternative explanation involves sequencing or mapping error. In order to minimize this type of error, we instituted four filters: (i) we first called SNPs using the GATK UnifiedGenotyper so that we are only considering heterogeneity around SNPs that pass the thresholds for strand bias, nearness to gaps and other bioinformatics artifacts, (ii) we ignored sequence bases with Phred quality scores less than 30, (iii) we ignored all sequence reads with a mapping quality score less than 30, and (iv) we only considered read haplotypes supported by three or more reads (after the above filters were applied). Thus the SNP calling software has passed these SNPs, the base-calling software has assigned a probability of sequencing error of at most 1/1000, the mapping software has assigned a probability of incorrect placement of at most 1/1000, and we have at least three such observations for each read-based haplotype. As seen in Figure 2, the number of sequencing reads that must be ignored in order to assume a uniform diploid genome is often a large proportion of the mapped reads. Doing so deletes variants called at many positions, calling into question many basic conclusions from a sequencing experiment.
Though it is formally not possible to address experimental error within the regime of the same experiment, these filters serve to remove the least confident portions of our results.
Could it be due to segmental duplication?
It is possible that the regions of heterogeneity we are observing have multiple copies in the genome with subtle differences. In other words, the explanation could be that there are heterogeneous copies of a genomic locus rather than heterogeneity at a single locus. One way to examine this possibility is to see if heterogeneous blocks map mostly to known segmental duplications in the genome. We found that more than 90% of our heterogeneous blocks are outside of any regions known to be duplicated.
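The overlap check against known segmental duplications can be sketched as an interval test. The coordinates below are made up, intervals are treated as closed and on a single chromosome, and the function names are hypothetical; in our analysis the overlaps came from the annotation step rather than from code like this.

```python
def overlaps_any(block, intervals):
    """True if the closed interval `block` overlaps any interval in the list."""
    start, end = block
    return any(start <= i_end and i_start <= end for i_start, i_end in intervals)

def fraction_outside(blocks, duplications):
    """Fraction of blocks that overlap no known duplicated region."""
    outside = [b for b in blocks if not overlaps_any(b, duplications)]
    return len(outside) / len(blocks)

# Hypothetical heterogeneous blocks and one annotated duplication.
blocks = [(100, 300), (1000, 1200), (5000, 5400), (9000, 9100)]
duplications = [(1100, 2000)]
print(fraction_outside(blocks, duplications))
```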
Shedding more light on the duplication issue might need longer reads and/or much greater depth of coverage. Getting longer reads awaits technological improvements in sequencing. However, greater depth is feasible and we are currently obtaining very deep sequencing. For this report, since more than 90% of the heterogeneous blocks do not overlap known segmental duplications, duplication is unlikely to be the complete explanation for the observed heterogeneity.
Ways to experimentally validate heterogeneity
The most direct way to observe genome mosaicism is through single cell sequencing of many different cells from the same tissue. Such technologies are still not broadly available in the market, but preliminary results suggest that genomic heterogeneity is real. Our analysis has shown that, even without the availability of single-cell sequencing technology, we can determine heterogeneity based on ordinary NGS data.
Another, somewhat indirect, way to validate heterogeneity is to see if similar conclusions are drawn when sequencing the same sample with a different technology. Recently Life Technologies sequenced the exome of the CEU_TRIO using their Ion Torrent methodology and made this sequence available on their public server. As partial validation we note that the heterogeneity of some of our exonic blocks is also observed in this data set (unpublished observations).
It is worth noting that Sanger sequencing, typically the “gold standard” for validating individual SNPs, is not likely to be useful for validating haplotypes. Even though Sanger reads are typically longer than NGS reads, they are averaged over a pool of genomic DNA from the tissue. Thus each SNP in the block will be seen as an ambiguous base, and information about which bases at individual SNPs combine to form a haplotype is typically not forthcoming.
Given that single cell sequencing also appears [20, 21] to indicate heterogeneity in the normal genome, LHA-derived heterogeneity seems to have a basis in fact. Further, its ability to determine heterogeneity from ordinary NGS data can be put to powerful use in analyzing existing data.
Our procedure shows haplotypes at a local level in the genome. To observe similar combinations of SNPs that are far apart from each other might not be possible without single cell sequencing. However, statistical feature allocation methods could indirectly infer such mosaic haplotypes over non-local SNPs or even SNPs on different chromosomes. One such method (Lee J, Muller P, Ji Y and Gulukota K, manuscript submitted) models haplotypes between non-local SNPs using a statistical technique called the Indian Buffet Process. At one SNP, the alternate allele might be observed in 10% of the reads and in 75% of the reads at another. Our Indian Buffet Process analyzes such variable minor allele frequencies to assign SNPs to imputed subclones and to model possible global haplotypes.
Local haplotyping analysis can provide directly observable evidence for heterogeneity and mosaicism using ordinary, though relatively deep, NGS data. Analyzing NGS data from three independent sources, we report that such heterogeneity is ubiquitous.
If genomes of normal tissues are heterogeneous at a large number of loci, the operational ramifications are quite dramatic. For example, the definition of cancer somatic mutations might have to be altered because the germline is not uniquely defined. It might be important to periodically re-analyze a patient’s genome, if accumulation of replication errors over a lifetime leads to increased heterogeneity. Finally, in searching for genetic biomarkers, it might be important to consider not just genomic variants but also the heterogeneity context around them. New software will be needed for such analysis since existing software ignores this context.
GATK: Genome analysis tool kit
LHA: Local haplotyping analysis
NGS: Next generation sequencing
SAM: Sequence Alignment/Map
SNP: Single nucleotide polymorphism
SRA: Sequence read archive
VCF: Variant call format.
We thank Yuan Ji and Stefan Green for a number of stimulating discussions. DLH is supported by a grant to KG from a local philanthropy which wishes to remain anonymous. Institutional support from NorthShore University HealthSystem is gratefully acknowledged.
- Yachida S, Jones S, Bozic I, Antal T, Leary R, Fu B, Kamiyama M, Hruban RH, Eshleman JR, Nowak MA, Velculescu VE, Kinzler KW, Vogelstein B, Iacobuzio-Donahue CA: Distant Metastasis Occurs Late during the Genetic Evolution of Pancreatic Cancer. Nature. 2010, 467 (7319): 1114-1117. 10.1038/nature09515.
- Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, Martinez P, Matthews N, Stewart A, Tarpey P, Varela I, Phillimore B, Begum S, McDonald NQ, Butler A, Jones D, Raine K, Latimer C, Santos CR, Nohadani M, Eklund AC, Spencer-Dene B, Clark G, Pickering L, Stamp G, Gore M, Szallasi Z, Downward J, Futreal PA, Swanton C: Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med. 2012, 366 (10): 883-892. 10.1056/NEJMoa1113205.
- Hanahan D, Weinberg RA: Hallmarks of cancer: the next generation. Cell. 2011, 144 (5): 646-674. 10.1016/j.cell.2011.02.013.
- Lupski JR: Genome Mosaicism – One human, multiple genomes. Science. 2013, 341: 358-359. 10.1126/science.1239503.
- Li H: A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011, 27 (21): 2987-2993. 10.1093/bioinformatics/btr509.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20 (9): 1297-1303. 10.1101/gr.107524.110.
- He D, Choi A, Pipatsrisawat K, Darwiche A, Eskin E: Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics. 2010, 26 (ISMB): i183-i190. 10.1093/bioinformatics/btq215.
- Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative Genomics Viewer. Nat Biotechnol. 2011, 29: 24-26. 10.1038/nbt.1754.
- Van der Auwera G: Which datasets should I use for reviewing or benchmarking purposes?. http://gatkforums.broadinstitute.org/discussion/1292/
- Clarke L, Zheng-Bradley X, Smith R, Kulesha E, Xiao C, Toneva I, Vaughan B, Preuss D, Leinonen R, Shumway M, Sherry S, Flicek P, The 1000 Genomes Project Consortium: The 1000 Genomes Project: data management and community access. Nat Methods. 2012, 9 (5): 459-462. 10.1038/nmeth.1974.
- 1000 Genomes FTP Server. ftp://ftp.1000genomes.ebi.ac.uk
- Stransky N, Egloff AM, Tward AD, Kostic AD, Cibulskis K, Sivachenko A, Kryukov GV, Lawrence MS, Sougnez C, McKenna A, Shefler E, Ramos AH, Stojanov P, Carter SL, Voet D, Cortés ML, Auclair D, Berger MF, Saksena G, Guiducci C, Onofrio RC, Parkin M, Romkes M, Weissfeld JL, Seethala RR, Wang L, Rangel-Escareño C, Fernandez-Lopez JC, Hidalgo-Miranda A, Melendez-Zajgla J, Winckler W, Ardlie K, Gabriel SB, Meyerson M, Lander ES, Getz G, Golub TR, Garraway LA, Grandis JR: The mutational landscape of head and neck squamous cell carcinoma. Science. 2011, 333 (6046): 1157-1160. 10.1126/science.1208130.
- SRA: Sequence Read Archive. http://www.ncbi.nlm.nih.gov/sra
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
- Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, Heger A, Agam A, Slater G, Goodson M, Furlotte NA, Eskin E, Nellåker C, Whitley H, Cleak J, Janowitz D, Hernandez-Pliego P, Edwards A, Belgard TG, Oliver PL, McIntyre RE, Bhomra A, Nicod J, Gan X, Yuan W, van der Weyden L, Steward CA, Bala S, Stalker J, Mott R, Durbin R, Jackson IJ, Czechanski A, Guerra-Assunção JA, Donahue LR, Reinholdt LG, Payseur BA, Ponting CP, Birney E, Flint J, Adams DJ: Mouse genomic variation and its effect on phenotypes and gene regulation. Nature. 2011, 477 (7364): 289-294. 10.1038/nature10413.
- API Documentation. http://samtools.sourceforge.net/samtools/masterTOC.shtml
- Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010, 38 (16): e164. 10.1093/nar/gkq603.
- Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: an Information Aesthetic for Comparative Genomics. Genome Res. 2009, 19: 1639-1645. 10.1101/gr.092759.109.
- Lasken RS: Single-cell sequencing in its prime. Nat Biotechnol. 2013, 31: 211-212. 10.1038/nbt.2523.
- Pennisi E: The biology of genomes. Single-cell sequencing tackles basic and biomedical questions. Science. 2012, 336 (6084): 976-977. 10.1126/science.336.6084.976.
- Mendel Demo Torrent Server. http://mendel.iontorrent.com
- Li X, Buckton AJ, Wilkinson SL, John S, Walsh R, Novotny T, Valaskova I, Gupta M, Game L, Barton PJ, Cook SA, Ware JS: Towards clinical molecular diagnosis of inherited cardiac conditions: a comparison of bench-top genome DNA sequencers. PLoS One. 2013, 8 (7): e67744. 10.1371/journal.pone.0067744.
- Broderick T, Jordan MI, Pitman J: Beta processes, stick-breaking, and power laws. Bayesian Analysis. 2012, 7 (2): 439-476.
- Griffiths TL, Ghahramani Z: Infinite latent feature models and the Indian buffet process. Advances in Neural Information Processing Systems 18. 2006, Cambridge, MA: MIT Press
- Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, Jia M, Shepherd R, Leung K, Menzies A, Teague JW, Campbell PJ, Stratton MR, Futreal PA: COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 2011, 39 (Database issue): D945-950.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.