DNA copy number, including telomeres and mitochondria, assayed using next-generation sequencing
© Castle et al; licensee BioMed Central Ltd. 2010
Received: 15 October 2009
Accepted: 16 April 2010
Published: 16 April 2010
DNA copy number variations occur within populations and aberrations can cause disease. We sought to develop an improved lab-automatable, cost-efficient, accurate platform to profile DNA copy number.
We developed a sequencing-based assay of nuclear, mitochondrial, and telomeric DNA copy number that draws on the unbiased nature of next-generation sequencing and incorporates techniques developed for RNA expression profiling. To demonstrate this platform, we assayed UMC-11 cells using 5 million 33 nt reads and found tremendous copy number variation, including regions of single and homogeneous deletions and amplifications to 29 copies; 5 times more mitochondria and 4 times less telomeric sequence than a pool of non-diseased, blood-derived DNA; and that UMC-11 was derived from a male individual.
The described assay outputs absolute copy number, outputs an error estimate (p-value), and is more accurate than array-based platforms at high copy number. The platform enables profiling of mitochondrial levels and telomeric length. The assay is lab-automatable and has a genomic resolution and cost that are tunable based on the number of sequence reads.
DNA copy number variations occur within populations and aberrations can cause tumors, be used for drug target identification, and be used as biomarkers of tumor drug response. EGFR (epidermal growth factor receptor) amplification, for instance, is a marker for gefitinib treatment  and TYMS (thymidylate synthase) amplification conveys 5-fluorouracil resistance in colon tumors .
The next-generation sequence enables generation of millions of short sequence tags in a single experiment. Using DNA as an input, the technology has been used to resequence entire genomes, including from normal individuals  and from cancerous cells , and to resequence targeted genomic regions, such as resequencing protein coding regions to discover somatic mutations . Alternatively, using RNA as an input, the technology has been used to profile RNA expression levels, where the number of sequence reads "tagging" an RNA transcript is a measure of its expression .
Combining these ideas and building on previous methods [7–10], this report describes the development of a sequencing-based platform to profile "expression" of DNA, inputting DNA and analyzing the resultant data using algorithms previously established for gene expression profiling. The platform resolution and cost are tunable; is lab automatable; can be used for both discovery of novel and profiling of known DNA copy number variations; outputs copy number and uncertainty as opposed to ratios; and can be used to profile mitochondria and telomeric DNA.
Application 1: Assaying nuclear genomic copy number
We assayed the genomic copy number in a pool of DNA derived from blood from non-diseased males; a pool of DNA derived from blood from non-diseased females; and DNA from UMC-11 cells, a lung carcinoid-derived cell line. We generated and sequenced each library, aligned resultant reads to the genome, and selected reads aligning to only one genomic location (Methods).
Conversely, UMC-11 cells show dramatic copy number variations. Before normalization, the number of reads mapping to each block shows many discrete levels, including blocks with no reads and blocks with roughly 75, 150, and 225 reads (Figure 1A, red). After normalization, these discrete levels map to copy number of zero, one, two, three, and four (Figure 1C). Intriguingly, the homozygously deleted region spanning from 21 to 24 Mb includes the putative tumor suppressor CDKN2A  (Figure 1D).
Validation of sequencing copy number assayed by qPCR.
Application 2: Assaying genetic loci copy number
Conceptually, one can define biological elements in terms of gene loci rather than genomic blocks, and thus, by counting the number of reads uniquely mapping to each locus, generate a gene-copy number table. Here, we defined locus coordinates as the greater of the transcript start-to-stop span or 60 kb centered on the loci. As before, we normalized the counts to the reference male pool, assuming the male pool is diploid across autosomes and haploid across allosomes.
Similarly, biological elements can be defined in terms of the coordinates of known DNA copy number polymorphisms . By counting reads aligning within the coordinates of each established polymorphism, one can monitor population-variable copy number polymorphisms (not shown).
Application 3: Mitochondrial copy number
Application 4: Telomere copy number
Finally, another fascinating biological element that can be interrogated is telomeric sequence. Telomeres protect the ends of chromosomes; are on average shorter in cells that have undergone many divisions, such as older cells, tumors, and cell lines; and comprise repetitive TTAGGG motifs . We counted and normalized the number of sequence reads containing (TTAGGG)4. Strikingly, the UMC-11 cells do contain significantly fewer telomere-associated reads than either the female or male pools (Figure 6B).
Discussion and Conclusions
Assaying DNA copy number by next-generation sequencing is robust and accurate. The method described here requires a simple genomic DNA library construction; returns integer copy number values for homogeneous cells; and has a large dynamic range. The platform is unbiased in the sense that genomic targets are not preselected, such as is the case with qPCR and microarrays, and thus, given a new genome assembly, a new set of copy number polymorphisms, or a new set of biological DNA elements, the sequence reads maintain utility through re-alignment.
Our research builds on previous efforts. The 'digital karyotyping' protocol uses restriction enzymes and SAGE sequencing technology to generate reads that have been used to measure copy number variation and detect infectious viral DNA[17, 18]. Using Illumina deep-sequencing, fetal aneuploidy was assayed, identifying Down, Edward, and Patau syndromes based on chromosome-specific trisomy . Illumina deep-sequencing has been used to examine copy number across nuclear chromosomes. Campbell et al used paired-end reads from size-selected libraries to identify genomic structural rearrangements and, integrating estimates of copy number with paired-end reads mapping to distal locations, were able to identify breakpoint coordinates and novel DNA sutures. Recently, Chiang et al sequenced a size-selected library and measured the log-ratio change between normal and tumor sample pairs across nuclear chromosomes. They elegantly show trade-off curves between read number, copy number change, and genomic resolution and show statistical determination of breakpoints. In comparing their results to array-based methods, they find the sequencing-based platform has a larger dynamic range. Yoon et al.  and Alkan et al.  developed similar methods for use in human resequencing projects, using over 1 billion reads and 30× genome coverage to identify copy number polymorphisms in disease-free cells.
The method described here expands on these groundbreaking studies in several ways. First, our library construction does not include a size selection and is thus lab automatable. Second, by using a single diploid reference sample along with a novel normalization algorithm, our method removes biases inherent in molecular biology protocols and outputs absolute copy number in addition to log-ratio values. Third, we defined an uncertainty that allows us to estimate upper and lower bounds and p-values for each copy number measurement, both for absolute and relative measurements. Forth, by defining biological elements as not only nuclear DNA blocks positioned evenly across the nuclear genome, this platform enables assaying other biologically meaningful DNA elements, including gene loci, known copy number polymorphisms, mitochondria, and telomeres.
Finally, these results were generated using an Illumina Genome Analyzer II instrument in June, 2008, with one sample per lane, resulting in only 3 to 7 million 36 nt reads per sample. Sequencing instrumentation continues to improve, allowing more reads at lower costs. As the resolution of this assay is inversely proportional to the number of aligned reads, sample multiplexing and increasing numbers of sequence reads will enable increased resolution and/or significantly decreased costs.
Proof of concept of the sequencing-based DNA copy number profiling assay
To develop and evaluate a sequencing based platform for assaying CNV, we prepared genomic DNA libraries from a pool of DNA derived from blood from non-diseased males; a pool of DNA derived from blood from non-diseased females; and DNA from the UMC-11 cell line, a lung carcinoid-derived cell line. DNA was fragmented using DNase. We did not test whether DNase can shear heterochromatin sequence such as in centromeres; however, we used only sequence reads that align uniquely to the genome and thus would not use sequence reads from regions of low sequence-complexity. The ends of the fragmented DNA were filled in or cleaved to produce blunt ends. This blunted DNA was ligated to adapters containing reverse complements on like adapters for suppression PCR as well as priming sites for a set of universal PCR primers and the Illumina sequencing primer. These "DNA Libraries" were then amplified using standard PCR conditions. We did not use size-selection after fragmentation, allowing lab-automation of library construction.
Sequencing reads (33 nt) and alignments.
Aligned to genome
The resultant sequence reads were computationally aligned to the genome using the algorithm BWA . We used only reads aligning to the genome with the highest (least ambiguous) score, minimizing incorrect mappings such as those from the female pool mapping to the Y chromosome. A shortcoming of the use of uniquely aligning reads is that reads aligning to sequence-identical segmental duplications within the HG18 genome will be discarded and these regions not monitored. Between 92% and 93% of the reads aligned to the genome, with between 63% and 65% aligning unambiguously (Table 2). For each unambiguously aligned read, we recorded the chromosome and 5' coordinate.
Only 26% of the 20 nt tiles align uniquely, increasing to 65% at 25 nt, and to 70% at 30 nt, and smoothly increasing to 90% at 200 nt. A fundamental shift occurs between 20 and 25 nt: genome uniqueness changes from majority ambiguous to majority unambiguous. The 33 nt length (36 nt minus the 3 nt barcode) used in the study here represents an effective trade-off between sequencing costs (time and money) and read uniqueness.
Counting & uncertainty
We selected biological elements for examination, such as DNA windows. For each window, we counted the number of reads aligning within the window. We compared the number to that found in a diploid reference sample. We used the DNA pooled from non-diseased males as a reference as the DNA should be diploid across autosomes and represent all chromosomes, including chromosome Y.
With this error model, a block with 100 reads would have an uncertainty of ± 10. This error model reflects sampling uncertainty but neither biological or operator variability.
We used a uniform window size across the genome; however, a variable window size could be selected. The selected window size should reflect the total number of uniquely aligned reads available, the desired sensitivity/significance, and the sample cell-to-cell genomic heterogeneity. Here, we assumed that the cellular DNA was homogeneous and wished to have the power to distinguish copy number 3 from 2 at a p-value better than 0.001 assuming the described error model. This equates to 110 reads per genomic block. As we had 3.3 million reads from the female blood pool, of which 2.1 million uniquely aligned to the genome, and given the human genome of 3.1 billion base pairs, this results in a block size of 164 kb. We rounded to 150 kb blocks.
We normalized reads to account for the differing numbers of reads per sample and to determine absolute copy number. We found that the relative number of reads mapping to a DNA block depended on the molecular biology protocol used, thus necessitating use of a empirical normalization rather than simply a measure such as GC content. We used a male pool as a reference, and with this normalization were able to derive the copy number, upper and lower bounds, and the significance of any variation.
The distribution of UMC-11 to male ratios after normalization by the number of reads shows multiple peaks, none of which are centered at an integer (Figure 8, middle right). We tried using the ratios to determine the normalization: using the ratio mean or the median failed to center a distribution peak at an integer. However, we were able to normalize the counts based on the mode of the ratio distribution across autosomes. After normalization by the ratio model, we find that the three peaks in the ratio distribution from the UMC-11 cell line are regions with one, two, and three copy number (Figure 8, lower right).
Comparisons between samples
To compare CNV between samples, we calculated the significance of a difference using a t-test and p-value based on the normalized counts and normalized uncertainties. For each DNA element, the assay thus returns the absolute copy number, uncertainty, upper and lower bounds, and a p-value representing the significance of a measured difference.
Validation with qPCR
We used qPCR (Taqman) to assay DNA copy number in UMC-11 cells. TaqMan primer-probe reagents were obtained through the Applied Biosystems Assays-by-Design custom assay service (Foster City, CA) and designed to fall outside of repeat regions.
We thank David Haynor, Tomas Babak, Xinwei She, and John Lamb for suggestions, Melissa Cline for help loading tracks into the UCSC Genome Browser, and two anonymous reviewers for comments. All authors were employees of Rosetta Inpharmatics, which was a part of Merck & Co., Inc. Merck approved release of this manuscript.
- Hirsch FR, Witta S: Biomarkers for prediction of sensitivity to EGFR inhibitors in non-small cell lung cancer. Curr Opin Oncol. 2005, 17 (2): 118-122. 10.1097/01.cco.0000155059.39733.9d.PubMedView ArticleGoogle Scholar
- Wang TL, Diaz LA, Romans K, Bardelli A, Saha S, Galizia G, Choti M, Donehower R, Parmigiani G, Shih Ie M: Digital karyotyping identifies thymidylate synthase amplification as a mechanism of resistance to 5-fluorouracil in metastatic colorectal cancer patients. Proc Natl Acad Sci USA. 2004, 101 (9): 3089-3094. 10.1073/pnas.0308716101.PubMed CentralPubMedView ArticleGoogle Scholar
- Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J: The diploid genome sequence of an Asian individual. Nature. 2008, 456 (7218): 60-65. 10.1038/nature07484.PubMed CentralPubMedView ArticleGoogle Scholar
- Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford-Shore BH, McGrath S, Hickenbotham M: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008, 456 (7218): 66-72. 10.1038/nature07485.PubMed CentralPubMedView ArticleGoogle Scholar
- Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009, 27 (2): 182-189. 10.1038/nbt.1523.PubMed CentralPubMedView ArticleGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
- Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet. 2008, 40 (6): 722-729. 10.1038/ng.128.PubMed CentralPubMedView ArticleGoogle Scholar
- Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods. 2009, 6 (1): 99-103. 10.1038/nmeth.1276.PubMed CentralPubMedView ArticleGoogle Scholar
- Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O: Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet. 2009, 41 (10): 1061-1067. 10.1038/ng.437.PubMed CentralPubMedView ArticleGoogle Scholar
- Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19 (9): 1586-1592. 10.1101/gr.092981.109.PubMed CentralPubMedView ArticleGoogle Scholar
- Ruas M, Peters G: The p16INK4a/CDKN2A tumor suppressor and its relatives. Biochim Biophys Acta. 1998, 1378 (2): F115-177.PubMedGoogle Scholar
- Weinberg RA: Fewer and fewer oncogenes. Cell. 1982, 30 (1): 3-4. 10.1016/0092-8674(82)90003-4.PubMedView ArticleGoogle Scholar
- Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S: PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics. 2010, 11 (1): 164-175. 10.1093/biostatistics/kxp045.PubMed CentralPubMedView ArticleGoogle Scholar
- Cooper GM, Zerr T, Kidd JM, Eichler EE, Nickerson DA: Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet. 2008, 40 (10): 1199-1203. 10.1038/ng.236.PubMed CentralPubMedView ArticleGoogle Scholar
- Bogenhagen D, Clayton DA: The number of mitochondrial deoxyribonucleic acid genomes in mouse L and human HeLa cells. Quantitative isolation of mitochondrial deoxyribonucleic acid. J Biol Chem. 1974, 249 (24): 7991-7995.PubMedGoogle Scholar
- Hastie ND, Dempster M, Dunlop MG, Thompson AM, Green DK, Allshire RC: Telomere reduction in human colorectal carcinoma and with ageing. Nature. 1990, 346 (6287): 866-868. 10.1038/346866a0.PubMedView ArticleGoogle Scholar
- Wang TL, Maierhofer C, Speicher MR, Lengauer C, Vogelstein B, Kinzler KW, Velculescu VE: Digital karyotyping. Proc Natl Acad Sci USA. 2002, 99 (25): 16156-16161. 10.1073/pnas.202610899.PubMed CentralPubMedView ArticleGoogle Scholar
- Leary RJ, Cummins J, Wang TL, Velculescu VE: Digital karyotyping. Nat Protoc. 2007, 2 (8): 1973-1986. 10.1038/nprot.2007.276.PubMedView ArticleGoogle Scholar
- Fan HC, Blumenfeld YJ, Chitkara U, Hudgins L, Quake SR: Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood. Proc Natl Acad Sci USA. 2008, 105 (42): 16266-16271. 10.1073/pnas.0808319105.PubMed CentralPubMedView ArticleGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.