Experimental analysis of oligonucleotide microarray design criteria to detect deletions by comparative genomic hybridization
© Flibotte and Moerman; licensee BioMed Central Ltd. 2008
Received: 21 July 2008
Accepted: 21 October 2008
Published: 21 October 2008
Microarray comparative genomic hybridization (CGH) is currently one of the most powerful techniques to measure DNA copy number in large genomes. In humans, microarray CGH is widely used to assess copy number variants in healthy individuals and copy number aberrations associated with various diseases, syndromes and disease susceptibility. In model organisms such as Caenorhabditis elegans (C. elegans) the technique has been applied to detect mutations, primarily deletions, in strains of interest. Although various constraints on oligonucleotide properties have been suggested to minimize non-specific hybridization and improve the data quality, there have been few experimental validations for CGH experiments. For genomic regions where strict design filters would limit the coverage it would also be useful to quantify the expected loss in data quality associated with relaxed design criteria.
We have quantified the effects of filtering various oligonucleotide properties by measuring the resolving power for detecting deletions in the human and C. elegans genomes using NimbleGen microarrays. Approximately twice as many oligonucleotides are typically required to be affected by a deletion in human DNA samples in order to achieve the same statistical confidence as one would observe for a deletion in C. elegans. Surprisingly, the ability to detect deletions strongly depends on the oligonucleotide 15-mer count, which is defined as the sum of the genomic frequency of all the constituent 15-mers within the oligonucleotide. A similarity level above 80% to non-target sequences over the length of the probe produces significant cross-hybridization. We recommend the use of a fairly large melting temperature window of up to 10°C, the elimination of repeat sequences, the elimination of homopolymers longer than 5 nucleotides, and a threshold of -1 kcal/mol on the oligonucleotide self-folding energy. We observed very little difference in data quality when varying the oligonucleotide length between 50 and 70, and even when using an isothermal design strategy.
We have determined experimentally the effects of varying several key oligonucleotide microarray design criteria for detection of deletions in C. elegans and humans with NimbleGen's CGH technology. Our oligonucleotide design recommendations should be applicable for CGH analysis in most species.
In human health research, microarray comparative genomic hybridization (CGH) has become a powerful technique to investigate DNA copy number variants (CNVs) in healthy subjects [1, 2] and genomic aberrations associated with various diseases and syndromes [3, 4]. Furthermore, CGH is now frequently used to analyze the genome of strains of interest in various model organisms [5, 6]. On some oligonucleotide microarray platforms individual researchers can design their own specialized microarrays for very specific experiments. The only crucial requirement before starting to design an array is access to a sequenced reference genome for the species under investigation. The first task facing a biologist designing a CGH microarray is to define criteria that eliminate oligonucleotides with properties expected to reduce the data quality. Some design criteria have been suggested and used for several years with little or no large-scale experimental validation [7, 8]. Large-scale studies of the effects of various oligonucleotide properties on microarray data quality are just starting to be published [9, 10], but few of them are designed to investigate the two-colour scheme typically used in CGH experiments. Most of these studies are concerned with the human genome, but it would be useful to know whether some design criteria could be relaxed for smaller and less complex genomes and, more generally, what penalty in data quality one pays when relaxing constraints on specific oligonucleotide properties.
In our research we are particularly interested in using oligonucleotide microarray CGH to detect induced deletions in the C. elegans genome [5, 11]. We designed our own microarray chips, but our criteria for oligonucleotide selection were arbitrary: they relied on the empirical observation that the data quality was adequate for the task rather than on experimental tests of various oligonucleotide features. Optimal design criteria are expected to depend on the hybridization conditions and possibly on the complexity of the genome under investigation. In the current publication we report our findings on the effects of varying the oligonucleotide design criteria and how these alterations affect our ability to detect deletions in both the C. elegans and human genomes. Considering the differences in size and complexity of these two genomes, the design properties we recommend here should be applicable to many organisms with a sequenced genome, provided that the hybridization conditions are not drastically different from those used in our experiments.
Results and discussion
Effects of various oligonucleotide properties on resolving power
The concept of resolving power we use here was introduced in a software evaluation study. It is a useful tool to detect and quantify small variations in overall data quality when changes are made to the oligonucleotide selection process or the data analysis procedure, or even when comparing different array platforms. Briefly, using the experimental distributions of the data points in the so-called normal regions and in the regions with copy number aberrations, it is possible to estimate the expected p-value associated with the detection of a typical aberrant DNA segment covered by a given number of probes. In the current work, the resolving power in C. elegans has been evaluated with the help of two strains with large heterozygous deletions previously found in CGH experiments. As a human sample, a pool of male DNA has been compared by CGH to a pool of female DNA so that probes targeting the X chromosome could be associated with a one-copy loss in the male sample. Details of the microarray design for both the human and C. elegans experiments can be found in the Methods section. Briefly, for both normal and deleted regions we had probes of length 50, 60 and 70 nucleotides manufactured. We also used a so-called isothermal design where the oligonucleotide length is varied in an attempt to obtain an approximately constant melting temperature. The only significant constraints applied on the oligonucleotides at the design stage were the exclusion of known repeats and, for the human chip, the elimination of segments with known CNVs and single nucleotide polymorphisms (SNPs). Microarrays with shorter oligonucleotides, for example 25-mers in the case of the Affymetrix platform, can also be used to infer CNVs [12, 14], but their optimization [15, 16] raises different issues than that of longer oligonucleotide arrays and therefore they will not be considered in the current study.
Standard oligonucleotide filters used in this work.
Property: Condition for elimination
Homopolymer: Length > 5 nucleotides
Melting temperature: More than 5°C from median
Self-folding energy: < -1 kcal/mol
Similarity with other genomic location
In order to quantify the best design practices with regard to minimizing potential non-specific hybridization we have introduced a series of perturbations on a pre-selected number of oligonucleotides, see Methods section for details. Basically, two different kinds of sequence similarity have to be considered, either the presence of a stretch of perfect identity of a given length within the oligonucleotide or a given similarity level over the whole oligonucleotide.
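The two kinds of sequence similarity described above can be illustrated with a minimal sketch. It assumes an ungapped comparison between a probe and a non-target sequence of the same length, which is a simplification; it is not the MegaBLAST search used in this work, and the function names are hypothetical.

```python
def longest_identical_stretch(probe: str, other: str) -> int:
    """Longest run of consecutive matching bases in an ungapped,
    position-by-position comparison of two equal-length sequences."""
    if len(probe) != len(other):
        raise ValueError("sequences must have the same length")
    best = run = 0
    for a, b in zip(probe, other):
        run = run + 1 if a == b else 0  # extend or reset the current run
        best = max(best, run)
    return best


def percent_similarity(probe: str, other: str) -> float:
    """Percentage of identical positions over the whole probe (ungapped)."""
    if len(probe) != len(other):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(probe, other))
    return 100.0 * matches / len(probe)


# Hypothetical 10-mers differing at a single position
probe = "ACGTACGTAC"
other = "ACGTTCGTAC"
stretch = longest_identical_stretch(probe, other)
similarity = percent_similarity(probe, other)
```

A probe can score low on one measure and high on the other, which is why the two are treated separately.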
Oligonucleotide design recommendations
Several constraints can be applied on oligonucleotide properties to improve the data quality but one should keep in mind that our design recommendations summarized in this section could easily be relaxed to improve coverage in specific genomic regions and still get useful information from the resulting data. In the current work we have only studied one-copy deletions but we expect that oligonucleotide design criteria improving the resolving power for detection of deletions should also improve the resolving power for detecting copy number gains. However, we cannot really infer that our design recommendations are necessarily optimal for experiments attempting to measure precise copy numbers where large copy numbers are expected.
We observed very little difference in data quality among oligonucleotide lengths of 50, 60 and 70 nucleotides, or even when using a so-called isothermal design where the length of each oligonucleotide varies between 50 and 70 in an attempt to minimize the overall width of the melting temperature distribution. Oligonucleotides of fixed length are simpler to work with at the design stage, and it is easier to avoid unwanted genomic regions with shorter oligonucleotides. Even if only for convenience, we will continue to use 50-mer oligonucleotides in our own research projects and suggest that oligonucleotides of this length will suffice for most other projects.
Our results demonstrate that filtering potential oligonucleotide probes according to their 15-mer count is probably the most effective way to control probe quality. In the current work repeat sequences have been eliminated right from the start and we therefore do not have a direct measurement of their effect on data quality. However, considering the effect of the 15-mer count above, it is safe to assume that most types of repeats should be excluded from most CGH array designs. The quality of the data can be improved by considering the oligonucleotides' self-folding tendency and the presence of homopolymers; a self-folding energy threshold of -1 kcal/mol seems optimal, and the elimination of oligonucleotides with homopolymers longer than 5 nucleotides is probably adequate in most situations. Only a small gain in data quality is achieved by restricting the oligonucleotides' melting temperature, and a relatively wide window of 10°C centered on the median value seems an acceptable compromise between data quality and coverage.
Two types of sequence similarity with multiple genomic regions have been investigated: the presence of a perfect identity over a fraction of the probe and the similarity over the whole length of the probe. Our elimination of non-unique 20-mers within the genome when designing oligonucleotides is conservative and is really only justified when the 20-mer is located at the end away from the slide in a 50-mer probe. A stretch of perfect identity of length 22 in the middle of a 50-mer probe will produce measurable cross-hybridization. The same is true for a similarity level above about 80% over the full length of a probe. While the constraints on oligonucleotide design described here are good starting points, the optimal constraints to be used to eliminate cross-hybridization from both types of sequence similarity will depend on the genome under investigation and the desired coverage in a given region.
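As an illustration, the homopolymer and melting temperature recommendations can be sketched as two simple filters. The melting temperature uses the formula given in the Methods section (64.9 + 0.41 GC - 500/L, with GC in percent); the function names and the example probes are hypothetical, and this is only a sketch of the idea, not the pipeline used in this work.

```python
import re
from statistics import median


def melting_temperature(seq: str) -> float:
    """Tm from percent GC content and length L, as defined in the Methods."""
    seq = seq.upper()
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    return 64.9 + 0.41 * gc_percent - 500.0 / len(seq)


def has_long_homopolymer(seq: str, max_run: int = 5) -> bool:
    """True if the sequence contains a run of more than max_run identical bases."""
    # one base followed by at least max_run copies of itself
    return re.search(r"(.)\1{%d,}" % max_run, seq.upper()) is not None


def apply_filters(probes, tm_half_window: float = 5.0, max_run: int = 5):
    """Keep probes within +/- tm_half_window of the median Tm
    (a 10 degree window by default) and without long homopolymers."""
    tms = [melting_temperature(p) for p in probes]
    centre = median(tms)
    return [
        p for p, tm in zip(probes, tms)
        if abs(tm - centre) <= tm_half_window
        and not has_long_homopolymer(p, max_run)
    ]


# Hypothetical 50-mers: two with 50% GC and one with 0% GC
probes = ["ACGT" * 12 + "AC", "AT" * 25, "CAGT" * 12 + "CA"]
kept = apply_filters(probes)
```

Because the window is centered on the median of the candidate set, the surviving probes depend on which candidates are supplied together.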
We have analyzed CGH experiments performed with NimbleGen's microarray platform in order to assess the relationships between various oligonucleotide properties and the quality of the data as measured by the ability to detect deletions in the human and C. elegans genomes. For the most part our microarray design recommendations summarized in the previous section are very similar for both species and they could probably be used without modifications for most other species with a sequenced reference genome. As expected, the larger and more complex human genome is more difficult to study with CGH and a deletion typically needs to affect approximately twice as many probes to achieve the same level of statistical confidence as in C. elegans. All our results were obtained with the NimbleGen platform with their standard hybridization protocol and of course our conclusions might not be valid for other microarray platforms or when using different hybridization conditions.
DNA samples from two C. elegans strains harbouring deletions were used in the current study; strains VC10019 (gk487/mIn1) and VC10020 (gk488/mIn1) carry 0.8 Mb and 0.5 Mb heterozygous deletions on chromosome II, respectively. For both C. elegans hybridizations, DNA extracted from the wild-type N2 strain has been used as the reference DNA. For the human DNA experiment, the sample was a commercial pool of DNA from 6 anonymous male individuals and the reference was a similar DNA pool from 6 female donors, both supplied by Promega Corporation. Details of the nematode culture, DNA preparation and labelling can be found in a previous publication.
Oligonucleotide microarray design
Number of oligonucleotides for each category represented on the arrays.
For the C. elegans array design, the oligonucleotides selected correspond to an approximately uniform tiling of the deletions in gk487 and gk488 plus 0.5 Mb of flanking regions on each side. The repeats annotated in Wormbase data freeze version WS170 have been eliminated from consideration, but no other constraints have been applied on the oligonucleotides except that they had to be synthesized in fewer than 180 cycles with NimbleGen's microarray manufacturing process. Approximately 22% and 16% of the oligonucleotides cover the deletions in gk487 and gk488, respectively.
A similar strategy has been applied to select the oligonucleotides for the human array. In this case the probes were approximately uniformly distributed over the whole genome but with increased density for chromosome X, resulting in approximately 37% of the oligonucleotides covering that chromosome. Once again repeats were eliminated, as were the regions with known SNPs (in dbSNP), CNVs and other genomic variants (in the Database of Genomic Variants) [24, 25].
Hybridization and data processing
The hybridization, image analysis, extraction of fluorescence intensities and their log2ratio values, together with their subsequent normalization, have been described in detail in a previous publication. Briefly, a two-colour CGH scheme has been used and the hybridization and image analysis have been performed as a commercial service by Roche NimbleGen Inc. No background has been subtracted before calculating the log2ratio values and the normalization followed a LOESS regression. The data discussed in this publication have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO Series accession number GSE12208.
The concept of resolving power, as used in the current work, has been described in a previous publication. The inputs to the resolving power calculations are the mean and standard deviation of the log2ratio data points in the normal and aberrant regions. Consequently, the calculations assume that the distribution of log2ratio is Gaussian for both types of region, and no attempt is made to account for possible autocorrelation between data points mapping to nearby genomic locations. Armed with the mean and standard deviation of both distributions and knowing the total number of data points on the array, one can evaluate the expected p-value for copy number aberrations affecting a given number of probes with the same mathematical formulation that is used to calculate a t-test. In the current study we are only interested in calculating the resolving power for one-copy deletions. The logarithm of the p-value coming out of a resolving power calculation is linear with the number of probes affected by the copy-number aberration, and therefore a resolving power curve can be summarized by its slope, which is easily calculated with a linear regression.
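The calculation can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the published implementation: the test statistic is the unequal-variance form of the t statistic, the p-value is taken from a normal tail instead of a t distribution (a reasonable stand-in only when the number of data points is large), and all numerical inputs are hypothetical.

```python
import math


def expected_log10_p(mean_normal, sd_normal, mean_aberrant, sd_aberrant,
                     n_affected, n_total):
    """Expected log10 p-value for an aberration covering n_affected probes."""
    n_normal = n_total - n_affected
    standard_error = math.sqrt(sd_aberrant ** 2 / n_affected
                               + sd_normal ** 2 / n_normal)
    t = abs(mean_normal - mean_aberrant) / standard_error
    # Assumption: two-sided normal tail used in place of the t distribution
    p = math.erfc(t / math.sqrt(2.0))
    return math.log10(p)


def resolving_power_slope(mean_normal, sd_normal, mean_aberrant, sd_aberrant,
                          probe_counts, n_total):
    """Least-squares slope of log10(p) versus number of affected probes."""
    xs = list(probe_counts)
    ys = [expected_log10_p(mean_normal, sd_normal, mean_aberrant,
                           sd_aberrant, n, n_total) for n in xs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical log2ratio summary statistics for normal and deleted regions
slope = resolving_power_slope(0.0, 0.2, -0.5, 0.2, range(2, 31), 10000)
```

A more negative slope means that each additional affected probe buys more statistical confidence, so slopes can be compared across filter settings or genomes.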
Oligonucleotide properties and standard filters
We are using the term standard filters to refer to the microarray design constraints on oligonucleotides similar to those that gave us acceptable results in previous CGH studies. In summary they correspond to 1) the elimination of repeat sequences, 2) the elimination of 20-mers occurring more than once in the genome, 3) the elimination of homopolymers longer than 5 nucleotides, 4) the selection of oligonucleotides with a melting temperature T_m within ±5°C of the median melting temperature, where T_m has simply been calculated as a function of percent GC content and oligonucleotide length L by T_m = 64.9 + 0.41 GC - 500/L, 5) the elimination of oligonucleotides with a self-folding energy smaller than -1 kcal/mol according to a hybrid-ss-min calculation, 6) the elimination of oligonucleotides mapping to more than one location in the genome with a similarity level above 70% over the whole oligonucleotide according to a MegaBLAST search, and 7) the elimination of oligonucleotides with a 15-mer count above median, where the 15-mer count is defined as the sum of the genomic frequency of all the constituent 15-mers within the oligonucleotide. With the exception of the first constraint, all the filters can be modified at the analysis stage before calculating the resolving power. The repeats have already been eliminated when designing the arrays; that constraint therefore cannot be modified during the analysis and applies to all the results presented in the current work.
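The 15-mer count defined above can be illustrated with a short sketch. It assumes a genome small enough to index in memory and counts k-mers on one strand only; both are simplifying assumptions, and the function names and toy sequences are hypothetical. A small k is used in the example so the numbers can be checked by hand.

```python
from collections import Counter
from statistics import median


def kmer_frequencies(genome: str, k: int = 15) -> Counter:
    """Frequency of every k-mer in the genome (single strand only)."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def kmer_count(oligo: str, freqs: Counter, k: int = 15) -> int:
    """Sum of the genomic frequencies of all constituent k-mers of the oligo."""
    return sum(freqs[oligo[i:i + k]] for i in range(len(oligo) - k + 1))


def filter_by_kmer_count(oligos, freqs: Counter, k: int = 15):
    """Eliminate oligos whose k-mer count is above the median of the set."""
    counts = [kmer_count(o, freqs, k) for o in oligos]
    threshold = median(counts)
    return [o for o, c in zip(oligos, counts) if c <= threshold]


# Toy example with k = 3 instead of 15
toy_genome = "ACGACGTT"
freqs = kmer_frequencies(toy_genome, k=3)
kept = filter_by_kmer_count(["ACGT", "CGTT", "CGAC"], freqs, k=3)
```

For a genome the size of the human one, an in-memory dictionary of all 15-mers is unlikely to be practical, so a dedicated k-mer counting tool would probably be needed; that choice is outside the scope of this sketch.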
We wish to thank Rick Zapf for growing the worms and preparing the DNA for the C. elegans experiments. This work was supported by grants from Genome Canada, Genome British Columbia and the Michael Smith Research Foundation.
- Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME, Carter NP, Scherer SW, Lee C: Copy number variation: new insights in genome diversity. Genome Res. 2006, 16: 949-961. 10.1101/gr.3677206.
- Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature. 2006, 444: 444-454. 10.1038/nature05329.
- Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee YH, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King MC, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M: Strong association of de novo copy number mutations with autism. Science. 2007, 316: 445-449. 10.1126/science.1138659.
- Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, Nord AS, Kusenda M, Malhotra D, Bhandari A, Stray SM, Rippey CF, Roccanova P, Makarov V, Lakshmi B, Findling RL, Sikich L, Stromberg T, Merriman B, Gogtay N, Butler P, Eckstrand K, Noory L, Gochman P, Long R, Chen Z, Davis S, Baker C, Eichler EE, Meltzer PS, Nelson SF, Singleton AB, Lee MK, Rapoport JL, King MC, Sebat J: Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008, 320: 539-543. 10.1126/science.1155174.
- Maydan JS, Flibotte S, Edgley ML, Lau J, Selzer RR, Richmond TA, Pofahl NJ, Thomas JH, Moerman DG: Efficient high-resolution deletion discovery in Caenorhabditis elegans by array comparative genomic hybridization. Genome Res. 2007, 17: 337-347. 10.1101/gr.5690307.
- Egan CM, Sridhar S, Wigler M, Hall IM: Recurrent DNA copy number in the laboratory mouse. Nat Genet. 2007, 39: 1384-1389. 10.1038/ng.2007.19.
- Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50 mer) microarrays. Nucleic Acids Res. 2000, 28: 4552-4557. 10.1093/nar/28.22.4552.
- Relógio A, Schwager C, Richter A, Ansorge W, Valcárcel J: Optimization of oligonucleotide-based DNA microarrays. Nucleic Acids Res. 2002, 30: e51. 10.1093/nar/30.11.e51.
- Sharp AJ, Itsara A, Cheng Z, Alkan C, Schwartz S, Eichler EE: Optimal design of oligonucleotide microarrays for measurement of DNA copy-number. Hum Mol Genet. 2007, 16: 2770-2779. 10.1093/hmg/ddm234.
- Wei H, Kuan PF, Tian S, Yang C, Nie J, Sengupta S, Ruotti V, Jonsdottir GA, Keles S, Thomson JA, Stewart R: A study of the relationships between oligonucleotide properties and hybridization signal intensities from NimbleGen microarray datasets. Nucleic Acids Res. 2008, 36: 2926-2938. 10.1093/nar/gkn133.
- Moerman DG, Barstead RJ: Towards a mutation in every gene in Caenorhabditis elegans. Brief Funct Genomic Proteomic. 2008, 7: 195-204. 10.1093/bfgp/eln016.
- Baross A, Delaney AD, Li HI, Nayar T, Flibotte S, Qian H, Chan SY, Asano J, Ally A, Cao M, Birch P, Brown-John M, Fernandes N, Go A, Kennedy G, Langlois S, Eydoux P, Friedman JM, Marra MA: Assessment of algorithms for high throughput detection of genomic copy number variation in oligonucleotide microarray data. BMC Bioinformatics. 2007, 8: 368. 10.1186/1471-2105-8-368.
- Kennedy GC, Matsuzaki H, Dong S, Liu WM, Huang J, Liu G, Su X, Cao M, Chen W, Zhang J, Liu W, Yang G, Di X, Ryder T, He Z, Surti U, Phillips MS, Boyce-Jacino MT, Fodor SP, Jones KW: Large-scale genotyping of complex DNA. Nat Biotechnol. 2003, 21: 1233-1237. 10.1038/nbt869.
- Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, Grigorova M, Jones KW, Wei W, Stratton MR, Futreal PA, Weber B, Shapero MH, Wooster R: High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res. 2004, 14: 287-295. 10.1101/gr.2012304.
- Mei R, Hubbell E, Bekiranov S, Mittmann M, Christians FC, Shen MM, Lu G, Fang J, Liu WM, Ryder T, Kaplan P, Kulp D, Webster TA: Probe selection for high-density oligonucleotide arrays. Proc Natl Acad Sci USA. 2003, 100: 11237-11242. 10.1073/pnas.1534744100.
- Zhang L, Wu C, Carta R, Zhao H: Free energy of DNA duplex formation on short oligonucleotide microarrays. Nucleic Acids Res. 2007, 35: e18. 10.1093/nar/gkl1064.
- Gräf S, Nielsen FG, Kurtz S, Huynen MA, Birney E, Stunnenberg H, Flicek P: Optimized design and assessment of whole genome tiling arrays. Bioinformatics. 2007, 23: i195-204. 10.1093/bioinformatics/btm200.
- Breslauer KJ, Frank R, Blöcker H, Marky LA: Predicting DNA duplex stability from the base sequence. Proc Natl Acad Sci USA. 1986, 83: 3746-3750. 10.1073/pnas.83.11.3746.
- Sugimoto N, Nakano S, Yoneyama M, Honda K: Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes. Nucleic Acids Res. 1996, 24: 4501-4505. 10.1093/nar/24.22.4501.
- Griffith M, Tang MJ, Griffith OL, Morin RD, Chan SY, Asano JK, Zeng T, Flibotte S, Ally A, Baross A, Hirst M, Jones SJ, Morin GB, Tai IT, Marra MA: ALEXA: a microarray design platform for alternative expression analysis. Nat Methods. 2008, 5: 118. 10.1038/nmeth0208-118.
- Baldocchi RA, Glynne RJ, Chin K, Kowbel D, Collins C, Mack DH, Gray JW: Design considerations for array CGH to oligonucleotide arrays. Cytometry A. 2005, 67: 129-136.
- Singh-Gasson S, Green RD, Yue Y, Nelson C, Blattner F, Sussman MR, Cerrina F: Maskless fabrication of light-directed oligonucleotide microarrays using a digital micromirror array. Nat Biotechnol. 1999, 17: 974-978. 10.1038/13664.
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29: 308-311. 10.1093/nar/29.1.308.
- Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet. 2004, 36 (9): 949-951. 10.1038/ng1416.
- Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW: Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet Genome Res. 2006, 115: 205-214. 10.1159/000095916.
- Edgar R, Domrachev M, Lash AE: Gene expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30: 207-210. 10.1093/nar/30.1.207.
- NCBI Gene Expression Omnibus, Series GSE12208. [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12208]
- Markham NR: Hybrid: A software system for nucleic acid folding, hybridizing and melting predictions. Masters thesis. 2003, Rensselaer Polytechnic Institute, Troy, NY.
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000, 7: 203-214. 10.1089/10665270050081478.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.