High-throughput marker discovery in melon using a self-designed oligo microarray
© Ophir et al. 2010
Received: 12 August 2009
Accepted: 28 April 2010
Published: 28 April 2010
Skip to main content
© Ophir et al. 2010
Received: 12 August 2009
Accepted: 28 April 2010
Published: 28 April 2010
Genetic maps constitute the basis of breeding programs for many agricultural organisms. The creation of these maps is dependent on marker discovery. Melon, among other crops, is still lagging in genomic resources, limiting the ability to discover new markers in a high-throughput fashion. One of the methods used to search for molecular markers is DNA hybridization to microarrays. Microarray hybridization of DNA from different accessions can reveal differences between them--single-feature polymorphisms (SFPs). These SFPs can be used as markers for breeding purposes, or they can be converted to conventional markers by sequencing. This method has been utilized in a few different plants to discover genetic variation, using Affymetrix arrays that exist for only a few organisms. We applied this approach with some modifications for marker discovery in melon.
Using a custom-designed oligonucleotide microarray based on a partial EST collection of melon, we discovered 6184 putative SFPs between the parents of our mapping population. Validation by sequencing of 245 SFPs from the two parents showed a sensitivity of around 79%. Most SFPs (81%) contained single-nucleotide polymorphisms. Testing the SFPs on another mapping population of melon confirmed that many of them are conserved.
Thousands of new SFPs that can be used for genetic mapping and molecular-assisted breeding in melon were discovered using a custom-designed oligo microarray. A portion of these SFPs are conserved and can be used in different breeding populations. Although improvement of the discovery rate is still needed, this approach is applicable to many agricultural systems with limited genomic resources.
Genetic maps based on molecular markers are a key step in marker-assisted selection (MAS) for plant breeding and for gene identification based on positional information . Discovery of novel genetic variation, enhanced selection techniques and the identification of genotypes with new or improved traits can lead to superior strains of interest. The density of the markers on a genetic map defines the map's resolution and the ability to introduce these traits using molecular means. New high-throughput technologies for the discovery of genetic markers and linkage analysis can enhance plant breeding efforts .
In the last few years, genomic tools have begun to replace the traditional methods of molecular marker discovery. Use of computational tools to search sequence information for molecular markers [3, 4], or genomic technologies such as whole-genome scanning (WGS), have enabled the discovery of thousands of markers in a single experiment . These markers can be mapped to create high-density genetic maps or used directly for MAS [3–5]. However, use of these types of technologies for plant breeding is still restricted, as many agricultural systems have a limited amount of genomic resources. Furthermore, plant species contain wide genetic variation that cannot be explored by sequencing only one or two plants from each species.
One of the technologies that has been used for identification of genetic variation is single-feature polymorphism (SFP) [3–5]. This approach is based on the concept that target DNA that perfectly matches its probes binds with greater affinity than one with a mismatch. Thus, natural imperfections can be detected as a difference in signal intensity in microarray hybridization using labeled genomic DNA (gDNA). SFPs can be used, with no further information, as markers, or they can be sequenced to identify the genetic difference (frequently a single nucleotide polymorphism (SNP) or a small indel). SFP technology draws its strength from the fact that it can be implemented in a variety of genetic applications, such as marker discovery and fine mapping of traits, as well as for genome-wide association studies [3, 6, 7]. SFP discovery has been relatively successfully implemented in model organisms such as Arabidobsis  and Drosophila , and in non-model organisms such as rice and soybean [9, 10]. To date, all hybridizations have been performed on high-density short oligonucleotides (Affymetrix arrays). These types of arrays are only available for a few organisms, they are expensive and they are not flexible in their design. Therefore, the ability to implement this technology on any custom array (Agilent, Nimblegen, and others) has the potential to create a very useful tool in many breeding programs for agricultural crops.
Melon is among the most important fleshy fruits for fresh consumption. It belongs to the family Cucurbitaceae which includes other important crops, such as watermelon, cucumber and squash. Melon is a diploid species (2n = 24) of African origin with a high level of phenotypic and molecular variation  and a genome size estimated at 450 Mb . Over the last 15 years, several research groups have published genetic maps for melon [13–22]. An integrated map, proposed to serve as a reference map, was constructed, based on two recombinant inbred line (RIL) populations ('Vedrantaise' × PI161375 and 'Vedrantaise' × PI414723). This map contains 668 loci; it spans 1654 cM on 12 linkage groups. It also includes 23 morphological traits . Another map, based on two populations derived from a cross between 'Piel de Sapo' and PI161375, identified several quantitative trait loci (QTLs) controlling fruit-quality traits . The Katzir group has recently established a reference map based on a cross between PI414723 and 'Dulce'  and constructed a QTL map using RILs developed from this population . Other resources that were created for melon are a limited EST collection  and BAC libraries . Although a substantial amount of effort has been invested in creating genetic maps and molecular tools for melon, only a few genes have been cloned by map-based cloning [25–27]. The genetic maps that exist for melon are not dense enough (containing only a few hundred markers) and are still the limiting factor for MAS, cloning of genes of interest and WGA (genome-wide association studies).
We chose melon as representative of a large group of vegetables and other crops that are agriculturally important but are still lagging behind in their genomic resources. Although sequencing has become extremely affordable using new technologies, we are still not at a stage at which we can sequence a variety of the same plant again and again. In this paper, we present the ability to use a custom-designed microarray with longer oligos (45-55 mer), based on a partial EST database of melon, to discover thousands of new markers for melon. These markers can be used for genetic mapping, breeding and association studies.
We designed a custom oligo microarray (Agilent) based on the melon ESTs presented in the Cucurbit Genomic database (ICuGI) . The microarray contains 186,600 unique probes from 16,114 UniGenes covering approximately 9 Mb of cDNA sequence from different ecotypes and different conditions (for further details, see Methods). For marker discovery, we used DNA from the parents of our mapping population, developed by Katzir's group from a cross between representatives of two subspecies of Cucumis melo L.: PI414723 (subspecies agrestis) and 'Dulce' (subspecies melo). Two biological replicates from PI414723 and two from 'Dulce' were used. Each biological replicate contained gDNAs pooled from 10 different plants. gDNA samples were labeled and hybridized using standard Agilent procedures for comparative genomic hybridization (CGH) (see Methods).
Sequence validation of putative SFPs
SFP per UniGene
No. of genes
Proportion of genes by SFP on array
Proportion of genes by SFP number in sequence validation
Transition-to-transversion ratios in SFPs
In the present study, we tested the feasibility of identifying genetic polymorphism in melon crops by hybridizing gDNA to custom-designed microarrays based on a partial transcriptome. SFP technology has been previously applied for a variety of genetic applications, such as linkage analysis and association studies, mostly in model organisms such as Arabidopsis and yeast [5, 32], with limited use in some other agricultural organisms [29, 33, 34]. One of the main reasons for the limited use of this technology is that the technological platform (Affymetrix arrays) is not available for many biological systems. In this study, we used custom-designed arrays with 45-55 mer probes (limitation of the Agilent CGH protocols) covering a substantial area of the partial transcriptome to discover genetic variation in melon. As a result, we increased the number of markers available for melon by an order of magnitude using the cost-effective SFP technology.
The melon cultivars showed substantial within-population genetic variation (Additional file 2), as can be expected in agricultural crops. This needs to be considered when defining the variation between different cultivars. Based on our results, the use of pools of different parents is essential to overcoming this issue.
With some exception [5, 32], studies have reported discrepancies between calculated and observed false discovery rates (FDRs) for SFPs. Sequence validation (Table 1) showed that the FDRs in our array (melon, 21%) were similar to those found in rice (19%)  and unsequenced Arabidopsis strains (15-20%) , better than those found in tomato (25-30%) , and not as good as those found in some of the Arabidopsis studies [5, 7]. One reason for this discrepancy is that genetic variation in loci near a target may influence its affinity to probes . In our study, we eliminated all putative SFPs that did not harbor changes inside the SFP, because we could not evaluate the connection between these differences and hybridization to the SFPs. Nevertheless, these differences could still be used as markers. In this work we tried to exclude the possibility of minor frequency alleles (MFAs). For example, focusing on a specific locus, in one parent the allele may be T in its two biological replicates (pools) whereas in the other parent gDNA pools, the T allele might make up only 80% of the pool and G might make up the other 20%. This might affect the hybridization signal, but it would not be observable at the sequence level. By showing that the inter-population variation is greater than the intra-population variation, we exclude this possibility (Additional file 2). Luo et al. , using labeled cDNA and Affymetrix array in barley, suggested that many of the SFPs that do not show any differences in direct sequencing are a result of cis-regulatory expression regulators. This explanation is not relevant to our study as we are using genomic DNA for labeling. As only partial transcriptome data exist for melon, the ability to discover SFPs and to eliminate cross-hybridizations and gene duplications is limited. Divergence of exon-intron boundaries and alternative splicing of cDNA that are printed on the array and the labeled genomic DNA could also contribute to the high FDR that is found by sequencing. Another explanation for the discrepancy in FDR can be the power of the statistical test, which is might be too low given the number of replicates in the different experimental systems, We do not believe that this is the reason for the high FDR as we used the empirical Bayesian model that can repair low power in limma ). One can improve the discovery rate by ignoring all multiple SFPs (four or more), which can be considered copy number variations (Figure 2). It is also possible that by using this type of long probe array, we lose a portion of the variation; however in doing so, we gain better coverage of the transcriptome, flexibility in format design and lower price. We anticipate that by improving the array design and adding the complementary strand for each probe, the FDR will be reduced as suggested by Smemo et al . As the new Agilent and Nimblgen platforms hold much larger probes sets, one can increase the number of probes for any given sequence without hampering coverage. Another solution is increasing the statistical threshold in order to the decrease the FDR. There no doubt that this approach will reduced the discrepancy in FDR but will also will reduced dramaticly the number of SFPs that are discovered .
The question of conservation of the SFP markers across melon cultivars was tested using two different mapping populations (Figures 5 and 6). The data presented here show that a large portion of the SFPs are conserved between the different populations. This experiment demonstrated another advantage of SFP technology, which utilizes large sets of probes with no assumption of the location of the changes in different mapping populations. The large set of probes allows identification of large sets of genetic markers in different mapping populations. Although these sets of markers are not ordered, one can still use them for breeding experiments as a whole -genome screening tool and for creating "classical" genetic maps.
Developing a custom-designed array creates the possibility of using SFP technology in many agricultural systems that do not have an Affymetrix platform and allows much more flexibility in design and use. The low price of the screening and the relatively rapid execution and flexibility make it a useful technology. Although SFP technology is far from perfect, and a reduction in FDRs is needed to make thi approach more efficient, we firmly believe that it can contribute to marker discovery, MAS and WGS approaches in various crop plants. We therefore envision the ability to use WGS (SFP technology and others) as a routine element in the process of MAS in breeding programs as the next challenge in the field of breeding technology.
SFPs can be discovered using custom-designed microarrays and partial genomic data. This approach can be used with many agricultural organisms for marker discovery and breeding.
A single-feature array was designed based on 16,637 non-redundant cDNA sequences of the Melon Genomics Database from different genotypes and conditions . This set of sequences was used as input for OligoWIZ 2.0  to design an optimal set of probes along the melon partial transcriptome. We designed probes of 50 base pairs (bp) in length ± 5 bp with Tm optimization (76.6°C). The Tm, as well as other features of the probes such as cross-hybridization, folding, and low complexity, were summed to total scores with weightings of 0.245, 0.61, 0.025, and 0.12, respectively. The output of this process was 198,500 probes. Probes with a total score lower than 0.3 that included "N" in their sequence were filtered out, leaving a total of 186,600 probes. This design was loaded on the Agilent eArray site for array production https://earray.chem.agilent.com/earray/.
Genomic DNA was extracted using standard procedures  from young leaves pooled from 10 plants of each of the different cultivars. Two separate pools were created for each cultivar. The DNA was treated with RNAseH (Sigma, USA) and re-purified using phenol/chloroform/isoamyl alcohol followed by ethanol precipitation.
gDNA (2.5 μg) was digested with restriction enzymes Alu I and Rsa I (Promega, USA). The digested DNA was used for labeling and hybridization following Agilent procedures for CGH . Scanning of the array was performed with the Agilent scanner. Labeling, hybridization and scanning were done at the Weizmann Institute's DNA Microarray Facility (Rehovot, Israel).
Probe signal pre-processing and fitting were performed with the R-package for LIMMA , following background subtraction; for within-array normalization we applied the "lowess" method , and for between-array normalization the "Aquntile" method . To select for statistical significance of the signal ratios, two subsequent steps were applied.
Where the vector of P-D, D-P, D-D, and P-P is the vector of log signal ratios, the matrix of 1, 0, and -1 is the design matrix, and the vector of D2-D1, P1-D1, and P2-D1 is a vector of the estimated coefficients. Next we used these coefficients to estimate three comparisons (contrasts):
2) The ordinary t-statistic of each comparison (contrast) for each SFP probe was moderated by shrinking its standard error towards a common value borrowed from the gene ensemble by Bayesian as described in Smyth 2004 . Therefore, from equations (2), (3), and (4) we used the moderated-t statistics for the statistical significance test. The p -value was corrected using Benjamini and Hochberg's method  to get an adjusted p -value (or q -value).
The B statistic is a Bayesian view of the p -value giving the log odds that an SFP is true (probe signal ratio is significant). The exact formula of B is described in .
SFPs were selected randomly for validation from the list of putative SFPs. Based on sequence information from the melon database, primers were designed from 50-100 bp of the SFP using Primer3 software . PCRs were performed from DNA pools of PI414723 and 'Dulce' (PCR conditions: 30 cycles of 30 s at 94°C, 30 s at 58°C, 1 min at 72°C). Sequencing was executed by Macrogen Inc., Korea. Sequence analysis was performed using Vector NTI software (Invitrogen). Only samples that showed good-quality sequences from the two parents were analyzed and compared to the database sequence that was used as a reference.
positive predictive value
false discovery rate
bacterial artificial chromosome
This research was funded by a grant from the Agricultural Research Organization manager fund. Special thanks to Dr. Shirely Saban from the DNA Microarray Unit of the Weizmann Institute for helping with technical issues and performing the microarray hybridizations. The authors would also like to thank Dr. Zohar Yakhini and Prof. Yoav Benjamini for fruitful discussions. This is a contribution No. 113/2010 series from the Agriculture Research Organization, the Volcani Center, Bet Dagan, 50250, Israel.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.