In this paper we provide an initial platform for functional genomics of melon through the identification of more than 16,000 unigenes assembled from almost 30,000 ESTs sequenced from 8 melon cDNA libraries. It is probably premature to estimate the proportion of melon genes represented in this dataset, but based on available data for other plant species (i.e. Arabidopsis and rice), it is likely that the melon unigene set characterised here represents between one-third and one-half of the expressed, protein-coding genes of melon. Libraries were constructed from various tissue types, but with a bias towards fruit development and pathogen-infected tissues. Data from these libraries will be a useful resource of genes for experiments aimed at understanding important processes involved in fruit development and resistance to viral and fungal pathogens. The data presented here also provide an important tool for generating markers to saturate melon genetic maps.
In contrast to typical EST gene-sampling strategies reported previously, we found a low degree of redundancy in the sequences determined. Clustering reduced the number of sequences to 56% of the original, from 29,604 good-quality ESTs to 6,023 contigs and 10,614 singletons. Contigs with more than 8 ESTs were scarce, and the majority of contigs were formed by 2 or 3 ESTs. Redundancy of the sequences derived from each library ranged from 13% to 20%, with singletons constituting approximately one third of the unigenes determined per library. This low redundancy is probably due to the success of the normalization process, which suppresses superabundant transcripts specific to a given tissue or condition. Normalization precludes in silico analysis of gene expression, but by reducing redundancy it greatly increases the number of unigenes that can be determined. Here we used a recently described normalization protocol based on the cleavage of DNA or DNA-RNA duplexes by a specific DNase; in our hands, this process has proven simple, reproducible and efficient. Another factor that contributed to the low redundancy values obtained was the sequencing of libraries from very distinct tissues. Thus, the number of library-specific unigenes was about one half of the total number of unigenes contributed by each library, suggesting that further sequencing of the libraries still has the potential to provide a good number of new, non-redundant sequences.
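The clustering figures quoted above can be checked with a few lines of arithmetic. The Python sketch below is only an illustration: the function name `summarize_clustering` is hypothetical, and it assumes that every good-quality EST ended up either in a contig or as a singleton. The mean number of ESTs per contig is derived here from the reported counts and is not a value stated in the text.

```python
def summarize_clustering(total_ests, contigs, singletons):
    """Return basic summary statistics for an EST clustering run."""
    unigenes = contigs + singletons
    # Assumption: every EST that is not a singleton was assembled into a contig.
    ests_in_contigs = total_ests - singletons
    return {
        "unigenes": unigenes,
        "unigene_fraction": unigenes / total_ests,
        "mean_ests_per_contig": ests_in_contigs / contigs,
    }


# Counts reported in the text.
summary = summarize_clustering(total_ests=29_604, contigs=6_023, singletons=10_614)
print(summary["unigenes"])                       # 16637
print(f"{summary['unigene_fraction']:.0%}")      # 56%
print(f"{summary['mean_ests_per_contig']:.2f}")  # 3.15
```

Under that assumption, a mean of roughly three ESTs per contig agrees with the observation that most contigs were formed by 2 or 3 ESTs.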
cDNA sequences are a useful source of SSRs, which are excellent molecular markers due to their high degree of polymorphism. A common feature of cDNA sequences obtained from plants is the high frequency of SSRs that they contain. We identified more than 1,000 potential SSRs in the melon dataset, with approximately 6% of the melon unigenes containing di-, tri- or tetranucleotide repeats. There was a clear bias toward AG and AAG repeats, which account for 67% of the SSRs. In contrast, the GC repeat was not found in the melon dataset. A similar bias toward AG and against CG repeats has been identified in Arabidopsis and other plant species [40, 78]. As proposed in at least one other instance, this may be due to the tendency of CpG sequences to be methylated, which might inhibit transcription. Another interesting feature of melon SSRs relates to their localization with respect to putative initiation and termination codons. It is known that the UTRs of transcribed sequences are richer in SSRs than coding regions, particularly the 5'-UTRs [36, 40]. However, in the melon dataset, a high proportion of SSRs (29.5%) were found in ORFs. A separate analysis of the localization of di-, tri- and tetranucleotide repeats showed that di- and tetranucleotides were preferentially located in UTRs, whereas trinucleotides were found in both UTRs and ORFs, consistent with maintenance of the ORF coding capacity. Thus, the prevalence of trinucleotide repeats in the melon dataset (71%) explains this result.
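To illustrate what a search for di-, tri- and tetranucleotide repeats involves, the following Python sketch scans a sequence with regular expressions. It is not the tool used in this study: the function name `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` are assumptions chosen for the example, since the thresholds applied are not given in this section.

```python
import re

# Hypothetical minimum number of repeat units per motif length
# (2 = di-, 3 = tri-, 4 = tetranucleotide); not taken from the paper.
MIN_REPEATS = {2: 6, 3: 4, 4: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, n_units) for each perfect SSR in seq."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A motif of length k followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if len(set(motif)) == 1:
                continue  # homopolymer run, not a di/tri/tetra repeat
            if k == 4 and motif[:2] == motif[2:]:
                continue  # e.g. AGAG is a dinucleotide repeat in disguise
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return hits


example = "TT" + "AG" * 6 + "CC" + "AAG" * 4 + "T"
print(find_ssrs(example))
# [(2, 14, 'AG', 6), (16, 28, 'AAG', 4)]
```

A perfect trinucleotide repeat always spans a multiple of three bases, so gaining or losing repeat units does not shift the reading frame, which is the usual explanation for why such repeats are tolerated within ORFs.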
We identified 356 high-quality SNPs in the melon sequence dataset. Since the non-redundant sequences analysed here encompassed 4.5 Mb, this corresponds to approximately one SNP every 12,000 bp of sequence. This low frequency is probably due to the limited number of melon genotypes used and the low redundancy found among libraries. In fact, when the frequency of SNPs is computed in relation to the length and number of contigs containing SNPs, the corresponding value (one SNP in every 616 bp of sequence) is of the same order of magnitude as values previously calculated for melon (441 bp) and other plant species. With the advent of high-throughput detection systems, the SSRs and SNPs identified here will constitute an important resource for mapping and marker-assisted breeding in melon and closely related crops.
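The overall SNP density follows directly from the two numbers given above. The short Python sketch below reproduces that calculation; the helper name `bp_per_snp` is hypothetical, and 4.5 Mb is taken as 4,500,000 bp.

```python
def bp_per_snp(total_bp, n_snps):
    """Average number of base pairs of sequence per SNP."""
    return total_bp / n_snps


# Figures reported in the text: 4.5 Mb of non-redundant sequence, 356 SNPs.
overall = bp_per_snp(4_500_000, 356)
print(round(overall))  # 12640
```

The second figure in the text (one SNP per 616 bp) uses a different denominator, namely the length of only those contigs that contain SNPs. That length is not reported in this section, so it is not reproduced here.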
As a first approach to the function of melon unigenes, we carried out a bioinformatics analysis based on BLASTX and matches with the Pfam database. The proportion of melon unigenes with no similar sequences in databases was quite high, suggesting that the melon dataset may encompass an important number of melon-specific sequences. However, the proportion of specific sequences might be overestimated because the BLAST searches were performed with unigene sequences, which in many cases do not cover the complete length of the transcript. We performed a functional classification of the unigenes following the Gene Ontology scheme, which is one of the more versatile and complete systems for functional classification. A comparison of the distributions of melon and Arabidopsis unigenes in GO categories showed that both followed similar tendencies, suggesting that the melon dataset is representative of the whole melon transcriptome. This is remarkable, as the number of different libraries sequenced was relatively small; again, this is probably due to the success of the normalization process. We also carried out specific searches for genes involved in pathways of particular relevance in melon, such as resistance response and fruit development, identifying a remarkable number of melon candidates. For example, an ortholog of the flagellin receptor FLS2 from Arabidopsis has been identified, together with 163 candidate RLKs that may have critical roles in pathogen recognition or diverse signalling processes. Similarly, up to 8 MADS-box gene homologs with potential roles in development were found in the melon dataset. Moreover, a bioinformatics approach [52, 53] allowed the identification of potential precursors of melon miRNAs together with several potential targets in the melon dataset. This finding opens the door to biotechnology approaches based on the use of artificial miRNAs to specifically silence melon genes [81, 82].
The transcript accumulation analysis for the 20 selected genes revealed important changes in gene expression associated with pathogen infection and fruit development. For virus infection, the accumulation of transcripts remained unaltered for 12 genes, but showed a significant increase for 6 genes and a decrease for 2. Among the set of genes analysed, TOM2A and EIF4E were known to code for virus susceptibility factors [27, 83]; the expression of TOM2A increased after CMV infection, consistent with its requirement by the virus, but this was not the case for EIF4E. Different hypotheses can explain this result: since EIF4E is an abundant, housekeeping protein, increased expression may not be essential for virus multiplication; alternatively, CMV may not use EIF4E in melon or may use a factor coded by a different member of the 4E family; it may also be that the timing of the sampling for this experiment was not appropriate to detect such an effect, as the requirement for EIF4E might occur very early during virus multiplication. In the cases of infection of susceptible ("Piel de sapo") and resistant (pat81) melons by M. cannonballus, more extensive alterations in gene expression seemed to occur in the susceptible than in the resistant accession. Significantly, for the susceptible accession, stress-responsive genes (e.g. HSP101) appeared to be maximally induced, whereas for the resistant accession, a gene encoding a WRKY70 transcription factor, potentially involved in resistance response, was induced to high levels. In addition, expression of GA2OX1 increased about 1.5 times in pat81 after the M. cannonballus attack, whereas it decreased in "Piel de sapo". GA2ox is a major gibberellin (GA) catabolic enzyme, with an important role in controlling GA levels in plants. Hormones control many plant developmental processes, and strong evidence indicates that hormone signalling is involved in the regulation of root growth and architecture [84, 85].
The differential response of the GA2OX1 gene in the two melon genotypes is consistent with an enhanced root growth in pat81 after infection. Notably, other genes involved in hormone-mediated signalling pathways, such as the IAA9 gene, did not show such a differential response to M. cannonballus infection between the two genotypes. In the case of fruit development, differences in the expression of selected genes between immature and ripening fruits appeared to be even sharper than those between healthy and pathogen-infected tissues. Specific roles during fruit development for HSP70, TOM2A and TOM3 have not been identified, though an increased expression has been shown at least in the case of TOM2A in tomato. The ethylene receptor gene EIN4 showed a two-fold increase in expression. EIN4 is the ortholog of Arabidopsis EIN4 and tomato LeETR4 [88, 89]. In tomato, LeETR4 is also highly expressed in ripening fruit, suggesting that it modulates ethylene signalling during ripening. The MADS-box gene SVP showed about a four-fold decrease in expression. This gene is the ortholog of tomato JOINTLESS, which specifies the abscission zone in tomato. In tomato fruit microarray hybridizations, the expression of JOINTLESS also decreased from 7 to 57 DAP, in agreement with our data for melon. The lycopene epsilon cyclase (LUT2) and xyloglucan endotransglycosylase (TCH4) genes showed an approximately four-fold decrease in expression during melon fruit development. These findings fit with the patterns of expression of these genes in tomato, where their transcript levels decrease to a non-detectable level in ripe fruits [90, 91].