Determination of the genome sequence of the model organism Drosophila melanogaster was a landmark that launched a new era of functional genomic studies in complex organisms. A nearly complete version of the euchromatic DNA sequence was first released in March 2000 through a collaborative effort of the Drosophila Genome Projects and Celera Genomics. Using gene prediction software in combination with searches of protein and EST databases, initial in silico analyses indicated the existence of 13,601 protein-coding genes (PCGs), a surprisingly small number compared to the approximately 19,000 PCGs encoded in the C. elegans genome.
After release 1, an intensive collective effort was undertaken to improve sequence quality and annotation, fill the gaps, and correct the assembly. To generate the information needed to define the transcripts encoded in the genome, the Berkeley Drosophila Genome Project (BDGP) initiated high-throughput production of both EST and full-length cDNA sequences from conventional and normalized cDNA libraries representing different tissues and developmental stages. This effort was followed by non-BDGP projects, with a major contribution from the Exelixis Drosophila melanogaster EST project, which sequenced random-primed libraries of mixed-stage embryos, imaginal discs, and adult heads to increase the coverage of transcription units. Currently, 39,346 full-length mRNA and 532,557 EST sequences are available in the NCBI database, grouped into 16,681 clusters according to UniGene. Since 2000, several subsequent genome versions have been released, each improved by BDGP and annotated by FlyBase. Release 3.2, considered the first finished version, was published in March 2004 and provided a complete revision of all gene models and other genome features, estimating a total of 13,792 PCGs plus 527 non-protein-coding genes (tRNAs, rRNAs, microRNAs, sn/snoRNAs). Release 4.3, the latest annotated genome version, published in March 2006, includes a total of 14,816 genes and can be searched by gene annotation, BLAST, or sequence ID at the FlyBase website.
During the last few years, a lively debate has arisen about the number of PCGs in organisms with sequenced genomes. For D. melanogaster, estimates have ranged from the initial ~13,600 predicted coding genes to about 16,000, the latter based on microarray expression data. A careful computational and experimental analysis carried out to validate the Drosophila genome annotation recently concluded that the D. melanogaster genome contains approximately 14,000 protein-coding genes, although some genes with unusual features that make them refractory to prediction methods may remain to be discovered. However, an accurate picture of the complexity of the D. melanogaster transcriptome is still emerging. In this respect, it has been inferred from DNA oligonucleotide microarrays, with unique sequences tiled throughout the genome and across predicted splice junctions, that over 40% of Drosophila genes contain one or more alternative exons. Additionally, a transcription map of the first 24 hours of development at 35 bp resolution indicates that 30% of the transcribed regions are still unannotated. Approximately 23% of these are intronic and 7% correspond to intergenic regions. Based on manual and computational surveys designed to identify coregulated expression patterns between unannotated and annotated genome regions, it was estimated that 29% of the unannotated regions are part of incompletely annotated transcripts or are potential alternative exons of known genes. Therefore, correcting and refining the genome annotation is an iterative, ongoing task that depends on experimental data for final validation, especially for the identification of rare transcripts and alternative splice variants. To cover the diversity of transcripts expressed in Drosophila, the generation of EST information from different sources is currently under way [2, 3].
Here we use the Open Reading Frame Expressed Sequence Tags (ORESTES) methodology, which is based on low-stringency RT-PCR, to generate D. melanogaster expressed sequence information. ORESTES are preferentially derived from the central coding portion of transcripts and frequently identify less abundant messages [12, 13]. This approach was previously applied to the characterization of the human transcriptome, validating a large percentage of genes and identifying 219 unannotated transcribed sequences on chromosome 22. More recently, a large-scale analysis of ORESTES derived from head, neck, and thyroid tumors pointed to 788 putative new alternative splicing isoforms. A subset of 34 was submitted to experimental validation, resulting in the confirmation of 23 (68%) new alternative exons.
Analysis of 1,303 Drosophila ORESTES clusters revealed 68 potentially transcribed regions that are unannotated in the current version of the genome (release 4.3). Experimental validation of 38 (~50%) of these unannotated ORESTES revealed 17 new exons that most likely belong to low-abundance transcripts. Using the ORESTES information together with a PCR-based approach, we obtained the complete coding sequence of a new serine protease whose mRNA expression is induced upon infection. Our data reinforce the importance of PCR-based methodologies for refining the Drosophila transcriptome, particularly for the identification of previously unannotated low-copy transcripts.
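As an illustration only, the idea of flagging clusters that fall outside the annotation can be sketched as a simple interval-overlap test. This is not the pipeline used in this work: the `Interval` type, the function names, the coordinate convention (0-based, half-open), and the toy coordinates are all assumptions introduced for the example, and a real analysis would also need to handle strand, spliced alignments, and alignment quality.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """Hypothetical genomic interval: 0-based start, exclusive end."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome arm."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def unannotated_clusters(clusters: Dict[str, Interval],
                         annotated_exons: List[Interval]) -> List[str]:
    """Return the names of clusters that overlap no annotated exon."""
    return [
        name
        for name, region in clusters.items()
        if not any(overlaps(region, exon) for exon in annotated_exons)
    ]


# Toy data (invented coordinates, for illustration only).
annotated_exons = [
    Interval("2L", 100, 200),
    Interval("2L", 500, 650),
    Interval("X", 50, 120),
]
clusters = {
    "cluster_A": Interval("2L", 150, 260),  # overlaps an annotated exon
    "cluster_B": Interval("2L", 300, 400),  # falls between annotated exons
    "cluster_C": Interval("X", 120, 180),   # adjacent to an exon, no shared base
    "cluster_D": Interval("3R", 100, 200),  # no annotation on this arm
}

novel = unannotated_clusters(clusters, annotated_exons)
print(novel)
```

In this toy example, clusters B, C, and D would be reported as candidates. Candidates identified this way would still require experimental validation before being considered new exons.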