cDNA library construction
RNA was isolated following the RNeasy Maxi Kit protocol (Qiagen) (Additional file 6, Figure S1). cDNA production and normalization were performed by GenXPro GmbH (Frankfurt, Germany). To optimize the recovery of full-length transcripts, Calf Intestinal Phosphatase (CIP, Ambion) was used to remove the 5' phosphate from all RNA species except those with a cap structure. Tobacco Acid Pyrophosphatase (TAP, Ambion) was then used to remove the cap structure, leaving RNA molecules with a 5' phosphate. An RNA oligonucleotide ("SceI-site") was ligated to the previously capped RNAs. First-strand synthesis was performed using the SuperScript® cDNA Synthesis System as described by the manufacturer, with SuperScript III as reverse transcriptase (Invitrogen) and a biotinylated oligo(dT) primer with a T7 recognition site at the 5' end. The first-strand product was used as template for a 20-cycle PCR using primers specific for the T7 site and the SceI-site. The PCR product was then normalized using the Kamchatka crab double-strand-specific nuclease (DSN; Evrogen), as described by Zhulidov et al. (Additional file 6, Figure S2). The normalized product was amplified again by PCR for 15 cycles and used as template for T7 RNA polymerase amplification (Fermentas) for 5 hours. To select for longer transcripts, RNA > 600 bp was gel-selected and purified from the agarose gel (Additional file 6, Figure S3). The recovered RNA was fragmented by zinc treatment. Illumina GA-II sequencing adapters were ligated to the fragments as described in Illumina's Paired-End Sample Preparation Guide (catalogue number PE-930-1001). A 300-400 bp smear containing the cDNA fragments flanked by Illumina PE adapters was cut from an agarose gel, and the cDNA was purified. Sequencing was performed on an Illumina GA-II using the Chrysalis 36 cycles v 3.0 sequencing kit, with one lane of 2 × 101 bp reads from both ends of the fragments ("paired ends") with an insert length of > 180 bp.
The PhiX control lane revealed error rates of 3.16% for Read 1 and 4.07% for Read 2. Base calling was performed using the Solexa GAPipeline 1.4.0. We chose the paired-end sequencing approach because it offers two advantages over single-end sequencing. First, the two reads of a pair are known to lie in close vicinity, and their position relative to one another is known. Second, if the reads are long enough, they may be joined into a single read that is longer than the usual read length.
EST analysis and bioinformatics
The raw sequence data were quality checked and trimmed: if five consecutive bases had the quality score "B", these bases plus all subsequent bases towards the 3' end of the read were discarded. T7 and Illumina adaptors were removed using an in-house script.
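The trimming rule described above can be illustrated with a minimal Python sketch. It is not the in-house script used in this study; the function name `trim_read` and its parameters are hypothetical, and the sketch assumes that each read is available as a base string together with a quality string of the same length.

```python
from typing import Tuple


def trim_read(seq: str, qual: str, low_char: str = "B",
              run_length: int = 5) -> Tuple[str, str]:
    """Cut a read at the first run of `run_length` low-quality bases.

    The run itself and every base 3' of it are discarded. Reads without
    such a run are returned unchanged. (Hypothetical helper, for
    illustration only.)
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Position of the first stretch of consecutive low-quality symbols.
    cut = qual.find(low_char * run_length)
    if cut == -1:
        return seq, qual
    return seq[:cut], qual[:cut]


if __name__ == "__main__":
    # Five consecutive "B" scores start at position 4 (0-based).
    print(trim_read("ACGTACGTAC", "hhhhBBBBBh"))
```

Shorter runs of "B" (fewer than five in a row) leave the read untouched in this sketch, which follows the rule as stated in the text.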
EST contigs were assembled using three different assemblers:
a) VELVET short read assembler, version 0.7.55, as a commonly used short read assembler, with the following parameters: k-mer hash length 31, coverage = 1, read category = shortPaired.
b) OASES short read assembler, version 0.1.11, recently released as an extension for VELVET and optimized for EST assembly (available at http://www.ebi.ac.uk/~zerbino/oases/). It uses preliminary VELVET assemblies with reduced setting options as input. Two additional VELVET assemblies were created from the raw read data with k-mer sizes 21 and 31 and read category set to shortPaired. The same k-mer sizes were additionally applied to the previously trimmed sequence set; in both cases this resulted in fewer contigs of smaller size (data not shown), and these assemblies were thus excluded from further analysis.
c) SeqMan NGEN program (hereafter called NGEN) from the Lasergene DNAStar software package (Madison, WI; http://www.dnastar.com/t-products-seqman-ngen.aspx), with k-mer size = 31 and the Illumina reads and PairedEnd option turned on for the first contig assembly, using the Illumina raw reads.
If not otherwise stated, default parameters were used. Assemblies were run on computers with 32 GB RAM and finished within 24 h.
As the contigs created by the different assemblers differed in length, number of BLAST hits and position along a transcript, we additionally investigated whether a meta-assembly, i.e. the combined assembly of all four contig sets, might increase contig information. The combined assembly of several contig sets has already proven useful for 454 contigs obtained by different assemblers. As our first-order contig analyses revealed adaptor remains on some contigs, all first-order contig sets were BLASTed against the T7 and Illumina adaptors to ensure a high-quality meta-assembly. If similarities were detected, the corresponding bases plus all subsequent bases up to the nearest sequence end were trimmed (this affected 3-7% of the OASES and VELVET contigs and 35% of the NGEN contigs). We chose NGEN for the meta-assembly because it gave good results in our study and, in particular, is also suited to long read assembly. Default parameters were used for the "other reads" option with k-mer size = 25.
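The adaptor trimming step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run: the function `trim_adaptor` is hypothetical, hit coordinates are assumed to be 1-based and inclusive on the contig (as in BLAST tabular output), and a hit lying exactly in the middle of a contig is assumed to be trimmed towards the 5' end.

```python
def trim_adaptor(contig: str, hit_start: int, hit_end: int) -> str:
    """Remove an adaptor hit plus all bases between it and the nearest end.

    `hit_start` and `hit_end` are 1-based, inclusive contig coordinates.
    (Hypothetical helper, for illustration only.)
    """
    start, end = sorted((hit_start, hit_end))
    if not 1 <= start <= end <= len(contig):
        raise ValueError("hit coordinates lie outside the contig")
    bases_before = start - 1          # bases 5' of the hit
    bases_after = len(contig) - end   # bases 3' of the hit
    if bases_before <= bases_after:
        # Hit is nearer the 5' end (or equidistant, by assumption):
        # keep only what follows the hit.
        return contig[end:]
    # Hit is nearer the 3' end: keep only what precedes the hit.
    return contig[:start - 1]


if __name__ == "__main__":
    contig = "AAACCCGGGGGGTTT"
    print(trim_adaptor(contig, 2, 4))    # hit near the 5' end
    print(trim_adaptor(contig, 13, 14))  # hit near the 3' end
```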
Contigs from all assemblies can be found in Additional files 7, 8, 9, 10 and 11.
The BLAST program was used to perform BLASTN and BLASTX homology searches against the RefSeq Invertebrates (Release38_11-11-2009) and UniProt protein (UniProtKB/Swiss-Prot Release 2009_09) databases, respectively. As the Radix genome has not yet been sequenced, we used three indirect quality measures for an overall comparison of all contig sets. In addition, we were able to inspect contig quality directly by making use of the recently obtained mitochondrial genome sequence of R. balthica. The four quality measures (three indirect and one direct) were defined and applied as follows:
a) Mapping accuracy has been shown to increase significantly with contig size (100 bp vs. 300 bp). We therefore compared the number of BLAST hits obtained with a moderate cutoff value of < e-5 to the number obtained with a more stringent cutoff value of < e-10 and/or with contigs larger than 200 bp.
b) The number of BLASTX hits against the EST database of Biomphalaria glabrata (NCBI EST databases, as of March 2010), the only other gastropod with a large EST data set available for comparison at the time. B. glabrata belongs to a sister superfamily, which diverged about 180 million years ago. The EST data set comprises 86,121 ESTs, of which 26,730 resulted in a hit with a cutoff value of < e-5, and 5,753 UniGenes.
c) After removing redundant gene hits from each UniProt BLASTX data set, only single hits with < e-10 from contigs longer than 200 bp were used to determine the number of concordant hits between the contig sets produced by the three different assemblers. The higher the number of overlaps, the better; this follows the assumption that genes identified from contigs of two different assemblers should be trustworthy.
d) All contigs were subjected to a BLASTN search against the 13 mitochondrial genes of R. balthica (NCBI Accession No.: HQ330989). The overall number of hits and the number and percentage of identical bp between contigs and genes were determined and compared. Non-matching contig sections were visually scored according to their position at either the ends or the middle of the contig. Additionally, non-matching contig sections were separately aligned against the Illumina sequencing adaptor sequence to test whether they represent adaptor remains.
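Measures a) and c) both come down to filtering BLAST hits by e-value and contig length and then comparing the remaining gene identifiers between contig sets. The following Python sketch shows one way this could be done. It assumes that hits have already been parsed into simple records; the `Hit` type, the function names and the example identifiers are hypothetical and are not taken from the original analysis.

```python
from itertools import combinations
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record layout)."""
    contig_id: str
    contig_length: int
    subject_id: str
    evalue: float


def filter_hits(hits: Iterable[Hit], max_evalue: float = 1e-10,
                min_length: int = 200) -> List[Hit]:
    """Keep hits with e-value < max_evalue on contigs longer than min_length."""
    return [h for h in hits
            if h.evalue < max_evalue and h.contig_length > min_length]


def unique_subjects(hits: Iterable[Hit]) -> Set[str]:
    """Collapse redundant hits so that each gene is counted once."""
    return {h.subject_id for h in hits}


def pairwise_overlap(gene_sets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Count genes shared by each pair of contig sets."""
    return {(a, b): len(gene_sets[a] & gene_sets[b])
            for a, b in combinations(sorted(gene_sets), 2)}


if __name__ == "__main__":
    # Invented example hits for two contig sets.
    set_a = [Hit("a1", 350, "P1", 1e-20), Hit("a2", 150, "P2", 1e-30),
             Hit("a3", 500, "P3", 1e-6), Hit("a4", 250, "P1", 1e-12)]
    set_b = [Hit("b1", 400, "P1", 1e-15), Hit("b2", 300, "P4", 1e-11)]
    genes = {"A": unique_subjects(filter_hits(set_a)),
             "B": unique_subjects(filter_hits(set_b))}
    print(pairwise_overlap(genes))
```

Calling `filter_hits` with `max_evalue=1e-5` and `min_length=0` would give the moderate cutoff used for comparison in measure a).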
After evaluating the results of the above-mentioned methods, we concluded that the best-quality contigs were those assembled by NGEN. We therefore produced an overview graph of contig average coverage versus contig length, and the gene ontology annotation (see below), for this contig set only. Contig average coverage (ac) was calculated as the summed length of all sequences building up one contig (ssl) divided by the contig length (cl), thus ac = ssl/cl.
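As a small worked example of ac = ssl/cl, the calculation can be written as the Python sketch below. The function name `average_coverage` is hypothetical, and the read lengths in the example are invented for illustration.

```python
from typing import Sequence


def average_coverage(read_lengths: Sequence[int], contig_length: int) -> float:
    """Return ac = ssl / cl for one contig.

    ssl is the summed length of all sequences building up the contig,
    cl is the contig length.
    """
    if contig_length <= 0:
        raise ValueError("contig length must be positive")
    ssl = sum(read_lengths)
    return ssl / contig_length


if __name__ == "__main__":
    # Ten reads of 101 bp assembled into a 505 bp contig.
    print(average_coverage([101] * 10, 505))
```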