A combined strategy involving Sanger and 454 pyrosequencing increases genomic resources to aid in the management of reproduction, disease control and genetic selection in the turbot (Scophthalmus maximus)

Background Genomic resources for plant and animal species that are under exploitation primarily for human consumption are increasingly important, among other things, for understanding physiological processes and for establishing adequate genetic selection programs. Current available techniques for high-throughput sequencing have been implemented in a number of species, including fish, to obtain a proper description of the transcriptome. The objective of this study was to generate a comprehensive transcriptomic database in turbot, a highly priced farmed fish species in Europe, with potential expansion to other areas of the world, for which there are unsolved production bottlenecks, to understand better reproductive- and immune-related functions. This information is essential to implement marker assisted selection programs useful for the turbot industry. Results Expressed sequence tags were generated by Sanger sequencing of cDNA libraries from different immune-related tissues after several parasitic challenges. The resulting database (“Turbot 2 database”) was enlarged with sequences generated from a 454 sequencing run of brain-hypophysis-gonadal axis-derived RNA obtained from turbot at different development stages. The assembly of Sanger and 454 sequences generated 52,427 consensus sequences (“Turbot 3 database”), of which 23,661 were successfully annotated. A total of 1,410 sequences were confirmed to be related to reproduction and key genes involved in sex differentiation and maturation were identified for the first time in turbot (AR, AMH, SRY-related genes, CYP19A, ZPGs, STAR FSHR, etc.). Similarly, 2,241 sequences were related to the immune system and several novel key immune genes were identified (BCL, TRAF, NCK, CD28 and TOLLIP, among others). The number of genes of many relevant reproduction- and immune-related pathways present in the database was 50–90% of the total gene count of each pathway. In addition, 1,237 microsatellites and 7,362 single nucleotide polymorphisms (SNPs) were also compiled. Further, 2,976 putative natural antisense transcripts (NATs) including microRNAs were also identified. Conclusions The combined sequencing strategies employed here significantly increased the turbot genomic resources available, including 34,400 novel sequences. The generated database contains a larger number of genes relevant for reproduction- and immune-associated studies, with an excellent coverage of most genes present in many relevant physiological pathways. This database also allowed the identification of many microsatellites and SNP markers that will be very useful for population and genome screening and a valuable aid in marker assisted selection programs.


Background
The turbot (Scophthalmus maximus) is a flatfish with increasing commercial relevance in Europe with a current annual production of~10,000 tones [1] with an increasing consumer demand worldwide. Thus, turbot production significantly increased in Northern China during the last decade. However, fish disease outbreaks collapsed its production in 2006, with economic losses estimated to amount several hundred million Euros [2,3].
It seems clear that one of the major concerns for turbot aquaculture is disease control. Intensive culture conditions in fish farms favors the proliferation of pathogens and the consequent economic losses associated with disease outbreaks. Hence, a comprehensive knowledge of the immune system of commercially important fish species is required [4]. The immune-prophylactic control of fish diseases through vaccination, probiotics and immunostimulation has been undertaken since long ago [5][6][7], whereas genetic programs on disease resistance, specifically in turbot, clearly require further investigation. Obtaining resistant broodstock is an appealing solution to control diseases in front of the economic cost of vaccines, treatments and the possible generation of resistances against antibiotics.
Another major concern for the aquaculture industry is fish reproduction. Like in other vertebrates, reproduction in turbot is controlled by the brain-pituitary-gonad axis, which integrates environmental signals and controls the production and secretion of the major hormones involved in controlling the reproductive cycle, including the onset of puberty [8,9]. Furthermore, turbot exhibits one of the largest cases of sexual dimorphism for growth rate in favor of females among aquacultured species [10]. Therefore, there is an interest in the turbot aquaculture industry to produce stocks with as many females as possible in order to increase biomass. Gonad development is a complex biological process in which an undifferentiated bipotential gonad is transformed into either a testis or an ovary [11,12] according to sex determination and differentiation [11,13]. External factors such as temperature, pH or social behavior can directly influence gonadal development in some fish [14,15] and, consequently, affect sex ratio. Understanding the process of gonadal development can greatly aid in the control of sex ratios in finfish aquaculture. However, in turbot there is a lack of information of genes involved in reproduction and their interactions. The induction of gynogenesis suggested a XX/XY system of sex determination [16], but later studies involving the analysis of progenies from sexreversed parents revealed a ZW/ZZ system [17]. Linkage maps were developed [18][19][20][21] and led to the identification of the major sex-determining region [22] and facilitated the characterization and mapping of sex-associated markers [23,24], although the sex determining gene(s) is (are) still unknown.
Despite recent increases in the number of Expressed Sequence Tags (ESTs) for flatfish [25][26][27][28], their resources are still limited when compared to those available for salmonids (25,781 vs 851,722; GenBank). Particularly in turbot, only 12,427 entries were found but this number was recently raised up to 55,404 contigs [29]. During the last few years, efforts have been done to create a comprehensive turbot database with a large number of gene sequences available based on the immune response to the most common different pathogens of industrial relevance [30][31][32]. These include Aeromonas salmonicida subspecies salmonicida, a bacterium capable of causing 100% mortalities in just 7 days after challenge [33], and the parasites Philasterides dicentrarchi and Enteromyxum scophthalmi, which are responsible for severe fish outbreaks [34,35]. Therefore, the first Turbot database was originally created with almost ten thousand high-quality ESTs sequences [36] (Table 1). From this database, a first custom oligo-microarray (8X15 Agilent) was successfully designed [37] and applied for evaluating expression profiles of genes involved in defense against pathogens [38,39].
Next Generation Sequencing (NGS) strategies have positively affected genetics research over the last few years and their advantages have been applied to many research fields. 454 FLX Titanium is a massive pyrosequencing strategy which generates medium-size single reads uncovering large amounts of DNA sequences providing much deeper sequencing coverage than it is possible with conventional Sanger sequencing [40]. Sequencing small subsets of the genome such as the transcriptome is an attractive alternative for gene discovery in species whose genome is still not available, and fish are not an exception. For example, in guppy (Poecilia reticulata) the sequencing of a total of 336 megabases (Mb) produced the first reference transcriptome for this fish species [41]. In the selffertilizing hermaphroditic mangrove Rivulus, Kryptolebias marmoratus, the identification of more than 150,000 sequences provided the first insights on the mechanisms underlying the response to environmental stresses and chemical toxicities [42]; and in the gilthead sea bream (Sparus aurata), the fast skeletal muscle transcriptome was described [43]. In particular, the reproductive system of the lake sturgeon (Acipenser fulvescens) has also been studied by resorting to modern pyrosequencing and it has been useful for the discovery and evaluation of candidate sex-determining genes and xenobiotic-responsive genes in the gonads [44,45].
Another approach to improve the aquaculture production is based on the application of molecular markers such as microsatellites or simple sequence repeats (SSRs) and SNPs. These markers are the basis for genetic mapping and comparative genomic analysis, which are in turn used for detection of quantitative trait loci (QTL) and for marker assisted selection (MAS) programs [46,47]. Several types of genetic markers have been developed and investigated in turbot [48] and many of them have already been mapped [18][19][20][21]. Recently, a genome scan for sex-determination [22] and resistance/survival to A. salmonicida [49] and P. dicentrarchi [50] using the genetic map identified several consistent QTLs and associated markers in turbot, which suggests the existence of genetic factors underlying these characters and supports their application in genetic breeding strategies. The advents of new high-throughput sequencing technologies, which produce extensive sequence data, are providing new opportunities to increase the amount of molecular markers, as demonstrated in the sturgeon, where hundreds of SNPs were discovered [44].
Overall, the improvement of the turbot aquaculture industry by selecting, on one hand, the most resistant broodstock and, on the other hand, female-biased batches is a priority challenge. The purpose of this study was to increase turbot database information for genes related to the immune and reproductive systems by creating a powerful tool for genomic research in this species. The turbot database was updated with genes obtained both by Sanger sequencing from immune-related tissues after challenges with the myxozoan parasite E. scophthalmi and by a 454 FLX Titanium run from gonad and brain-hypophysis at different stages of development. Description and comparison of the two sequencing strategies, annotation procedures, and construction of a larger database, the support for microsatellites and SNP discovery, and for designing a pilot-microarray platform, are presented.

Results and discussion
The increase of known immune-related genes in turbot by Sanger sequencing The progression in the construction of the turbot database is summarized in Table 1. First, the Turbot 1 database was created from almost ten thousand high-quality EST sequences from three cDNA libraries of three immune relevant organs (liver, head kidney and spleen) generated from turbot infected with A. salmonicida subspecies salmonicida and P. dicentrarchi, as well as from non-infected fish [36]. The Turbot 2 database included several resource sequences: i) 1,371 sequences from seven microsatellite-enriched DNA libraries from muscle tissues [51]; ii) 3,339 ESTs available in public databases [52], which were loaded on the turbot database and clustered with the set of the existing EST; and iii) Sanger sequencing data from two new cDNA libraries generated from several immune tissues (liver, head kidney, spleen,  [53], but it constitutes an appropriate approach to obtain a first picture of the immune response [36]. A total of 6,053 out of the 6,170 unique sequences (98.1%) in Turbot 2 database displayed significant matches with sequences available in public databases with E-values equal or less than 1,00E-5 (6,049 with BLASTN and 2,116 with BLASTX). Gene Ontology (GO) annotation classified sequences as follows: 586 in Biological Process (BP), 472 in Cellular Component (CC) and 692 in Molecular Functions (MF).
454 pyrosequencing of the turbot brain-hypophysis -gonad axis transcriptome The Turbot 2 database described above contained 17,626 sequences, many of them related to the immune response. However, further exploration of this database revealed that sequences related to reproduction, the other major issue for turbot farming, were underrepresented. In order to obtain more sequences of genes related to sex phenotype and reproduction control, and for isolation of ESTassociated genetic markers, a 454-pyrosequencing run was performed from the brain-hypophysis-gonadal axis by using tissues of 30 turbot individuals at different stages of sexual development. Table 2 summarizes the statistics of the turbot pyrosequencing normalized library. Raw data generated 2,762,845 sequences. These sequences were filtered using Roche's software with default settings. After filtration, 1,191,866 sequence reads (341.2 megabases, Mb) were obtained with an average length of 286 bp. Sequences were assembled into 65,472 contigs (and 172,108 singletons) with a mean length of 625.9 bp (median = 499 bp; mode = 365 bp). About half of these contigs (32,612) were longer than 500 bp and their distribution by range was the highest for the 200-499 bp length, followed by the 1-199 bp length and finally by the 500-999 bp length ( Figure 1A). The average depth coverage per contig was of 4.6 sequences (SD: 11.7; Figure 1B). Reads obtained in this high throughput sequence analysis have been submitted to the NCBI Sequence Read Archive under accession number SRA056483. Table 3 shows the top 20 longest contigs obtained from the 454 run with their annotation. They ranged from 3,550 bp to 5,012 bp and their average coverage depth per nucleotide ranged between 4.3 and 33.2. Cytochrome c oxidase subunit 3 was the longest contig. Table 4 shows the top 20 contigs with the deepest coverage (253.2-578.7). Although a normalized library was used, most contigs with the deepest coverage corresponded to protein ribosomal genes. However, genes involved in the reproductive system such as the histone deacetylase complex or the epididymal secretory protein, which is highly expressed on the surface of ejaculated spermatozoa [54], were also present.
About half of the contigs obtained in the 454 run were successfully annotated and classified into Gene Ontology categories. More precisely, contigs exclusively obtained by the 454 run were functionally classified in the BP (8,390), CC (7,081) and MF (10,026) categories.

Creation of the turbot 3 database
The sequencing strategies used, i.e. traditional Sanger and high throughput 454, yielded a high amount of transcriptomic sequences both from immune and reproductive systems in turbot. With all the information generated, a new Turbot 3 database was created and stored in a web-based portal for exploitation, first by the consortium participating in this project and then publically once the project is finished by the end of 2013. Cap3 software was used to assemble the sequences coming from all Sanger-based libraries (17,626) Figure 2B). The number of sequences exclusively assigned to each functional category was 2,417 for BP, 828 for CC and 4,328 for MF. Most significant BLAST-hits were obtained against a small number of species represented in public databases including model fish species (Danio rerio, Tetraodon nigroviridis, Oryzias latipes, Takifugu rubripes and Gasterosteus aculeatus), cultured fish species (Oncorhynchus mykiss and Salmo salar) and two mammalian species (Mus musculus and Homo sapiens) (Figure 3). G. aculeatus was the highest represented species followed by a group including T. rubripes, O. latipes and T. nigroviridis, all these species and turbot belonging to the Acanthopterygii superorder [55]. Figure 4 summarizes the number of sequences representing the different 2nd level GO terms in the Turbot 3 database. Cellular process (7,676 sequences) and Metabolic process (6,627 sequences) were the most represented categories within BP terms ( Figure 4A), but categories related to immune function had also a high representation: Response to stimulus (1,482 sequences), Viral reproduction (612 sequences), Immune system process (176 sequences) and Death (149 sequences). The reproductive system was also represented by the Reproduction (453 sequences) and Reproductive process (448 sequences) categories, and to a lower extent by Growth and Cell proliferation. Cell and Cell parts categories (7,995 sequences each) followed by Organelle (4,006 sequences) were the highest represented within CC terms ( Figure 4B). Finally, within MF terms Binding (7,813 sequences) and Catalytic activity (5,523 sequences) were the most represented categories followed by Transporter activity (887) and Structural molecule activity (830) ( Figure 4C).

Identification of genes related to the immune response
The knowledge of the immune system of fish has greatly increased recently. However, there are still many fish diseases which produce important losses to industry because still there is no an effective strategy for their control, including vaccines. The immune system of fish is composed of non-specific and specific immune defenses, being the first more important than in higher vertebrates [56,57]. Examples of innate immunity include anatomic barriers, mechanical removal of pathogens, bacterial antagonism, pattern-recognition receptors, antigen-nonspecific defense compounds, the complement pathway, phagocytosis, and inflammation [58]. In the present study, the main organs of the immune system of fish such as head kidney (equivalent to the bone marrow of mammals), spleen and thymus (often neglected in fish immune transcriptomics) were included. In addition, other organs such as the liver, a multifunctional organ with innate immune functions in  mammals [59] and poorly studied in fish, and the pyloric caeca, the target organ of the myxozoan parasite, which also plays a role in immunity [60], were included as well.
Next-generation pyrosequencing has become an important tool for transcriptomic studies, enabling the identification of new immune molecules that are expressed upon activation of the immune response. A remarkable recent example is the study of the liver transcriptome of orange-spotted grouper (Epinephelus coioides) after virus infection [61]. It seems very likely that developments related to fish immunology will have a significant impact for obtaining a new generation of vaccines against diseases. A disadvantage of turbot is that neither the genome nor the complete transcriptome are available yet and, therefore, important information about immunity and stress related genes and their expression is lacking. Many genes were identified previously in turbot using classical Sanger sequencing in response to A. salmonicida and P. dicentrarchi [36], Vibrio harveyi [62] and nodavirus [52]. However, the number of genes related to the immune system in this species remained low. Recently, Pereiro et al. [29] used 454-pyrosequencing after different immune stimulations to provide a rich source of data to improve the knowledge of S. maximus immune transcriptome. Their results revealed a large number of contigs and singletons with potential immune function in turbot and identified many of the proteins involved in the main immune-pathways in humans, showing the potential of pyrosequencing. Although our 454 run was not specifically from immunerelated tissues, after combining the Sanger and pyrosequencing data, a significant number of genes associated to essential functions directly or indirectly related to innate and acquired immunity were detected in the Turbot 3 database (2,241 of the 23,661 annotated sequences; see Additional file 1). Most of the immune-related sequences (1,873) were derived exclusively from the 454 run and only 149 and 219 sequences from Sanger or mixed Sanger-454, respectively. We found several novel genes, including components or family members related to acute phase response and inflammation, stress and/or defense response and in the coagulation cascade. Many of the genes shown in the immune pathways presented by Pereiro et al. [29] (complement, Toll-like receptor signaling, B cell receptor signaling, T cell receptor signaling and apoptosis cascade) could be identified, but also some other important immune genes were identified here for the first time in turbot, a selection of which is shown in Table 5. Relevant examples include DFF40 subunit, a substrate for caspase-3, which triggers DNA fragmentation during apoptosis; BCL-XL, an anti-apoptotic protein; TRAF2, which regulates activation of NF-kappa-B and JNK, playing a central role in the regulation of cell survival and apoptosis; TRAF6, which mediates signaling from members of the TNF receptor superfamily as well as the TOLL/IL-1 family; IRAK1, which plays a critical role in initiating innate immune response against foreign pathogens; JNK, involved in cell differentiation and proliferation, neurodegeneration, inflammatory conditions and AP-1-mediated cytokine production; TOLLIP, which is involved in the turnover of IL1R-associated kinase, interleukin-1 receptor (IL1R) trafficking and regulation of the imflammatory signaling; FYN, which participates in the downstream signaling pathways that lead to T-cell differentiation and proliferation following T-cell receptor (TCR) stimulation; NCK, which plays a pivotal role in the T cell receptor (TCR)-induced reorganization of the actin cytoskeleton and the formation of the immunological synapse; DLG1, which is involved in lymphocyte activation; COT, which promotes the production of TNF-alpha and IL-2 during T lymphocyte activation; CD28, involved in T-cell activation; GADS, involved in B and T cell activation; and GRB2, which provides a critical link between cell surface growth factor receptors and the Ras signaling pathway. The obtained data complements and expands the known spectrum of immunity related genes and provides a valuable platform for more detailed analyses of the immune response in fish in general a turbot in particular. Several immune related pathways were also identified in the Turbot 3 database. Chemokine signaling is an important immune pathway due to the fundamental role of chemokines in providing directional cues for the trafficking of leukocytes to sites of inflammation but also it has been implicated in dendritic cell maturation, macrophage activation, neutrophil degranulation, B cell antibody class switching, and T cell activation. The data infers that chemokines influence both the innate and acquired phase of an immune response to host insults. Thus, the protein richness of this pathway in the Turbot 3 database was described (see Additional file 2). Most members intervening in this pathway were identified showing the usefulness of the Turbot 3 database for gene discovery.

Identification of genes related to reproduction
To date, fish gonad-related ESTs are poorly represented in public databases. A first attempt to identify genes related to gonad development in male and female turbot was carried out by cDNA-AFLP technology and several specific sequences could be identified [63]. However, the amount of information presently available is still scarce and thus a small number of sex-specific sequences have been identified. Here, the use of the 454 FLX Titanium sequencing allowed obtaining a large number of gene sequences (65,472 contigs) and their subsequent assembly and gene annotation led to the identification of a total of 1,410 annotated sequences related to reproductive function. This means that sequences corresponding to many genes of the brain-hypophysis-gonad axis, expressed first during the process of sex differentiation and then during gonadal maturation, have been identified (Additional file 3). Functional annotation terms classified all those sequences in a total of 8,425 GO terms. This is the first time that sex-related genes have been massively identified towards understanding gonad development and maturation in the turbot.  Table 6 shows 20 relevant genes with a well-known function in the reproductive system, most of them identified for the first time in turbot. The annotation, as shown by the low E-values, was reliable for all of them. Some of these are genes involved in testicular development, such as the androgen receptor-alpha (AR), Müllerian inhibiting substance (AMH), SRY-related genes containing a HMG box (SOX), Spermatogenesis associated 13 or steroid 11-β -hydroxylase. AR has been cloned in several cultured fish species like the rainbow trout (Oncorhynchus mykiss) [64] and the European sea bass (Dicentrarchus labrax) [65]. AR mediates the androgen effects by binding to specific DNA recognition sites and regulating the transcription of many different processes [66]. In fish, AMH expression levels are consistently higher in males than in females during sex differentiation, suggesting that this factor plays an important role in testicular development [67]. The SOX gene family encodes an important group of developmental regulators, involved in sex determination in fish [68,69] and other key processes such as development of the central nervous system [70]. Furthermore, the important transcription factors SOX6 and the SOX9 were identified.
Another group of identified genes in turbot is involved in ovarian development. This is the case of the cytochrome P450 aromatase (CYP19A), Zona pellucida glycoprotein (ZPG) or the Zygote arrest protein. CYP19A is a key enzyme in the hormonal steroidogenic pathway that mediates the conversion of androgens into estrogens [71], with two isoforms with specific regulation and tissue distribution [72,73] and it has been cloned in several cultured fish species like European sea bass [74]. Higher expression of CYP19A is found in female gonads when compared to male gonads from early development and recently it has been shown that different methylation levels of its promoter are related to temperature during the thermal sensitive period [75]. ZPGs are glycoproteins found in the fish chorion, which mediate speciesspecific sperm binding [76]. ZPGs are encoded by multiple gene families and here several of them have been identified (ZPG2, 3, 4). The steroidogenic acute  [77]. Here, we were able to identify two different START genes, START5 and START 7. Gonadotropins (GtHs) control the complex endocrine system that regulates gonadal growth, sexual development and reproductive function, and are secreted by the hypophysis [78]. Three forms of GnRH in the brain and pituitary of the turbot have been identified so far [79]. One of the main gonadotropins in vertebrates, as well in fish, is the follicle-stimulating hormone (FSH). Its receptor, FSHR, is found in male and female gonads and although cloned in other cultured fishes such as the sea bass [80], here is it identified for the first time in turbot.
Not only genes expressed in somatic cells but also genes expressed in the germ cell line were present in the Turbot 3 database. VASA plays an important role in germ cell determination and development and is an essential factor for primordial germ cell formation and migration to the germinal ridge [81]. In fish, VASA was first cloned in zebrafish [82] and later in rainbow trout [83] but also in turbot (JX235364). Table 7 shows the pathways related to reproduction identified in the Turbot 3 database with more than 50% coverage. Here, pathways are ranked based on the number of genes included in the database with respect to the total number of genes present in each pathway: Oocyte meiosis (dre04114), Circadian rhythm (dre04710), mTOR signaling pathway (dre04150), ErB signaling pathway (dre04012), Progesterone-mediated oocyte maturation (dre04914), GnRH signaling pathway (dre04912), Insulin signaling pathway (dre04910), Androgen and estrogen metabolism (dre00150), Steroid biosynthesis (dre00100), Wnt signaling pathway (dre04310), and Notch signaling pathway (dre04330). Additional file 4 shows the Progesterone-mediated oocyte maturation pathway highlighting the presence and absence of proteins in the Turbot 3 database. The exposure to either insulin-like growth factor 1 (IGF-1) or the steroid hormone progesterone breaks oocyte meiotic cell division arrest and induces meiosis resumption and therefore the transformation of the oocyte into a mature, fertilizable egg [84]. Oocyte maturation is also dependent on the activation of a cascade of genes, which activate the MAPK (mitogen-activated protein kinase) signaling pathway [85]. The key activity driving meiotic progression is the maturation-promoting factor (MPF ), a heterodimer of CDC2 (cell division cycle 2 kinase) and CYCLIN-B [86].
In fish, the importance of some of the genes involved in the oocyte maturation pathway has been described so far [87]. Here, 79 out of 105 genes belonging to this pathway were found, showing the coverage of the generated Turbot 3 database. Overall, our results show that the approach followed was successful since most of the well-known reproductionrelated genes found in other species have been also identified in turbot essentially at once.

Genetic markers
An important emerging application of high-throughput 454 sequencing is the identification of molecular markers from genomic DNA. In fact, recent studies have identified 26 polymorphic microsatellite by pyrosequencing in an endangered fish species of China [88] and 21 microsatellites loci from the threatened freshwater Yarra pygmy perch (Nannoperca obscura) [89]. However, few studies have been conducted to search for cDNAassociated microsatellites, like those identified in the Atlantic herring (Clupea harengus) [90], despite the potential for targeting candidate genes [91].
Due to their location within genes, EST-SSR markers frequently display a high degree of transferability between related species, thus facilitating comparative genomics strategies with model species. In addition, high sequence coverage in principle allows the assessment of variability in silico, aiding for selection of polymorphic markers. We searched for new microsatellite markers within our sequence database to identify sequences with different repeat motifs. Our search revealed 993 sequences containing 1,237 new SSRs identified from 52,427 sequences, with 394 EST sequences containing at least two SSRs. Of these, 759 showed significant hits in BLAST with an E-value cutoff of ≤ 1,00E-5 and, thus, were annotated. The frequency of EST-SSRs observed in the turbot transcriptome was 1.9%, and the distribution density was 1.48 microsatellites per Mb. SSR motifs were identified using criteria based in a minimum number of repeats for di-, tri-, tetra-or pentanucleotide motifs (see Methods section). Similar to other vertebrate genomes, the most abundant repeat type was AC (20.93%, 259) followed by AAG (15.04%, 186), AGG (10.51%, 130), AGC (9.30%, 115), and AG (6.87%, 85). The frequency of microsatellites was inverted regarding the length of the motif, dinucleotide microsatellites being the commonest ones and pentanucleotides the less abundant (Table 8). In addition, those microsatellites with a lower number of repeats were more frequent than those with a higher number of repeats, the most common class being n = 4 (438 loci: 35.41%). Further, 12.53% of loci contained more than 10 repeat units. All the new microsatellite-containing ESTs showed sufficient flanking sequence length for primer design, and 5,609 polymorphisms of them appeared polymorphic after in silico analysis.
A total of 7,362 SNPs were detected in 1,040 of the 9,495 contigs using the three filters set in the QualitySNP pipeline [92]. Only clusters with at least 4 EST sequences (2,456) were selected to minimize the detection of SNPs caused by sequencing errors. On average, one SNP per 196 bp was identified, which is a frequency in the order of that estimated in non-model species [93]. The types of detected SNPs according to different criteria are summarized in Table 9. Among them, 2,223 were transitions, 2,404 transversions and 1,578 indels. In addition, the majority of SNPs were detected in contigs involving a large number of sequences, which provides an additional support for their confidence. The large amount of potential molecular markers found in this study will enable more detailed population and applied genomic studies. Since these new markers are linked to genes, they will be useful as Type I markers for population genomics screening in this species and for comparative mapping and fish evolutionary studies [94].

Pilot microarray and identification of natural antisense transcripts
To date, several custom microarrays have been designed in several non-model fish species. Examples exist in rainbow trout [95] gilthead sea bream [96], European sea bass [97], Atlantic salmon (Salmo salar) [98], common carp (Cyprinus carpio) [99] or Senegalese sole (Solea senegalensis) [26], but also in the turbot [37]. In the present study, samples from the reproductive and immune tissues were used to characterize their transcriptome using different sequencing strategies and de novo assembly to identify a large number of genes previously unknown in turbot. The assembled data present in the Turbot 3 database was the basis to construct a pilot microarray towards a new gene-enriched updated version. One of the drawbacks of 454 sequencing technology is that it may produce false annotations of genes [100,101], and since sequencing is not oriented as in cDNA libraries used for Sanger sequencing, it is not possible to know the DNA sense strand of a gene unless it is confidently annotated. To solve these problems, and in order to identify the most reliable oligos for a definitive turbot microarray, a pilot microarray was developed. In this pilot microarray, oligos were designed both in forward and reverse sequence orientation. In addition, several filtration criteria were followed to analyze microarray data (see Methods). This strategy allows, on one hand, to identify the sense strand of the nonannotated sequences, but also to identify false annotation of genes. On the other hand, this procedure also allows studying the frequency of putative natural antisense transcripts (NATs) in turbot transcriptome [97].
The importance of NATs, which can regulate eukaryotic gene expression, has emerged in the last decade [102]. A NAT is a single-stranded RNA sequence complementary to messenger RNA and includes various classes of short RNAs including micro RNAs (miRNAs), promoter-associated transcripts and long non-protein -coding RNAs [103,104]. The amount of NATs in eukaryotic cells remains unclear. It had been reported that over 20% of human transcripts might form senseantisense pairs [105], but large-scale cDNA sequencing suggested that antisense transcription is more common than previously thought [106]. Recently, it has been shown that up to 72% of the transcripts had antisense partners in human and mouse transcriptomes [107,108]. High-throughput sequencing strategies have revealed a plethora of non-protein-coding transcripts from both genic and intergenic regions [40]. Data on miRNAs, one of the short NAT classes, has been already published in rainbow trout [109] and halibut Hippoglussus hippoglossus [110]. Due to their increasing importance, the study of NATs cannot be longer ignored in transcriptome studies.
The functionality of the oligos included in the pilotmicroarray was checked by hybridizing the same RNA used for the Sanger and 454 sequencing strategies. To analyze microarray data two filtration criteria were applied (see Methods). Once the first filtration process was completed 37,759 signals in forward and 33,489 in reverse oligos still remained (75%) ( Table 10). Then, a second filtration with two additional filtering criteria (threshold intensity <100 and cross-hybridization between oligos) was performed to select the best performing oligo probes. As seen after this additional filtration process (Table 10), among the 94,582 probes there were 53,534 (56.6%) with no signal (24,083 forward and 29,451 reverse). After the two rounds of filtration, a total of 41,048 remaining oligos (43.4%) yielded signal in at least one tissue (brain-hypophysis-gonad or immune tissues) or in both. As a result of this filtration strategy, the remaining oligos were selected to be included in the updated turbot microarray. In the development of a custom microarray for the European sea bass, a similar strategy was followed to study NATs expression [97]. Although a lesser amount of sequences was designed for this purpose (640 sequences), identification of NATs was also achieved [97]. It is remarkable that after the second filtration, 2,976 sequences (6.3%) still showed signal in both strands in both types of tissues (brain-hypophysis-gonad and immune tissues). These double hybridization signals could represent putative NATs found for the first time in the turbot transcriptome. miRNAs, are one of the most relevant short NATs classes and function as regulators of gene expression at the level of translation, with an essential input in developmental processes [111]. Due to their growing importance in regulating gene expression, several miRNA databases have been already created. In Table 11, we show a selection of ten miRNAs from those identified in the Turbot 3 database including their number of reads, which could be considered as a gross indicator of their expression level. To our knowledge, these miRNAs are the first to be identified in turbot. Further work is being carried out on the turbot database for developing a consistent bioinformatic pipeline for miRNA identification, as well as for their validation using a Q-PCR approach.

Conclusions
This is the first time that the transcriptome of the reproductive and the immune systems of turbot have been widely explored together. Both systems are essential for the survival of individuals and are of primary importance for commercial aquaculture. This study was designed to fill in the gap of genomic resources in turbot and therefore to improve available turbot sequence databases, specifically in genes related to reproduction. The large amount of generated sequences (52,427 putative transcripts) resulted in one of the most complete available databases for flatfish, with more than half of the resources annotated by both gene and functional category. The detailed and focused sequence assembly and gene annotation strategies allowed the identification of several genes involved in the immune and the reproductive systems, being most of them involved in key functions. A large amount of genetic markers was identified, providing new tools for genomic studies. The performance of an informative pilot microarray was assessed and identification of putative miRNAs was possible. Thus, NGS technologies represent an essential tool to increase exponentially genomic resources in nonmodel species, opening new insights for our understanding of key biological processes and addressing production bottlenecks in their aquaculture.

Sanger sequencing Experimental design and samplings
The E. scophthalmi infection trial was performed at the facilities of CETGA (Centro Tecnológico Gallego de Acuicultura; NW Spain). Naïve turbot (8484 recipients = R and 8484 controls = C) from a balanced mixture of five unrelated families with known pedigree, hatched and reared at a commercial fish farm were sent to CETGA facilities and acclimated to experimental conditions for 10 days before the beginning of the experiment. R and C fish were kept in separate tanks (7 tanks for each group) in two separated recirculating systems, with constant water temperature (19-20°C) and fed with commercial dried pellets. R fish were infected by oral intubation with intestinal scrapings containing E. scophthalmi stages obtained from infected turbot, for two consecutive days. C fish were maintained under equivalent conditions as R fish, but intubated with PBS instead. More details on this procedure can be found in a previous work [112].
The progression of the infection was monitored by sampling both C and R groups at different times post Samples of spleen, head kidney, thymus, liver and pyloric caeca were rapidly dissected, immediately frozen in liquid nitrogen and stored at −80°C until used for RNA extraction. At each sampling time, samples of each tissue from the different individual fish from each group (C, R) were pooled. The serial times of sampling provided tissues expressing different genes related to immune response from initial until late states of the infection.

RNA isolation, library preparation and sequence analysis
RNA extraction of samples from control and infected fish, cDNA library construction and sequencing were performed as described elsewhere [36]. Briefly, RNA was extracted using TRIZOL Reagent (Invitrogen, Carlsbad, CA, USA). Poly-A mRNA was isolated using the Dynabeads W mRNA Purification Kit (Invitrogen, Carlsbad, CA, USA). The two cDNA libraries (control and infected) were directionally constructed (5 0 EcoRI, 3 0 XhoI), with equal amounts of RNA from each tissue at each sampling time, using the ZAP-cDNA Library Construction Kit (Stratagene, La Jolla, CA, USA), except size fractioning that was performed with the SizeSep 400 Spun Columns (GE Healthcare, Uppsala, Sweden). Plasmid DNA was isolated from approximately 4,000 clones from each library using the DirectPrep W 96 Miniprep kit (QIAGEN, Valencia, CA, USA). Plasmid DNA was sequenced following the ABI Prism BigDye ™ Teminator v3.1 Cycle Sequencing Kit  protocol on an ABI 3100 DNA sequencer (Applied Biosystems, Foster City, CA, USA). All clones were sequenced from their 3 0 ends using a standard T7 primer to obtain the highest specific gene sequences for oligo-microarray design. Those clones that suffered a systematic drop on sequencing signal after poly-A tails were sequenced from the 5 0 end. Basecalling from chromatogram traces was performed by using PHRED [113,114].

pyrosequencing run
Reproductive tissue sampling and RNA extraction A total of 30 turbot samples were collected from CETGA from a mixture of unrelated genetic families. In order to obtain the widest possible range of expressed transcript sub-sets, tissues were dissected in fish at different stages of gonad development. The number, age and the mean values of biometry (standard length and body weight) for each animal group were the following: undifferentiated animals (n = 5; 90 days post hatch [dph], 5.2 ± 0.6 cm, 2.9 ± 0.9 g); differentiating animals (n = 4; 150 dph, 9.8 ± 1.3 cm, 21.8 ± 9.5 g); male juveniles (n = 4; 400 dph, 20.9 ± 4.2 cm, 195.0 ± 123.0 g); female juveniles (n = 5; 450 dph, 23.4 ± 3.6 cm, 264.1 ± 93.4 g); male broodstock (n = 3; 900 dph, 44.8 ± 3.6 cm, 2,059.0 ± 591.6 g) and female broodstock (n = 3; 900 dph, 47.7 ± 6.4 cm, 2,834.3 ± 1,264.9 g). Brain and hypophysis from broodstock animals were also dissected and rapidly flash frozen in liquid nitrogen. Gonads were fully isolated in adult and juvenile fish and thus gonadal tissue was devoid of any other tissue. However, gonads of sexually differentiating fish contained a bit of attached epithelium. Due to their extremely small size, the isolation of the gonads alone was not feasible and thus samples contained also portions of the surrounding tissues. RNA was individually extracted by RNeasy Mini Kit (Qiagen, Hilden, Germany) following the manufacturer's instructions. Quantity was determined using a Nanodrop spectrophotometer (Nanodrop Technologies, US). The RNA integrity number (RIN) was determined in an Agilent BioAnalizer (Agilent Technologies, US). RNA samples with a RIN > 8.1 were further processed for the sequencing run. A pooled sample was generated by mixing 70% of gonads containing equal amounts of RNA from each individual and 30% of equal amount of RNA from broodstock brains and hypophysis tissues.

cDNA library, normalization and 454 FLX Titanium pyrosequencing
Full-length-enriched double stranded cDNA was synthesized from 1.5 μg of pooled total RNA using the MINT cDNA synthesis kit (Evrogen, Moscow, Russia) according to the manufacturer's protocol, and was subsequently purified using the QIAquick PCR Purification Kit (Qiagen USA, Valencia, CA). The amplified cDNA was normalized using the Trimmer kit (Evrogen, Moscow, Russia) to minimize differences in representation of transcripts [115,116]. The single-stranded cDNA fraction was then amplified twice by sequential PCR reactions according to manufacturer's protocol. Normalized cDNA was purified using the QIAquick PCR Purification Kit (Qiagen USA, Valencia, CA). Normalized cDNA (5 μg) was used to generate a 454 library. cDNA was fractionated into small, 300 to 800 bp fragments and the specific A and B adaptors were ligated to both the 3 0 and 5 0 ends of the fragments and used for purification, amplification, and sequencing steps. Two and a quarter PTP regions were used for the GS-FLX sequencing run using Titanium chemistry. All reagents and protocols were from Roche 454 Life Sciences, USA. 454 data was processed with Roche's software, using default settings, to obtain fasta and quality files containing the trimmed sequence of all reads [117]. Contigs with at least 100 bp were recovered. Sequences were de novo assembled into contigs by running Mira v3.2.0rc1 in EST mode. Contigs less than 100 bp were filtered out and the rest was blasted against D. rerio RefSeq protein sequences with est2assembly's analyse_assembly.pl script [118] in order to validate the whole process.

Turbot databases
Bioinformatic tools were developed to process all sequencing data obtained from both Sanger and 454 FLX Titanium technologies. The starting point of the current work was the Turbot 1 database, which was reported previously [36]. In order to generate the Turbot 2 database sequences of Turbot 1 database (9,873) were clustered with: 3,043 sequences obtained from the E. scophthalmi trial cDNA libraries, 1,371 genomic sequences from enriched DNA libraries [51] and 3,339 sequences available in public databases [52], using CAP3 software (http://seq.cs.iastate. edu/) ( Table 1). The resulting ".ace" file was used to study coverage and construct user-friendly alignment views with Mview [119]. To construct the Turbot 3 database, the primitive sequences of Turbot 2 (17,626) were pooled with the 454 contigs (65,472) and then clustered using CAP3 software. The resulting contigs and singletons were annotated using AutoFact (http://www.bch.umontreal.ca/Software/AutoFACT.htm), BLASTN and BLASTX with databases nr, UniProt, UniRef, COG, KEGG, PFam, LSU and SSU. Results were uploaded to a MySQL database and a portal web was created.
To study the different pathways found in the Turbot 3 database the DAVID web tool was used [120,121]. After the selection of the pathways of interest, a list of reference genes was downloaded from the NCBI RefSeq database and BLASTed against the Turbot 3 database. A gene was considered present in our database if its reference sequence had a match with an e-value cut off ≤ 1,00E-5 and hit length ≥50. To make the colour pathway diagrams (Additional file 2 and Additional file 4) the KEGG mapper tool http://www.genome.jp/kegg/tool/map_pathway2.html was used [122,123]. Due to the lack of a D. rerio Chemokine signaling pathway in KEGG website the human version was used for Additional file 2. In Additional file 4, the Progesterone-mediated oocyte maturation pathway from D. rerio given by KEGG website is labeled as Xenopus oocyte. This label is kept in the figure.

Microsatellites and SNPs
For SSR and SNP detection, EST sequences were clustered with CAP3 using default parameters and the resulting ".ace" format assembly file was fed into the corresponding programs. The set of unique sequences was searched for microsatellites using the SPUTNIK program (http://espressosoftware.com/sputnik/). The minimum repeat number used for this search was six for dinucleotide and four for tri-, tetra-and pentanucleotide microsatellites. Microsatellite-containing ESTs were identified as candidates for marker development if they presented enough flanking sequences on either side of the repeats for primer design. Whenever possible, we selected three putative primers using the Primer3 software (http:// frodo.wi.mit.edu/primer3/). SNP detection was performed with contigs of at least four sequences using the QualitySNP program (http:// www.bioinformatics.nl/tools/snpweb/). This program uses three filters for the identification of reliable SNPs (see [92] for details). SNPs that pass filters 1 and 2 are called real SNPs and those passing all filters are called true SNPs. The resulting files were processed with our own custom Perl programs to extract relevant information. The obtained true SNPs were imported into a MySQL database (http://www.mysql.com). A user-friendly web access interface was designed so that contig graphs are clickable and the output visually refined with color-coded nucleotide views (http://bio-mview.sourceforge.net/). A graphical interface allowing for SNP database search by alleles, contig depth, and annotation was also established in our on line database. Searchable chromatograms for each of the Sanger sequences making up each contig were also included. It should be emphasized that SNPs detected with the help of bioinformatic pipelines are only putative and they should be technically validated.
To ensure identification of new molecular markers, sequences similar to GenBank deposited sequences were filtered out to avoid identification of already known SSR and SNP sequences, especially the ones previously identified by turbot [18,19,48,51,94,124,125].

Pilot-microarray platform
A custom 2 x 105 K array was printed with turbot sequences from the Turbot 3 database by Agilent Technologies (US). In order to study the orientation of the non-annotated sequences and their possible gene expression, false annotation of genes and identify possible NATs, oligos were designed in both orientations, forward and reverse. Oligo design was done by using Repeat Masker to eliminate low-complexity regions, and then OligoArray 2.1 software to do the design itself [126]. Cross-hybridization between oligos was checked by BLAST searches against the entire Turbot 3 database and oligos with ≥ 3 putative cross-hybridizations were removed. A total number of 96,292 oligos were printed and almost half of the array contained oligos (47,291) also designed with the opposite orientation. This pilot microarray also included all default positive and negative controls defined by the company (1,325 spots).

Microarray hybridization
The same samples of immune tissues used for library construction and Sanger sequencing and those from the brain-pituitary-gonad axis used for 454 sequencing were used for hybridization (in duplicate) with the pilot microarray. A total of four microarrays were used, two for the reproductive system and two for the immune system. Hybridizations were performed at the Universidad de Santiago de Compostela (USC) Functional Genomics Platform by the Agilent Technology Gene Expression Unit using a 1-colour labeling protocol. This method demonstrated very similar performances to the 2-colour protocol [97,127]. Briefly, 50 ng of total RNA were labelled using the Low Input Quick Amp Labeling Kit, One-Color (Cy3) (Agilent Technologies, USA). cRNA was prepared for overnight hybridization with the corresponding buffers during 17 h at 65°C and washed on the following day. Hybridized slides were scanned using an Agilent G2565B microarray scanner (Agilent Technologies, USA).

Pilot microarray data processing, filtration, and identification of NATs
The hybridization signal was captured and processed using an Agilent scanner (G2565B, Agilent Technologies, USA). The scanner images were segmented with the Agilent Feature Extraction Software (v9.5) using protocol GE1-v5_95. Extended dynamic range implemented in the Agilent software was applied to avoid saturation in the highest intensity range. Agilent feature extraction produced the raw data for further pre-processing. The processed signal (gProcessed-Signal) value was chosen as statistical for the absolute hybridization signal.
The filtration process was made in two steps. First, the features which did not conform with any of the following well established quality criteria were filtered: (1) nonuniform pixel distributed outliers and population replicate outliers according to the default Agilent feature extraction criteria; (2) features whose ratio between processed signal and their error was below 2; (3) spots not differentiated from background signal (as estimated for each spot); (4) features below the limit where the linear relationship between concentration and intensity was lost according to Spike-In information. The numbers of oligos filtered using this first step is shown in Table 10. Second, two additional filtering criteria were applied: (5) only features with intensity ≥ 100 fluorescence units were kept; (6) features likely to present cross-hybridization were filtered. Table 10 shows the numbers of oligos filtered using the complete filtration process.
For miRNA identification in the Turbot 3 database, a BLASTN search against the miRBase v.18 database (http://www.mirbase.org/) was used. The ten best matches were selected and are depicted in Table 11.
Statistical analyses were carried out with the statistical language R (2.13.1 version). The GOStats Bioconductor package (version 2.18) was used to perform the analysis of GO Terms.

Additional files
Additional file 1: List of all the immune-related genes found in the Turbot 3 database.
Additional file 3: List of all the reproduction-related genes found in the Turbot 3 database.