High-throughput sequencing and analysis of the gill tissue transcriptome from the deep-sea hydrothermal vent mussel Bathymodiolus azoricus

Background Bathymodiolus azoricus is a deep-sea hydrothermal vent mussel found in association with large faunal communities living in chemosynthetic environments at the bottom of the sea floor near the Azores Islands. Investigation of the exceptional physiological reactions that vent mussels have adopted in their habitat, including responses to environmental microbes, remains a difficult challenge for deep-sea biologists. In an attempt to reveal genes potentially involved in the deep-sea mussel innate immunity we carried out a high-throughput sequence analysis of freshly collected B. azoricus transcriptome using gills tissues as the primary source of immune transcripts given its strategic role in filtering the surrounding waterborne potentially infectious microorganisms. Additionally, a substantial EST data set was produced and from which a comprehensive collection of genes coding for putative proteins was organized in a dedicated database, "DeepSeaVent" the first deep-sea vent animal transcriptome database based on the 454 pyrosequencing technology. Results A normalized cDNA library from gills tissue was sequenced in a full 454 GS-FLX run, producing 778,996 sequencing reads. Assembly of the high quality reads resulted in 75,407 contigs of which 3,071 were singletons. A total of 39,425 transcripts were conceptually translated into amino-sequences of which 22,023 matched known proteins in the NCBI non-redundant protein database, 15,839 revealed conserved protein domains through InterPro functional classification and 9,584 were assigned with Gene Ontology terms. Queries conducted within the database enabled the identification of genes putatively involved in immune and inflammatory reactions which had not been previously evidenced in the vent mussel. Their physical counterpart was confirmed by semi-quantitative quantitative Reverse-Transcription-Polymerase Chain Reactions (RT-PCR) and their RNA transcription level by quantitative PCR (qPCR) experiments. Conclusions We have established the first tissue transcriptional analysis of a deep-sea hydrothermal vent animal and generated a searchable catalog of genes that provides a direct method of identifying and retrieving vast numbers of novel coding sequences which can be applied in gene expression profiling experiments from a non-conventional model organism. This provides the most comprehensive sequence resource for identifying novel genes currently available for a deep-sea vent organism, in particular, genes putatively involved in immune and inflammatory reactions in vent mussels. The characterization of the B. azoricus transcriptome will facilitate research into biological processes underlying physiological adaptations to hydrothermal vent environments and will provide a basis for expanding our understanding of genes putatively involved in adaptations processes during post-capture long term acclimatization experiments, at "sea-level" conditions, using B. azoricus as a model organism.


Background
Deep-sea hydrothermal vent ecosystems are driven by unique physical, geochemical and biological processes with specialized energy sources at the origin of the trophic web. Since the discovery of hydrothermal vents and their associated fauna in the Galapagos Rift, evidence of the establishment of dense faunal communities based on chemosynthesis have mounted over the past decades, and generally in relation to areas where tectonic movements and deep ocean volcanism are active [1]. Hydrothermal vent ecosystems are characterized by the synthesis of organic matter by means of chemoautotrophic bacteria using reduced elements extracted from the hydrothermal fluids as source of energy [2,3].
Mussels in the genus Bathymodiolus are biomass dominant at many known deep-sea hydrothermal vent and cold seep habitats. Survival in such extreme conditions requires unique anatomical and physiological adaptations. For example the development of specialized gill epithelial cells harboring methanotrophic and thiotrophic endosymbiont bacteria constitutes one the best recognized adaptation strategies to chemosynthetic environments [4]. Dual symbiosis thus provides a clear nutritional advantage to Bathymodiolid mussels, allowing them to obtain energy from both sulfide and methane at the vent sites [5][6][7]. Near the Mid-Atlantic Ridge, and in the vicinity of the Azores region, Bathymodiolus azoricus subsists at vent sites, amid unusual levels of heavy metals, pH, temperature, CO 2 , methane and sulfide, while coping successfully with environmental microbes [8].
Despite its prominence as a model to study physiological adaptation to extreme physical and chemical conditions [9], there is currently no large scale genome project for Bathymodiolus species. Gene expression profiles are limited to a few EST projects mainly originated from the Evolution and Genetics of Marine Populations team at the Biological Station of Roscoff, France. In a recent analysis, the screening of cDNA libraries from whole bodies of B. azoricus resulted in 362 contigs and 1,918 singletons. Many genes known to be involved in both metallic and oxidative stress responses were then identified [10]. However, these data remain private and there is currently no published reports based on those sequences. More recently, the effect of temperature on the vent mussel B. thermophilus was investigated by means of subtractive suppression hybridization experiments aimed at the identification of genes differentially expressed in response to different temperatures regimes [11].
Thus far, knowledge of deep-sea biology or of the molecules involved in the maintenance of homeostasis in hydrothermal vent animals has been limited in part by the lack of information about their genomes and systematic sequencing of expressed sequence tags to identify protein coding genes on a large scale. The deepsea vent biological systems represent the opportunity to study and provide new insights into the basic physiological principles that govern the defense mechanisms in vent animals and to understand how they cope with microbial infections. The problem of microbial threat and the need for immunity exist in both deep sea and shallow water bivalves however differences in the genes of marine organisms living in so distinct habitats are likely to occur. In order to significantly increase the number of B. azoricus genes in the public database and to discover new deep-sea vent adaptation-related genes in B. azoricus, and particularly for immune-related genes we have conducted a high-throughput experimental approach using pyrosequencing, on the 454 GS FLX (Roche-454 Life Sciences) with Titanium chemistry, to sequence the transcriptome of B. azoricus gill tissues. In the absence of a reference genome, this sequence method, which has not yet been widely applied to hydrothermal vent animals, holds great potential for discovery of genes and genetic markers in unconventional model species through de novo transcriptome sequence assembly. The assembled and annotated sequences were produced and have been organized in a dedicated database, accessible through the website, http://transcriptomics.biocant.pt:8080/deepSeaVent providing an extensive catalog of genes expressed in gill tissues harboring immune cells, the hemocytes, of the deep-sea vent mussel Bathymodiolus azoricus.

Sequence analysis
A cDNA library was constructed from mRNA of fresh gill tissues from Bathymodiolus azoricus collected at deep-sea vents and sequenced in a single GS FLX Titanium plate. A total of 778,996 raw nucleotide reads were produced with an average length of 283 bp, corresponding to 223.7 Mb. After removal of the SMART adaptors, by a custom script, the sequences were assembled with the MIRA software. A total of 582,650 quality reads were assembled into 75,407 contigs of which 3,071 were singletons corresponding to a total of 38,8 Mb. The length of the consensus sequences varied from 40 bp to 3,400 bp, with an average length of 509 bp. A summary of data is indicated in Table 1. 3,416 contigs were longer than 1 Kb, and 270 longer than 1.5 Kb. The distribution of contig length and EST assembly by contig are shown in Fig 1.

Functional annotation
The contigs were annotated taking into consideration the identity of the translation frame, the conserved protein domains and Gene Ontology (GO) terms. To obtain the translation frame, all contigs were queried against the NCBI protein database (nr) using BLASTx algorithm [12], resulting in 18,407 significant matches for an E-value ≤ 10 -6 . The process was repeated for contigs without hits, increasing the E-value cut-off to 10 -2 , and resulting in the identification of 3,616 additional contigs. The remaining contigs were processed through the ESTscan application [13], from which it was possible to identify the coding frame for 17,402 additional contigs. A total of 42,073 putative amino-acids sequences was obtained following this approach. The entire set of amino-acid sequences hits was queried against the Inter-Pro database of protein families and functional domains http://www.ebi.ac.uk/InterPro [14,15], from which 15,839 were identified as bearing conserved protein domains. The same set of sequences was annotated with Gene Ontology (GO) terms resulting in 9,584 functional assignments according to the organizing principles of GO describing gene products and their properties. This classification scheme was useful to assign Bathymodiolus contigs to one of the major GO annotation categories, i.e., Biological Processes, Cellular Components and Molecular Functions in a species-independent manner [16] (Fig 2).
The contigs without any homology may correspond to one of the following categories: a) novel or diverged amino acid coding sequences that are specific to Bathymodiolus species, b) represent mostly 3' or 5' untranslated regions (UTRs) that would lack protein matches as they are non-coding or c) contain sequences to short to result in significant hits.
Despite that the Gene Ontology project is aimed at describing gene product characteristics and gene product annotation data, by ways of a direct acyclic graph (DAG) structure of controlled vocabularies, the process of assigning GO terms to gene products might prove difficult given the current GO structured vocabulary and the systematic relationship between the GO terms. This follows from the fact that GO describes how gene products behave in a cellular context and thus a gene product might be associated with or located in one or more cellular components; it is active in one or more biological processes, during which it performs one or more molecular functions.
Immunity, stress and bacterial genes in Bathymodiolus azoricus gills The innate immune system acts to protect the individual from invasive bacteria, viruses, and eukaryotic pathogens by detecting molecular signatures of infection and initiating effector responses. It probably originated early in animal phylogeny and is intimately related with wound healing and tissue maintenance functions among invertebrates [17]. This defense system is essential for the survival of invertebrates which do not possess immunoglobulins and rely on unique capabilities to detect and respond to microbe associated molecular patterns (MAMPs) such as lipopolysaccharides (LPS), lipoteichoic acids, lipoproteins, peptidoglycan (PGN) and (1 3) β-D-glucans [18] present on the cell surface of microorganisms. Upon recognition of the MAMPs, signal transduction molecules and transcription factors are set to mediate and control the expression of immune effector molecules, for example, the antimicrobial peptides. The vent mussel B. azoricus has been the subject of previous studies, in our laboratory, *contigs without BLASTx matches at an E-value cut-off of 10 -6 were queried again with BLASTx with an E-value cut-off of 10 -2 . ** The difference between the number of transcripts and total number of amino-acid sequences is due to the possibility of a contig having more than one annotated protein hit.
aiming at the characterization of its cellular and humoral immune reactions as a means to better understand physiological adaptations to vent environments. Given the limited immune gene expression information available for this non-conventional model organism, a query-based strategy was used to identify coding sequences within the DeepSeaVent database, for the selection and retrieval of nucleic acid information of cDNAs corresponding to genes potentially involved in immune and inflammatory reactions. Therefore, the selection of candidate genes presented in Table 2 was based on queries using specific descriptors matching immunity and inflammation-related criteria allowing thus the identification and retrieval of relevant cDNA sequences from the transcriptome library.
The identification of putative genes was based on GO annotation and querying the DeepSeaVent database A number of putative genes which have not previously been identified in B. azoricus were classified within the context of GO representation and to one of the following immune categories: immune recognition, signal transduction, transcription and effector molecules    (Fig 3), providing thus a classification resource for the investigation of specific processes, functions or cellular structures involved in animal physiological responses, particularly immune and stress-related responses, as demonstrated for the de novo analysis of the Acropora millepora larval transcriptome [19]. Subsequently, the physical counterpart of these putative genes was confirmed by semi-quantitative Reverse Transcription-PCR, using primers specifically designed on the basis of the newly identified EST sequences and using the same source of cDNA as for 454 pyrosequencing. Consequently, 454 sequencing artefacts could be discarded on the assumption that PCR amplifications of partial cDNA of the candidate genes were successfully obtained. Even though the transcriptome library was normalized, it is still qualitatively possible to visualize  Figure 3 Categorization of putative immune genes. A proposed categorization of immune genes is illustrated, according to Gene Ontology terminology, into four functional classes of innate immunity constituents from B. azoricus: immune recognition, signal transduction, transcription and effector molecules. different cDNA abundance as PCR amplifications revealed different amplicon intensities for some of the confirmed genes, presumably due to different mRNAs prevalence in the normalized cDNA library (Fig 4). This was further analyzed by quantitative PCR (qPCR) utilizing non-normalized cDNA as template for amplification of immune-related genes. Indeed, results demonstrated the differential abundance of transcripts in the non-normalized transcriptome library confirming that genes tested in our study are not expressed at the same level and are likely representing the transcriptional activity at the time animals were collected from the bottom of the sea (Fig 5). Particularly, the metal-binding metallothionein gene was highly expressed in the non-normalized transcriptome library, probably reflecting the physiological need for continuous production of this protein to overcome the metal stress occurring at vent sites and to regulate intracellular levels of metals. Other immune response representatives also showed increased levels of expression as compared to the remaining genes studied. These include immune recognition molecule aggrecan and c-lectin, cell signalling molecule integrin, signal transduction complex component IRAK, and antibacterial protein defensin, (Fig 5), all providing evidence of a functional immune system in B. azoricus.
The selected putative genes are being currently investigated under controlled conditions in our laboratory, to assess the effect of long-term acclimatization in aquaria at atmospheric pressure, the effect of de novo hyperbaric stimulations in the IPOCAMP chamber and the effect of exposure to marine Vibrio bacteria, on B. azoricus transcript profiling experiments [20,21].
Furthermore, this search-based analysis was also particularly important to determine a "bacterial fingerprint" in B. azoricus gill tissues, since we expected the vent mussel to have a rich microorganismal community and more specifically a substantial accumulation of endosymbiont bacteria within its gill tissues. The selection of poly-A RNA as the starting material for our transcriptome library likely eliminated many potential microbial sequences. However, 3,522 contigs in DeepSeaVent presented protein match hits to bacterial phylotypes, supporting the evidence for the presence of bacteria in gill tissues of B. azoricus, and representing thus a potential bacterial fingerprint, most likely of chemoautotrophic nature, in deep-sea hydrothermal vent mussels. Additionally, a number of bacterial sequences were ascribed to several non-cultured marine bacteria, to chemolithoautotrophic, sulfur-or methane-oxidizing bacteria as evidenced, for instance, by the presence of the SoxB, SoxY, SoxH, methane monooxygenase, Biopolymer transport protein ExbD/TolR genes when querying our database using InterPro or GO terms as "methane" or "sulfur". Similarly, searches using the genera Calyptogena and Riftia names returned several hundreds of putative protein sequences, the majority of which associated to endosymbionts from the giant hydrothermal vesicomyid clam and vestimentiferan tubeworm, respectively from the East Pacific Rise hydrothermal region.
Such an approach has the potential to reveal sequences that have apparent bacterial origin including many genera of species that have been associated with mussel pathogens or as normal flora in the gut system [22]. In this case, mutualistic interaction between symbiont bacteria and their animal hosts may be taken to another level of analysis based on transcriptome sequencing. New genes involved in host recognition of endosymbionts and immune-effector mechanisms underlying host defense responses may shed light into understanding mutualism better and provide explanations as to how endosymbiont bacteria, living inside the bacteriocytes of vent mussels, are able to evade immune recognition, during early stages of acquisition and how mutualism is maintained. One possible explanation might lie within the immune response itself, where constitutive immune responses of bacteriocytes towards the dense population of endosymbionts, might be expressed at different levels of the rest of the body and therefore, keeping endosymbionts under control [22]. This hypothesis is currently under experimental consideration in our laboratory.

Deep-sea vent mussel comparison to shallow water mussel
The DeepSeaVent database was compared to Mytibase, a recently created EST database from the shallow water mussel, Mytilus galloprovincialis [23]. The comparison was restricted to the amino-acid sequences with InterPro annotation from both databases and by setting the BLAST parameters to an E-value of 10 -5 and bit scores ranging from 90 to 200. Under such conditions 5,261 and 1,923 amino-acid sequences, respectively, were identified. Furthermore, the shared amino-acid sequences were analyzed in relation to the distribution of GO functional annotation categories, using the 4,120 sequences retrieved with a bit score of 120. The highest percentage of common GO-annotated sequences (50% sharing and above) between the two databases corresponded to the categories, Cell Killing, Cellular Component Biogenesis, Macromolecular Complex and Structural Molecule Activity ( Table 3). The biological significance of these findings was not immediately evaluated due the comparatively low amount of amino-acid sequences available in the Mytibase as compared to DeepSeaVent. However, representatives of broad GO categories are present in both Mytibase and Deep-SeaVent databases, suggesting that mussels originated from distinct marine habitat may share common biological processes, cellular components and molecular functions.
Such inter-database computational analyses offers now the potential to unravel genes specifically involved in hydrostatic pressure and chemosynthetic environmental adaptations by comparing transcriptome profiles from two closely related Mytilid family members living in very distinct marine habitats. The fixed E-value 10 -5 was kept for three different bit-scores (90,120 and 200). The number of contigs with GO ontology assignments shared between both databases is shown together with respective shared percentage.

Conclusions
Comparison of our results with recently published transcriptomic studies on B. azoricus confirms the efficacy of 454 sequencing to reveal a large number of putative transcripts and significantly improve the genomic knowledge on this deep sea animal. The use of 454 pyrosequencing to develop a new EST collection containing potentially 39,425 new transcripts provides a new resource for genome-wide association studies of vent mussel physiological variations, which is the focus of ongoing projects in our laboratory, addressing in particular the molecular adaptation mechanisms of B. azoricus to deep-sea hydrothermal vent environments. This new resource now gathered in the DeepSeaVent database will set the stage for innovative work and the establishment of large scale expression studies to validate the deep-sea vent mussel as a bone fide experimental model to study the biology of adaptation to deep-sea hydrothermal vent environments.

Animals sampling
Mussels were collected from the hydrothermal vent field Lucky Strike (37°13.52' N, 32°26.18' W; 1700 m depth), on the Mid-Atlantic Ridge (MAR), with the American R/ V Revelle using the ROV Jason II (Woods Hole Oceanographic Institution), during the MAR08 cruise (July 9 th -August 16 th 2008) led by Chief Scientist Dr Anna-Louise Reysenbach. Once the mussels were brought to the surface, they were immediately processed onboard for subsequent manipulation of RNA or immediately stored at -80°C for long-term preservation of tissue samples.

cDNA library construction and pyrosequencing
Gill tissues from 6 different animals were dissected from -80°C preserved animals and processed for total RNA extraction using the RiboPure(tm) kit (Ambion (r) , Austin, TX). The quality of total RNA was verified on a 1.4% (w/v) agarose-MOPS-formaldehyde denaturating gels and by assessing the A 260/280 and A 260/230 ratios using the NanoVue spectrophotometer (GE Healthcare, Piscataway, NJ). Poly-A RNA was extracted from each total RNA sample using the Poly(A)Purist(tm) mRNA Purification Kit according to manufacturer's instructions (Ambion Inc, Applied Biosystems). The mRNA from each gill tissue sample from all 6 animals (approximately 20-50 μg/ml) was pooled and used as the source of starting material for cDNA synthesis and the production of normalized cDNA intended for 454 sequencing. The normalization process was performed by Evrogen (Moscow, Russia) and based on the SMART double-stranded cDNA synthesis methodology using a modified templateswitching approach that allows the introduction of known adapter sequences to both ends of the first-strand cDNA. cDNA was further amplified by PCR and normalized using Duplex-Specific Nuclease-technology [24,25]. Four micrograms of normalized cDNA were sequenced in a full plate of 454 GS FLX Titanium according to the standard manufacturer's instructions (Roche-454 Life Sciences, Brandford, CT, USA) at Biocant (Cantanhede, Portugal).

Sequence processing, data analysis and functional annotation
Following 454 sequencing, the quality trimming and size selection of reads were determined by the 454 software after which the SMART adaptor sequences were removed from reads using a custom script and the poly-A masked using MIRA, to assure correct assembly of raw sequencing reads [26]. A total of 778,996 quality reads were subjected to the MIRA assembler [26] (version 3.0.5), with default parameters, yet only 582,650 reads were assembled. For some reads, after masking the poly-A, the sequence length was shorter than 40 bp, otherwise the minimum length assumed by the MIRA default parameter settings. The software also disregards all reads that do not match any other read or that belong to the megahub group, i.e. a read that is massively repetitive with respect to other reads. Such reads are considered singlets and were not included in the final assembly result. On the other hand, singletons were defined as reads that match other reads but neither do have sufficient coverage nor present conditions to assemble with them. The entire set of reads used for final assembly was submitted to the NCBI Sequence Read Archive under the accession n°SRA024338 (Submission: SRA024338.1/Bathymodiolus azoricus).
The translation frame of the contigs was determined through queries against the NCBI non redundant protein database using BLASTx with an E-value of 10 -6 and assessing the best twenty five hits. Contigs without hits were submitted again to BLASTx homology searches against the NCBI nr database with a higher E-value cutoff set at 10 -2 . Sequences with a translation frame identification derived from the two previous searches were used to establish the preferential codon usage in B. azoricus based on which the software ESTScan detected further potential transcripts from the two previous sets of sequences with yet no BLASTx matches. This procedure originated a third set of sequences with putative amino-acid translation. To illustrate the outline flow for the transcriptome annotation procedure a diagram is shown in Fig 6. The entire collection of sequences of at least 30 amino-acid long, resulting from the BLASTx and the ESTScan procedures, was processed by InterProScan for the prediction of protein domain signatures and Gene Ontology terms. All the results were compiled into a SQL database developed as an information management system.
The distribution of sequences into GO categories was calculated at each level and were passed to the parent GO at the top of the broad ontology domains, considering that each single assignment into a GO child was only counted once in the total sum. This information was also useful to establish the number of amino-acid sequences shared between the two Mytibase and Deep-SeaVent databases.
The retrieval of sequence information matching bacterial phylotypes was based on a convergent analysis of bacterial hits among the best BLASTx hits. The number of positive hits was calculated and plotted according to the position of best hit ranging from the first to the twentieth. A convergent number of bacterial hits was found at 3,522 contigs (Fig 7). The unique identifiers of the 3,522 contigs (gi accession number) were retrieved and translated into the taxon ID using the information provided by NCBI. A custom script based on the BioPerl  module bio::db::taxonomy, version 1.6 [27], was used to link the taxon of interest to the superkingdom Bacteria and verify whether or not the BLASTx hit corresponded indeed to the canonical Eubacterial sequences.

Identification of candidate genes putatively associated with immunity and inflammatory reactions
Candidate genes were selected following queries performed to our database using distinct search descriptors based on BLAST hit descriptions; GO descriptions; Interpro descriptions, GO and Interpro identification numbers. Queries were aimed at the identification of genes described in the literature as being related to immunity and inflammatory reactions. Reverse-Transcription followed by Polymerase Chain Reaction (RT-PCR) was performed to confirm the physical identity of the selected genes recognized among the 454-ESTs using the normalized cDNA. 2 μg of total RNA, initially used to generate the transcriptome library, were reverse transcribed using the ThermoScript™ RT-PCR system (Invitrogen, CA, USA) following manufacturer's instructions. All the contig sequences from putative immune genes were submitted to GenBank and assigned the accession numbers HM756110 to HM756130; HM756132 to HM756134; HM756136 to HM756149; HM756151 and HM756152 (Table 4). Primers targeting the immune candidate genes were designed using the Primer-Blast [28] from NCBI http:// www.ncbi.nlm.nih.gov/, specifying an expected PCR product of 200-300 bp and primer annealing temperatures between 56°C and 58°C. 25 μl PCR volume reactions were set with 1 μl of each forward and reverse primers (0.5 μM final concentration) and using a 2× PCR mix from PROMEGA (Madison, USA). PCR cycling conditions were according to Bettencourt et al. 2009 [29]. PCR products were examined by agarose gel electrophoresis according to standard methods.
Quantitative PCR (qPCR) was further used to assess and quantify the relative expression of the candidate genes previously tested on semi-quantitative RT-PCR. The non-normalized cDNA was obtained as previously described and consisting of the same cDNA utilized for subsequent normalization and 454 sequencing procedure. Quantitative PCR reactions were performed on the CFX96™ Real Time PCR System mounted onto the C1000 Thermal Cycling platform (Bio-Rad, CA, USA). Amplifications were carried out using 0.5 μl (10 μM) of the specific primers as for semi-quantitative PCR and mixed to 10 μl of SsoFast™ Eva Green SuperMix (SYBR based system, Bio-Rad) and 50 ng of cDNA in a final volume of 20 μl. PCR cycling conditions were 95°C for 3 min, followed by 35 cycles of 95°C 10 s, 58°C 15 s and 72°C 30 s. 6 replicates were performed for each gene tested in real time PCR reactions. Melt curves profiles were analyzed for each gene tested. The 28S rRNA gene was used as the housekeeping gene and for normalization of expression of gene of interest or immune-related target genes. The comparative CT method (ΔΔC T ) for the relative quantification of gene expression was used for assessing the normalized expression value of immune-related genes using the 28S rRNA as the control transcript (CFX Manager™ Software, Bio-Rad). Data were transferred to Excel files and plotted as histograms of normalized fold expression of target genes.