AmpuBase: a transcriptome database for eight species of apple snails (Gastropoda: Ampullariidae)

Background Gastropoda, with approximately 80,000 living species, is the largest class of Mollusca. Among gastropods, apple snails (family Ampullariidae) are globally distributed in tropical and subtropical freshwater ecosystems and many species are ecologically and economically important. Ampullariids exhibit various morphological and physiological adaptations to their respective habitats, which make them ideal candidates for studying adaptation, population divergence, speciation, and larger-scale patterns of diversity, including the biogeography of native and invasive populations. The limited availability of genomic data, however, hinders in-depth ecological and evolutionary studies of these non-model organisms. Results Using Illumina Hiseq platforms, we sequenced 1220 million reads for seven species of apple snails. Together with the previously published RNA-Seq data of two apple snails, we conducted de novo transcriptome assembly of eight species that belong to five genera of Ampullariidae, two of which represent Old World lineages and the other three New World lineages. There were 20,730 to 35,828 unigenes with predicted open reading frames for the eight species, with N50 (shortest sequence length at 50% of the unigenes) ranging from 1320 to 1803 bp. 69.7% to 80.2% of these unigenes were functionally annotated by searching against NCBI’s non-redundant, Gene Ontology database and the Kyoto Encyclopaedia of Genes and Genomes. With these data we developed AmpuBase, a relational database that features online BLAST functionality for DNA/protein sequences, keyword searching for unigenes/functional terms, and download functions for sequences and whole transcriptomes. Conclusions In summary, we have generated comprehensive transcriptome data for multiple ampullariid genera and species, and created a publicly accessible database with a user-friendly interface to facilitate future basic and applied studies on ampullariids, and comparative molecular studies with other invertebrates. Electronic supplementary material The online version of this article (10.1186/s12864-018-4553-9) contains supplementary material, which is available to authorized users.

range of morphological, behavioural and physiological adaptations to their inhabited environments [10,11]. For example, the evolutionary radiation of Lanistes in Lake Malawi contains species with contrasting morphological and behavioural features that have been interpreted as differential adaption to habitats which differ in wave action, food resources, and predators [9,12]. Due to their long evolutionary history, wide geographic distribution and high diversity, Hayes et al. [4] suggested that ampullariids altogether provide an interesting system to study speciation and phylogeography in freshwater gastropods. Furthermore, several species of apple snails, especially P. canaliculata and P. maculata, are notorious invasive species in Asia and Hawaii, where they cause dramatic agricultural losses [10,13], and other conservation concerns such as reductions in aquatic plant diversity and shift in wetland ecosystem functions [14,15]. Therefore, there is substantial interest in the mechanisms of adaptation that have enabled these species to become invasive pests [16,17], and in their biological control [18,19].
Ampullariids are well-known for their diverse reproductive behaviours. While they are all dioecious and most genera of apple snails deposit their eggs in a jelly mass underwater, two genera (i.e., Pomacea and Pila) produce calcareous egg clutches that are deposited above the waterline. The shift from aquatic to aerial oviposition thus has occurred at least twice in the evolution of ampullariids, indicating parallel evolution in the genera Pomacea and Pila with respect to the changes in egg deposition behaviour and morphology (e.g., larger lung size and longer siphons [10]). Such behavioural and morphological adaptations in Pomacea are known to be accompanied by biochemical adaptations to predation [20]. In this regard, studies of several Pomacea species have shown that the major proteins of the egg perivitelline fluid (PVF), the fluid that surrounds and nourishes the embryo, possess multiple protective functions against predators including several anti-predation proteins (perivitellins) displaying anti-digestive, antinutritive, neurotoxic and aposematic properties [20][21][22][23]. Comparison between the protein-coding genes of P. canaliculata and P. maculata has revealed the involvement of gene duplication and positive selection in the formation and evolution of some PVF proteins [24,25]. Further comparison with more distantly related genera/ species would yield novel insights into the origin and evolution of PVF proteins that may underlie the diversity of reproductive behaviour and morphology in apple snails.
Apart from their use in ecological and evolutionary studies, some ampullariids, including P. canaliculata and M. cornuarietis, have been used in toxicological studies due to their high fecundity and the high sensitivity of their juveniles to pollutants such as heavy metals [26], organic pesticides [27] and organotins [28]. Mortality and deficiencies of growth or development have typically been considered to be informative toxicity end-points. Nevertheless, the lack of extensive genomic resources hinders the documentation of molecular pathways in toxicological studies of apple snails.
To facilitate molecular-oriented studies on apple snails, we sequenced the transcriptomes of seven species of apple snails: Lanistes nyassanus; Pila ampullacea; Asolene platae; Marisa cornuarietis; Pomacea diffusa; Pomacea scalaris and Pomacea canaliculata. Together with our previously generated RNA-Seq data for P. canaliculata [29] and P. maculata [25], we cover eight species which represent five genera (Fig. 1a) and both the New World and Old World clades. Among the Old World species are L. nyassanus, a species endemic to Lake Malawi in the East African Rift [9,30]; and Pila ampullacea, a common species in the paddy fields and irrigation channels of northern Thailand [31]. Among the New World species, A. platae is restricted to the La Plata River basin and has a distribution range from Bolivia to the northern Buenos Aires province of Argentina [32]; this species has a slower growth rate and smaller reproductive output than other ampullariids and probably less invasive [33]. The other five species have been introduced from South America to various freshwater ecosystems in North America, Asia and Pacific islands including Hawaii [10,13,34,35]. Following their introduction, two species of Pomacea (i.e., P. canaliculata and P. maculata) have become widely distributed and they are regarded as some of the most notorious invasive species in freshwater habitats [7,34,36,37]. Our species selection thus covers the various phylogenetic lineages, the diversity of reproductive strategies, the most important invaders, and members that are commonly adopted in ecotoxicology. Fig. 1b shows the phylogenetic relationships among the species used in this study, whereas a phylogeny featuring more extensive taxon sampling is presented in Additional files 1 and 2.

Sample collection and preparation
Adult snails were collected from the field in various regions of South America, Africa and Asia, or purchased from an aquarium shop in Hong Kong (Table 1). All snails were reared in aquaria filled with tap water and acclimated for at least one month at 26 ± 1°C and a photoperiod of 14 h light/ 10 h dark. Snails were fed with a mixed diet of lettuce, carrot and fish meal once a day and the water was renewed twice a week. For most of the species, four to five female and male snails were chosen for dissection to obtain various tissues. For L. nyassanus, however, due to limited individuals available, only a female was used for dissection. Dissected tissues were immediately fixed in RNAlater™ (Invitrogen, USA) and then stored at − 20°C until they were subjected to RNA extraction.

RNA isolation and sequencing
Total RNA was extracted separately from each tissue sample using TRIzol® reagent (Invitrogen, MA, USA) following the manufacturer's protocol. In general, two RNA samples, including one of the albumen gland (AG), and one of other tissues (OT), which contained equal amounts of RNA extracted from three to four tissue types, were prepared for sequencing (Table 1). AG was always processed separately, because this organ, which secrets the perivitelline fluid that protects and nourishes the embryo, is expected to play a crucial role in the reproduction and evolution of ampullariids [24,25,38]. More tissue types of P. canaliculata were sequenced due to the need for producing tissue-specific gene expression data in another project for this species. To enhance the comprehensiveness of the transcripts for P. canaliculata, we combined our new data with the transcriptome data we generated from a previous study [29] for assembly. The transcriptome data of P. canaliculata from another study [39] were not included here because of uncertainty of sample preparation, and because more data would not likely improve the assembly metrics [40]. Raw reads of P. maculata were obtained from a recent publication [25], and re-assembled as described below. In P. scalaris, only AG was sequenced due to the lack of high quality RNA in OT preserved in RNAlater. For all samples, the quality of extracted RNA was determined using an Agilent 2100 Bioanalyzer (Agilent Technologies, Germany). Samples with an RNA Integrity Number ≥ 8 were used for cDNA library construction using a TruSeq RNA Sample Prep Kit v2 (Illumina, California, USA), and sequenced in paired-end mode on an Illumina HiSeq sequencer (Illumina, California, USA). Library construction and sequencing were conducted by BGI Hong Kong as a commercial service. a b Fig. 1 Geographical distribution and phylogeny of apple snails used in the present study. a Rough native distribution ranges of the Old World (Lanistes and Pila) and New World (Asolene, Marisa and Pomacea) genera/species [7,56]. b A maximum likelihood tree showing the phylogenetic relationship among the eight species of ampullariids based on sequences of three genes used in previous phylogenetic studies of ampullariids [6,52]. Methodological details for the phylogenetic analysis can be found in Additional file 1. Bootstrap support values are shown, as is a scale bar of 0.05 substitution per site. Photo credit: L. nyassanus, Pila ampullacea and M. cornuarietis (JCHI); A. platae and P. scalaris (SI); P. canaliculata, P. maculata and P. diffusa (HM)

Transcriptome assembly and annotation
Illumina raw reads were cleaned by removing adaptor sequences, reads with > 5% unknown "N" bases or > 20% bases with a quality score ≤ 10 (Table 1). Trimmomatic v0.33 was then used to further remove low quality reads with a quality score < 20 and a length < 40 base pairs (bp) [41]. For each species, clean reads from different tissue samples were pooled for de novo assembly using Trinity 2.2.0 under default settings [42]. The assembled transcripts (ranging from 126,582 to 388,329; Table 2) were clustered with CD-HIT-EST 4.6.6 to reduce redundancy using a threshold of 95% sequence similarity [43]. Open reading frames (ORFs) were predicted with Trans-Decoder 3.0.0 (https://transdecoder.github.io/) using a threshold of ≥100 amino acids. Only the single best ORF per transcript was retained. The longest ORF in each gene cluster was selected as the unigene. Expression levels were estimated as transcripts per kilobase million read (TPM) using Salmon 0.7.2 [44], and unigenes with TPM less than 0.5 were considered as non-expressed [25]. The level of completeness of our eight assembled transcriptomes was evaluated using BUSCO (benchmarking universal single-copy orthologs) v1.1b [45].
Predicted protein sequences were annotated using BLASTp 2.4.0+ [46] against NCBI's non-redundant (nr) database with an E-value of 1 × 10 − 5 . Gene Ontology (GO) function for each unigene was assigned using Blast2GO [47] with BLASTp nr input. Sequences were also submitted to the Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Sever (http:// www.genome.jp/kegg/kaas/) to determine their functional relationships using the bi-directional best-hit method. References for the KEGG annotation included the default representative eukaryotic genomes as well as  Table 2.

AmpuBase database construction
AmpuBase is a relational database that provides public access to these newly assembled ampullariid transcriptomes and annotations. The database structure and layout are similar to those of PcarnBase [48], except that data from several species can be searched at the same time and that the GO and KEGG search pages are integrated. In brief, for each species, a relational database was developed using MySQL v5.6.34 and hosted on an Apache HTTP server. The BLAST search function is powered by ViroBLAST [49] using the PHP programming language. The database consists of DNA and protein sequences of all unigenes that are linked with associated NCBI nr, GO and KEGG annotations through unigene ID. The database consists of five entity tables ("NCBI annotation", "Proteins", "DNAs", "Gene Ontology" and "KEGG") and two relation tables ("NCBI_GO_relation" and "NCBI_KEGG_relation").

Transcriptome assembly metrics
There were between 72,341,892 to 462,231,634 bp of clean data, corresponding to between 20,730 and 35,828 unigenes with ORFs in each of the eight species (Table 2; Fig. 1a). The mean N50 value (shortest sequence length at 50% of the unigenes; 1576 bp) and the percentage of annotated unigenes (average 75.9%) in our study are higher than the corresponding values from previously published ampullariid transcriptomes (P. canaliculata, N50: 283 bp, 29.2% unigenes annotated [29]; P. maculata, N50: 1332 bp, 36.6% unigenes annotated [25]). Our transcriptome assembly metrics are comparable to those of recently published transcriptomes from other families of mollusks (Table 3), indicating the overall robustness of our transcriptome sequencing, assembly and annotation pipeline.
To further evaluate the completeness of transcriptomes, we examined the proportions of complete as well as partial homologs of 843 conserved metazoan genes within the eight coding unigene sets. The transcriptomes contain 77.46 to 92.41% of the complete conserved metazoan genes, and 87.54 to 96.09% of the genes if fragmented BUSCO hits are also included ( Table 2). These BUSCO metrics are comparable with those of other mollusc transcriptomes published in recent years (Table 3).

AmpuBase: Functions and applications
AmpuBase is available online via web interface at http:// www.comp.hkbu.edu.hk/~db/AmpuBase/. The database can be searched with BLAST or other query terms. The BLAST search function allows users to blast query sequence(s) or fasta files against single or multiple DNA/protein datasets with default settings (under Basic Search option) or customizable settings (under Advance Search option) (Fig. 2a). Upon submitting a BLAST search, matched sequences are returned with their Evalue and similarity score, and information on the corresponding annotation can be obtained by clicking "Unigene ID" (Fig. 2b).
Apart from BLAST search, the transcriptome data can be searched in two other ways (Fig. 2c). General Annotation Search allows one to query the relevant annotations (i.e., NCBI annotation, GO and KEGG) either by using the unigene ID or unigene name (e.g., perivitellin ovorubin). Each successful query returns a table that contains Unigene ID, NCBI's nr, GO and KEGG description (if available). The resultant sequences can be downloaded by selecting the Unigene ID and clicking "Submit" for further analysis, for example, phylogenetic analysis of perivitellin ovorubins, major and multiple functional proteins in PVF [20,24,25]. In addition, GO and KEGG Annotation Search is also provided for searching GO and KEGG information using GO ID, KEGG ID or a keyword. All sequence data for these ampullariid transcriptomes are available for download under the "Downloads" menu, for transcriptome wide data mining and analysis of a specific gene.

Conclusions
In this study, we have generated a large set of transcriptome data for eight species that represent five genera of Ampullariidae. These data are compiled in a relational database, AmpuBase, which greatly enhances the publicly available genomic resources for ampullariids. The database provides tools for sequence-or keyword-based query functions, which will facilitate in-depth ecological and evolutionary studies on ampullariids, and comparative studies with other invertebrates. AmpuBase will be updated when more genomic data become available in the future.

Additional file
Additional Authors' contributions JCHI performed the experiments, data analyses and drafted the manuscript. HM and JS coordinated the experiments and revised the manuscript. XH and QC designed and constructed the database website, wrote the database section of the manuscript, and revised the manuscript. SI, HH, BVB and MG collected samples, provided advice on snail culturing and revised the manuscript. JWQ designed and oversaw the study, and revised the manuscript. All authors read and approved the final manuscript.