Transcriptome deep-sequencing and clustering of expressed isoforms from Favia corals
© Pooyaei Mehr et al.; licensee BioMed Central Ltd. 2013
Received: 12 December 2012
Accepted: 3 August 2013
Published: 12 August 2013
Skip to main content
© Pooyaei Mehr et al.; licensee BioMed Central Ltd. 2013
Received: 12 December 2012
Accepted: 3 August 2013
Published: 12 August 2013
Genomic and transcriptomic sequence data are essential tools for tackling ecological problems. Using an approach that combines next-generation sequencing, de novo transcriptome assembly, gene annotation and synthetic gene construction, we identify and cluster the protein families from Favia corals from the northern Red Sea.
We obtained 80 million 75 bp paired-end cDNA reads from two Favia adult samples collected at 65 m (Fav1, Fav2) on the Illumina GA platform, and generated two de novo assemblies using ABySS and CAP3. After removing redundancy and filtering out low quality reads, our transcriptome datasets contained 58,268 (Fav1) and 62,469 (Fav2) contigs longer than 100 bp, with N50 values of 1,665 bp and 1,439 bp, respectively. Using the proteome of the sea anemone Nematostella vectensis as a reference, we were able to annotate almost 20% of each dataset using reciprocal homology searches. Homologous clustering of these annotated transcripts allowed us to divide them into 7,186 (Fav1) and 6,862 (Fav2) homologous transcript clusters (E-value ≤ 2e-30). Functional annotation categories were assigned to homologous clusters using the functional annotation of Nematostella vectensis. General annotation of the assembled transcripts was improved 1-3% using the Acropora digitifera proteome. In addition, we screened these transcript isoform clusters for fluorescent proteins (FPs) homologs and identified seven potential FP homologs in Fav1, and four in Fav2. These transcripts were validated as bona fide FP transcripts via robust fluorescence heterologous expression. Annotation of the assembled contigs revealed that 1.34% and 1.61% (in Fav1 and Fav2, respectively) of the total assembled contigs likely originated from the corals’ algal symbiont, Symbiodinium spp.
Here we present a study to identify the homologous transcript isoform clusters from the transcriptome of Favia corals using a far-related reference proteome. Furthermore, the symbiont-derived transcripts were isolated from the datasets and their contribution quantified. This is the first annotated transcriptome of the genus Favia, a major increase in genomics resources available in this important family of corals.
With the advent of Next-Generation Sequencing (NGS) technology, genomic data acquisition has become much easier, especially for non-model organisms . The generation of transcriptomes from non-model organisms has also benefitted from NGS advances. Transcriptomic datasets can facilitate genome annotation, single-nucleotide polymorphism (SNP) analysis , marker development for population genetic and adaptive evolutionary studies , as well as functional classification  in non-model species. The application of transcriptome deep sequencing in metabolic pathway reconstruction and gene marker development has already shown great promise in Camellia sinesis, Cicer arietinum, Sphenodon punctatus, and Anopheles funestus.
This method is also valuable for relatively understudied species, such as Favia corals. Though corals are high in economic and ecological value, limited genomic resources are available, largely because samples are difficult to obtain. Because NGS requires only small amounts of animal tissue, it is possible get large amounts of information from very small samples (1–2 coral polyps). Recently, anthropogenic threats such as climate change, metal pollution and oceanic acidification  have led to rapid declines in worldwide coral populations, lending increased urgency to the need for genomic data. Detailed understanding at the genomic and transcriptomic level will allow for the development experimental studies to assess how the intensity and frequency of disturbances affects coral health and abundance.
Several studies have reported NGS long reads transcriptome sequencing of coral species such as Acropora millepora[10, 11] and Pocillopora damicornis. In addition, other recent studies have used the Short Sequence Reads (SSR) platform , or combined SSR and long reads approach to explore whole transcriptome modulation in response to low pH in adult Pocillopora damicornis, and in early life stages of Acropora millepora. Yet, these coral clades are quite phylogenetically divergent from Favia.
This Favia genus is one of the most widely and uniformly distributed of all coral genera and is phenotypically presented as massive, dome-shaped and flat. In many cases Favia species exhibit cryptic species complexes and their phylogeny has been parodied as being a “Bigmessidae” . Favia are in the Faviidae family that contains twenty-four genera, more than any other coral family . Faviidae is one of the highly fragmented families and Indo-Pacific members appear to be distinct from Atlantic counterparts. Therefore, adding more molecular markers to resolve their phylogeny will add further resolution to coral systematics.
Holobiont cDNA libraries were synthesized from the RNA of two individual adult Favia sp. collected from the Gulf of Eilat in the Red Sea. Illumina runs performed on each separate, normalized, cDNA pool generated approximately 80 million reads per sample with average quality scores > Q20 at each base. The first step of assembly was carried out with ABySS [18, 19], a de Brujin graph assembler. In order to recover transcripts across a range of expression levels, we carried out assembly across a range of k-mer values. Transcripts with low depth (i.e. weakly expressed) are best recovered with low k-mer values, while high depth (i.e. highly expressed) transcripts are best recovered with high k-mer values . Using a range of k-mer values also allows for the identification of expressed splice variants arising from a single gene. As the Illumina read length was set to 75 bp, we chose initial k-mer values ranging from 29 to 45 bp for each sample run.
After using the EMBOSS package  to generate all possible open reading frames (ORFs) from stop to stop for each assembled contig, the resulting predicted ORFs were searched for sequence similarity against the N. vectensis proteome , using reciprocal BlastP (E-value ≤2e-30)  (Script 1). For the 519,766 predicted ORFs longer than 150 bp, 12,141 unique ORFs in Fav1 showed considerable sequence similarity to 7,186 existing protein sequences in N. vectensis. Similarly, 12,425 unique ORFs in Fav2 showed similarity to 6,862 N. vectensis protein sequences (Additional file 2: File S2, Additional file 3: File S3). The top Blast hits for each sample were saved in a pre-clustering list using a Perl script (Script 2; Output files reported in Additional file 4: File S4, Additional file 5: File S5). These lists were then used in TRIBE-MCL  to identify homologous protein family clusters in a comprehensive and uniform way (Additional file 6: File S6, Additional file 7: File S7). The main clustering parameter, inflation value (r), was selected as default (r = 2.5). Fav1 and Fav2 had similar numbers (7,186 and 6,862, respectively) of protein family clusters homologous to unique N. vectensis proteins. These clusters were subjected to further functional annotation.
In order to evaluate the completeness of our annotation using N. vectensis as the reference as opposed to using another available Cnidarian non-annotated proteome (A. digitifera), we applied a newly-developed completeness metric  (In prep.) to determine the proportion of the reference proteome covered by our sets of assembled transcripts. Only those ORFs with length coverage ≥80% of the matched protein from the N. vectensis or A. digitifera proteome were included. Completeness measurements in Fav1 and Fav2 compared to N. vectensis were 29.54% and 28.20%, respectively; when the same procedure was carried out using the unannotated proteome of A. digitifera as a reference (23,677 ORFs downloaded from http://marinegenomics.oist.jp/genomes/downloads?project_id=3. This showed an improvement of only 1-3%, thus validating our usage of N. vectensis as a reference proteome (Additional file 8: Table S1).
To identify the putative function of 7,187 isoform clusters, Gene Ontology (GO) and protein domain (KOG, InterPro) searches were performed using the functional annotation of the N. vectensis. (Data downloaded from the JGI genome project http://genome.jgi-psf.org/Nemve1/Nemve1.download.html). The clusters were assigned gene names based on the gene name annotation of the best Blast match for the sequences (Additional file 9: File S8). This process successfully assigned gene names for 6,632 (92.27%) clusters using GO term, KOG description, and InterPro description. Among 12,141 annotated best hits, 11,411 (93.98%) gene names were assigned to sequences. These provide a rough estimate of the number of different genes expressed in Fav1 libraries. Broadly, the putative homologs of genes involved in various cellular processes and pathways found to be functionally conserved.
Top 30 frequent annotated functions of homologous protein clusters in Fav1
Top frequent GO-annotated homologous protein clusters in Fav1
1-Nucleic acid binding
2-Protein kinase activity
5-Calcium ion binding
10-Structural constituent of ribosome
3-Intracellular signaling cascade
4-Proteolysis and peptidolysis
8-Intracellular protein transport
10-Regulation of cell cycle
1-Ubiquitin ligase complex
2-Integral to membrane
Intracellular signaling pathway genes annotated in Fav1
Intracellular signaling pathway proteins annotated in Fav1
Receptor activity (IFRD-C)
Nuclear factor NF-kappa-B
Intermediate in Toll-signaling
Hepatocyte nuclear factor 4
RTK signaling protein
Major transcription factor families identified by conserved domain annotation
Transcription factors identified by KOG/InterPro/GO annotation in Fav1
Transcriptional Coactivator P50
Transcriptional Coactivator P100
Transcriptional Coactivator CAPER
P53 DNA-binding domain
NF-X1-type zinc finger protein
Dimerization partner (TDP)
Basic region leucine zipper & bZIP
Helix-loop-helix DNA binding domain
Myb-like DNA-binding domain
Zinc finger C2H2 type
Zinc finger MIZ type
Holobiont coral tissues also contain eukaryotic dinoflagellate endosymbionts of the genus Symbiodinium[30, 31]. We therefore determined the contribution of symbiont-derived transcripts in our analysis. First, we extracted the regions of cDNA contigs that corresponded to each individual annotated ORF in two datasets (For commands, see Additional file 1: File S1). Furthermore their similarity search against two Symbiodinium transcriptomes (http://medinalab.org/zoox/) was performed using BlastN. In order to define an E-value as a cutoff threshold, a reciprocal BlastN search between the N. vectensis genome and the two Symbiodinium transcriptomes showed an average E-value of e-80. Thus all contigs with similarity higher than this threshold to Symbiodinium were defined as likely to be symbiont-derived. Based on these results, 9% of the annotated ORFs (1.34% of the total assembled contigs) of Fav1 were labeled as symbiont sequences, and 8.7% (1.61% of total assembled contigs) of Fav2. FASTA files of these non-symbiont transcripts are reported (Additional file 11: File S9, Additional file 12: File S10). Finally, we performed BlastX (E-value equal to at least e-30) on the non-symbiont derived cDNA fragments against the N. vectensis proteome to confirm correct initial annotation by BlastP. All the cDNA sequences matched to the same N.vectensis IDs that were predicted using BlastP.
Molecular markers are essential tools for population genetic studies. Typically, combination of mitochondrial and nuclear markers are used to examine the species relationships. In order to generate a Favia molecular marker dataset, we downloaded Favia related sequences from NCBI. Similarity searches were carried out against this Favia dataset. Among various molecular markers, we chose COI, Cytb and 28S. Individual sequence regions were identified and extracted from the cDNA contig files in both samples. DNA alignments for each locus were generated using ClustalW2 with default parameters  (Additional file 13: File S11, Additional file 14: File S12, Additional file 15: File S13). Consequently, a matrix of these three loci was generated using FASconCAT . A Maximum likelihood phylogenetic analysis (RaxML) was carried out . Maximum likelihood phylogenetic analysis using three loci (COI, Cytb, 28S) suggests that these Favia samples belong to Faviids (Additional file 16: Figure S2). Morphological analysis places them as F. albidus, a species that is not yet represented in NCBI. For example, out of 18 Favia species that have been described morphologically, only 15 of them have molecular data in NCBI. F. albidus, F. helianthoides, and F. marshae lack molecular markers in NCBI. Based on geological distribution  and morphology, we suggest these two species belong to F. albidus. In fact, F. helianthoides has no morphological similarities with our samples, and F. marshae habitat has never been reported in Red Sea . However, further skeletal samplings are required for final validation [35, 36]. Regardless, this study increases the protein information of the Faviids from 496 proteins to over 12,000 proteins in NCBI.
In order to evaluate our assembly method and the possible impact of ABySS-specific errors on the annotation accuracy of the long candidate FP sequence, we performed both Trans-ABySS  and Trinity  on reads from Fav1. Both assembly programs led to the generation of sequences identical to Fav1 s23Contig16657-5 as predicted using ABySS and CAP3. (Additional file 20: File S15).
In order to rule out the possibility of promiscuous domain assembly, we assessed the quality of the de novo assembly of FP sequences, as well as all other transcripts, by mapping reads on assembled contigs for each sample. Such read alignment to contigs is necessary to provide support for new transcript identification as well as for determining gene expression levels [49, 50]. In order to measure the Reads Per Kilobase of exon model per Million mapped reads (RPKM) , a sub-fasta cDNA region, corresponding to each ORF, within each contig was generated. Reads were aligned to these annotated cDNA regions. Coverage (RPKM) measurements were determined using a Perl script (Script 3). The results are reported (Additional file 22: File S16, Additional file 23: File S17). The mapping of all the reads onto the annotated Faviids transcript showed that the number of reads corresponding to each transcript ranged from 10 to 47,189, with an average of 850 reads per transcript in Fav1, and 10 to 29,222, with an average of 766.37 reads per transcript in Fav2, indicating a wide range of expression level of Faviids transcripts. It also indicates that very low expressed annotated Faviids transcripts were also represented in our assembly. The minimum coverage (RPKM) of an annotated Fav1 transcript was 3.89 and maximum of 6,919.20 with an average of 68.61. The RPKM ranged from 3.60 to 8,576, with an average of 72.64 in Fav2. The average and the range of RPKM per transcript is similar and somewhat higher (25.7) than other whole transcriptome studies .
In this study, we demonstrate a gene clustering strategy and utilize this in conjunction with NGS contig assembly, sequence conservation measurements, annotation and expression quantification for de novo assembled transcriptomic data. Working with two uncharacterized Faviid corals, we report 120,000 non-redundant transcripts to a genus whose sequence data was previously limited to 496 in public databases. These results provide greatly enhanced access to the expressed genes in Faviidae reef building corals, a potentially valuable resource of genetic/functional markers for population structure and functional genomic studies. We also took advantage of the optical properties of these corals expressed fluorescent proteins to validate our annotation methods to show that these sequences were indeed bonafide fluorescent protein genes. These methods reported in this study are available via Open Source software programs as well as our provided scripts.
This study was conducted during the period of May–June 2009 on a coral reef on the northern tip of the Gulf of Eilat, in the northern Red Sea (29º30′N, 34°55′E).
Samples were collected at 65 m, using closed-circuit trimix rebreather system (Megalodon™). The organisms were identified under water to the family level, Faviidae, and brought to the surface in a black mesh bag to avoid sun exposure. The organisms were immediately photographed and vouchered with white light and fluorescent photography as described in  and stored in a shaded running-seawater facility. Within 1–2 hours of collection, samples were rinsed in sterile-filtered artificial seawater and processed for RNA and DNA. The tissue of the coral was extracted from the skeleton using QiaShredder (Qiagen). For RNA, the TriZol method was used and stored as an ethanol precipitate for travel back to the US. DNA was extracted using Qiagen DNAeasy kit according to manufacturer’s protocol and stored in at 4°C. The specimens have been photo vouchered and their genomic and transcriptomic raw materials are stored in the American Museum of Natural History Ambrose Monell Cryo Collection.
Illumina sequencing using the GAII platform was performed at the Yale University W.M. Keck Biotechnology Resource Laboratory according to manufacturer’s instructions (Illumina, San Diego, CA) (Additional file 24: File S18) and using high quality RNA with a 28S rRNA band at 4.5 kb that is at least twice the intensity of the 18 s rRNA band at 1.9 kb. The cDNA library contained 77,804,306, 75-mer length reads. The sequencing data are deposited in NCBI Sequencing Read Archive . (The BiosampleIDs = SAMN01761696, SAMN01761695).
De novo assembly was carried out using ABySS with default settings across multiple k-mer values . After assessing different k-mer values, the three best k-mer assemblies (35-mer, 39-mer, 45-mer for Fav1 and 31-mer, 35-mer, 39-mer for Fav2) were selected and concatenated for the second step of assembly. To evaluate the N50 length and the number of assembled contigs using different k-mer values, we used a Perl script. CAP3  was used to remove redundancy across ABySS assemblies and to merge contigs into longer sequences. All assembled contigs were subjected to annotation and further protein homology searches. Trans-ABySS  and Trinity  were used to confirm the long ORFs, homologous to fluorescent proteins, which were identified with ABySS and CAP3.
A set of possible Open Reading Frames (ORF), stop to stop from assembled sequences, was generated using EMBOSS . To annotate the de novo assembled sequences, a similarity search against N. vectensis proteome was conducted using BLASTP with two E values of 2e-10 and 2e-30. The resulting data (E-value of 2e-30) was filtered and clustered using TRIBE-MCL . Each homologous group was annotated using GO and KOG annotated N. vectensis data (http://genome.jgi-psf.org/Nemve1/Nemve1.download.html). For Symbiodinium peptide annotation, a homology search using BLASTN with E values of 2e-80 against the Symbiodinium transcriptome (http://medinalab.org/zoox/) was carried out. The final non-symbiont FASTA cDNA fragments were reported.
The BlastP (E-value of 2e-30) output list generated from homology search of both samples against N. vectensis and A. digitifera was organized for completeness measurements. The completeness formula according to  was implemented into a Perl script (In prep) to determine the percentage of the reference proteome that is covered by each of our sets of assembled transcripts. Length coverage of each of these reference ORFs by a hit from our data set had to be at least 80%.
The maximum likelihood tree of identified fluorescent protein was generated using RaXML  under PROTGAMMAWAGF amino acid substitution model, selected based on the results from ProtTest . The alignment was generated using MAFFT  and CLUSTALW2  with minor adjustment at the N-terminus region, when long gaps were inconsistent with other isoforms. Bootstrap values were estimated based on 1,000 replicates and were given for all presented branches. The variant sites were visualized with geneious (http://www.geneious.com). Dendroscope was used for visualization .
Molecular barcodes for all the Favia related sequences were downloaded from NCBI. A similarity search with sequences from our annotation was carried out against this Favia dataset. Cytb, COI and 28S sequences were identified and extracted from the cDNA contig files in both samples. DNA alignments for each locus were generated using ClustalW2 with default parameters . Consequently, a matrix of these three loci was generated using FASconCAT . A Maximum likelihood phylogenetic analysis (RaxML) was carried out . Bootstrap values were estimated based on 10,000 replicates and were given for all presented branches. Dendroscope was used for visualization .
The three cDNA sequences (Fav1 s23Contig16657, Fav2 s62Contig19888-6 and Fav2 s62Contig41210-3) were synthesized and propagated in pUC57 (GenScript USA Inc.). Kozak  analysis was used to determine the location of the potential start codon. The genes were subcloned from pUC57 into the NotI-BamHI site of the mammalian expression vector pcDNA 3.1 (Invitrogen, Inc.) using standard recombinant techniques .
Gene coverage levels were determined using a Perl script (Script 3). This script implements Bowtie  to map reads to an annotated reference cDNA, and calculates the RPKM according the formula used in . For visualization, BWA  was used to generate the read-to-contig alignment. The annotated cDNA from individual samples were used as the reference contig, and SAMtools  was used to generate binary files to be visualized in the IGV  genome viewer (For commands, see Additional file 1: File S1).
We thank Timor Katz and Tali Mass of the Interuniversity Institute of Eilat for their technical diving assistance in the collection of coral samples. Coral sample collections in this study have complied with the current laws of Israeli Natural Parks Authority; permit 2010/38008. Funding was provided by NSF grant # 0920572 and via a Baruch College Travel Grant to DFG.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.