Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx
© Meyer et al. 2009
Received: 13 January 2009
Accepted: 12 May 2009
Published: 12 May 2009
Skip to main content
© Meyer et al. 2009
Received: 13 January 2009
Accepted: 12 May 2009
Published: 12 May 2009
New methods are needed for genomic-scale analysis of emerging model organisms that exemplify important biological questions but lack fully sequenced genomes. For example, there is an urgent need to understand the potential for corals to adapt to climate change, but few molecular resources are available for studying these processes in reef-building corals. To facilitate genomics studies in corals and other non-model systems, we describe methods for transcriptome sequencing using 454, as well as strategies for assembling a useful catalog of genes from the output. We have applied these methods to sequence the transcriptome of planulae larvae from the coral Acropora millepora.
More than 600,000 reads produced in a single 454 sequencing run were assembled into ~40,000 contigs with five-fold average sequencing coverage. Based on sequence similarity with known proteins, these analyses identified ~11,000 different genes expressed in a range of conditions including thermal stress and settlement induction. Assembled sequences were annotated with gene names, conserved domains, and Gene Ontology terms. Targeted searches using these annotations identified the majority of genes associated with essential metabolic pathways and conserved signaling pathways, as well as novel candidate genes for stress-related processes. Comparisons with the genome of the anemone Nematostella vectensis revealed ~8,500 pairs of orthologs and ~100 candidate coral-specific genes. More than 30,000 SNPs were detected in the coral sequences, and a subset of these validated by re-sequencing.
The methods described here for deep sequencing of the transcriptome should be widely applicable to generate catalogs of genes and genetic markers in emerging model organisms. Our data provide the most comprehensive sequence resource currently available for reef-building corals, and include an extensive collection of potential genetic markers for association and population connectivity studies. The characterization of the larval transcriptome for this widely-studied coral will enable research into the biological processes underlying stress responses in corals and evolutionary adaptation to global climate change.
Reef-building corals support productive and diverse marine communities that provide important ecosystem services , but are threatened throughout their global range by a variety of stressors that include increasing sea surface temperatures, pollution, ocean acidification and disease [2–4]. A major focus of contemporary coral biology is to understand the factors determining stress resilience of corals and the potential for coral populations to recover from or adapt to these stressors . The responses of coral holobionts to heat and light stress are especially complex because of the symbiotic association between reef-building corals and algae (zooxanthellae). The effects of light and temperature stress on these intracellular symbionts have been well characterized , but the genetic determinants of stress tolerance in the coral host, and any interaction effects between coral and zooxanthellae phenotypes, remain poorly understood [7, 8]. Genomic resources such as genome or transcriptome sequences would make possible the detailed studies of gene expression, genetic connectivity, and stress physiology required for addressing these questions.
The genomic sequence resources currently available for corals are limited . About 11,000 to 15,000 reads from pilot shotgun sequencing projects are available at NCBI for each of three corals (Acropora millepora, Acropora palmata, and Porites lobata), but no coral genomes have been completed and to our knowledge no genome sequencing projects are underway for any coral species. A number of studies have developed EST resources for corals using Sanger sequencing [10–12]. The staghorn coral Acropora millepora (Ehrenberg, 1834) has emerged as the most extensively sequenced scleractinian coral, with ~10,000 ESTs and ~14,000 shotgun genome sequences publicly available at NCBI. Additional EST sequencing projects using this species are ongoing but the results are not yet publicly available . These EST sequencing efforts have allowed development of small-scale microarrays for gene expression analysis in the context of coral stress physiology , and similar studies aiming to identify stress candidates are currently underway . These studies have highlighted the utility of cDNA sequencing for candidate gene discovery in the absence of a genome sequence, but a comprehensive description of the full complement of genes expressed in corals remains unavailable.
The increased throughput of next-generation sequencing technologies such as 454 sequencing  shows great potential for expanding sequence databases of corals and other emerging model organisms (reviewed in ). 454 sequencing of transcriptomes for organisms with completed genomes has confirmed that the relatively short (100–200 bp) reads produced by current versions of this technology can be effectively assembled and used for gene discovery [17, 18]. These methods have not yet been widely applied to emerging model organisms, because of the lack of methods for de novo assembly and analysis in the absence of a reference genome sequence. Nevertheless, the few examples published to date have successfully demonstrated the potential for discovery of genes and genetic markers in these systems [19–23]. Despite their obvious potential, next-generation sequencing methods have not yet been applied to corals.
In this study, we describe improved methods for cDNA library preparation and titration for de novo transcriptome sequencing of any organism using 454, as well as sequence analysis and annotation procedures. The strategy outlined here does not require prior sequence knowledge, and relies exclusively on publicly available software and basic scripting tools. We applied these methods to sequence the larval transcriptome of the widely-studied coral A. millepora. The assembled, annotated sequences so produced provide a nearly complete catalog of the genes expressed in planulae larvae.
Summary of sequencing, assembly, and analysis
Raw sequencing reads
Trimmed & size-selected
Assembly of the trimmed, size selected reads along with the publicly available EST sequences (NCBI's dbEST) for A. millepora produced 44,444 contigs, with 62,657 reads remaining as singletons. Contigs ranged from 86 to 7,830 bp in size, with an average of 440 bp and an N50 of 693 bp (i.e., 50% of the assembled bases were incorporated into contigs 693 bp or longer). The size distribution for these contigs is shown in Fig 2B. The assembly produced a substantial number of large contigs: 11,813 contigs were >500 bp in length, and 4,654 were >1 kb. Sequencing coverage ranged from 1 to 65, with an average coverage of 5. As expected for a randomly fragmented transcriptome, there was a positive relationship between the length of a given contig and the number of reads assembled into that contig (Fig 2C). The majority of assembled reads were incorporated into the top 20,000 largest contigs, which included 74% of the bases sequenced and accounted for 47% of the total assembled length (Fig 2D).
Validation of singleton sequences and the contig joining procedure by PCR amplification and Sanger sequencing.
Aligned region (bp)
5' match, bp
3' match, bp
Of the 107,101 assembled sequences (contigs plus singletons), 24,850 had significant matches in public protein sequence databases (nr). These blast matches correspond to 17,707 different unique accession numbers, of which 2,331 were each matched by multiple queries without overlap. These 2,331 subject sequences correspond to 5,327 different query sequences (~2.3 queries matched each subject, on average). For each of these protein sequences, all queries matching that subject were joined together in the appropriate order and strand (sense/antisense) to produce a scaffold, a construct intended to represent the collection of fragments originating from a single transcript. Overall, the contig joining process condensed the number of sequences from 107,101 to 104,005. To validate the contig joining procedure, ten sets of sequences that had been joined in this fashion (i.e., 10 scaffolds) were randomly selected for validation. Note that because the gaps in scaffolds are of unknown length, the scaffold sequences provide minimum expectations for PCR product size. All ten scaffolds were successfully amplified by PCR (additional file 1). Sanger sequencing of these PCR products confirmed the identity of eight scaffolds at both ends, and the other two at only one end (Table 2). These findings confirm that the scaffolds produced from our contig joining procedure successfully recapitulate cDNA sequences in most cases. This strategy successfully identified different parts of the same transcript for >2,000 transcripts in our study, and should be widely applicable for other transcriptome sequencing efforts in emerging model organisms.
The preceding steps, including assembly of raw reads and contig joining, were expected to reduce redundancy among sequences (i.e., more than one sequence per gene). To characterize any redundancy that remained, assembled sequences were clustered using two parallel procedures based on (1) nucleotide sequence similarity among the assembled sequences themselves, and (2) protein sequence similarity among the query sequences' best blast matches. The final set of 93,466 merged clusters was obtained from the union of these protein and nucleotide sequence clusters. Excluding the large clusters matching transposable elements, sequence clusters ranged from 1 to 64 sequences in size, with most containing only 1–2 sequences and a small number of clusters containing more than 2. The largest clusters could indicate paralogs, highly-divergent alleles, well-conserved gene families, alternative splice variants, or simply high levels of expression with correspondingly high sequencing error due to increased sampling. Regardless of the biological interpretation of sequence clusters, the clustering of 104,005 assembled sequences into 93,466 clusters indicates that relatively little redundancy remained after assembly and joining. Annotation of sequences clusters allows the rapid identification of all sequences with similarity to a particular gene of interest, facilitating studies of sequence polymorphisms and closely related genes.
Summary of annotation of the A. millepora larval transcriptome.
≥ 300 bp
≥ 1000 bp
Total number of sequences
Sequences with BLAST matches
Sequences matching known genes
Sequences assigned GO terms
Sequences with conserved domains
Assembled sequences were assigned gene names based on the gene product and gene name annotation of the best blast match for that sequence. This procedure successfully assigned gene names for 15,860 sequences among the entire dataset; for 9,464 of the sequences ≥ 300 bp in length; and for 3,889 of the sequences ≥ 1 kb in length (Table 3). Among the 15,860 annotated best hits, 11,633 different gene names were assigned, providing a rough estimate of the number of different genes expressed in these libraries. This is probably an underestimate because many sequences lacked matches in public sequence databases (Table 3) and were therefore not assigned gene names.
Analysis of protein domains revealed that 12,785 of the assembled sequences matched profiles in NCBI's conserved domains database, corresponding to 4,475 different domains. Most domains were found in only 1–2 sequences, with a small number appearing more frequently. The top 20 most frequently detected domains include conserved domains associated with transcription factors, growth factors, and signaling pathways. The utility of domain annotation is that it allows researchers to quickly select all genes sharing a common domain; for example, selecting the 60 different sequences matching the domain COG5576 (homeodomain) retrieves putative homologs of 51 different genes with known roles in development and morphogenesis (e.g., SIX4, PAX3, dll, and engrailed).
Candidate genes identified based on GO annotation of A. millepora larval transcriptome.
Example gene (match accession)
Response to stress
Heat shock factor-binding protein (Q5RDI2)
Response to heat
70 kDa Heat shock protein (P17879)
Response to oxidative stress
Response to wounding
Phospholipase A2-activating protein (P27612)
Exocyst complex component 5 (P97878)
H-2 class II histocompatibility antigen (P04441)
Nitric oxide metabolism
Nitric oxide synthase (O19132)
60 kDa heat shock protein (P18687)
Vacuolar transport and organization
Vacuolar protein sorting-associated protein 26B-B (Q6DH23)
Genes from essential metabolic pathways and macromolecular complexes annotated in larval transcriptome.
Genes found (n)
Known genes (n)
Nuclear pore complex
Intercellular signaling pathway genes annotated in larval transcriptome.
Signal transducer and activator of transcription
Nuclear factor NF-kappa-B
Hepatocyte nuclear factor 4
Retinoid-related orphan receptors
Receptor tyrosine kinase
Major transcription factor families identified by conserved domain annotation of larval transcriptome
Conserved domain description
Cold-shock DNA-binding domain
Transcriptional Coactivator p15
Paired amphipathic helix repeat
Signal transducer and activator of transcription
P53 DNA-binding domain
RFX DNA-binding domain
NF-X1-type zinc finger protein
E2F/DP winged-helix DNA-binding domain
SRF-type transcription factor
Basic region leucine zipper & bZIP
Helix-loop-helix DNA-binding domain
Myb-like DNA-binding domain
Zinc finger, CCCH type
Zinc finger, C2H2 type
Sequence comparisons revealed broad similarity between the coral sequences described here and the protein sequences predicted from the anemone genome. Of the 104,005 assembled coral sequences, 20,619 have significant matches among the anemone predicted proteins (blastx with adjusted e-value threshold of 10-4). These correspond to 11,572 different predicted proteins. Reciprocal blast searches, in which the anemone sequences were queried against the coral sequences, revealed that 20,976 of the 27,273 N. vectensis proteins had significant similarity with coral sequences (tblastn), corresponding to 9,802 different coral sequences. Comparison of the best blast matches from these searches identified 8,515 unambiguous orthologs between coral and anemone, based on significant reciprocal best matches. Because the coral sequences represent partial transcripts, this is likely an underestimate of the true number of anemone orthologs expressed in this developmental stage.
To investigate whether the coral larval transcriptome contains "coral-specific" genes not found in the anemone, the assembled coral sequences were first compared with public databases. 22,158 coral sequences had significant matches in nr with an e-value ≤ 10-4; these represent sequences with reasonably strong matches to previously identified proteins in other organisms. A list of all coral sequences that showed even weak similarity to anemone proteins was compiled by searching the predicted proteins (blastx) and the genome assembly (tblastx), using a permissive threshold of e ≤ 0.01 (n = 28,190 sequences). Comparison of these results revealed that 748 coral sequences matched known genes in other organisms but had no significant similarity to anemone proteins from the draft genome. To reduce false positives stemming from the fact that many assembled sequences were not complete cDNAs, the top blast match for each of these (i.e., a complete protein sequence for that gene, from a different organism) was queried against the anemone proteins with a permissive threshold of e ≤ 0.01. After eliminating the sequences with matches in this search, 207 sequences remained, corresponding to 95 different annotated coral genes. These analyses suggest that these coral genes lack orthologs in the anemone. Because the draft assembly of the anemone genome is not complete , further experimental work would be required to verify the absence of these candidate coral-specific genes in the anemone.
Twenty of these predicted SNPs (corresponding to 18 different genes) were selected for validation using PCR and Sanger sequencing, and 14 of these tests (70%) were successful (additional file 3). Because the genes selected for validation include several previously described stress responses genes , the SNPs confirmed in these analyses may be valuable for investigating the genetic determinants of stress-gene expression. Overall, the successful experimental validation of the majority of computationally predicted SNPs confirms the utility of mining 454 transcriptome sequences for genetic markers.
The methods described here resolve several challenges associated with 454 sequencing of transcriptomes. Because 454 sequencing does not efficiently process homopolymer regions greater than 8 bp in length , the poly-A tails at the 3' ends of intact transcripts present a problem, and would be expected to result in under-representation of the 3' ends of transcripts. In our method, this is resolved by simply interrupting the poly-T region with a single C. A single copy of this 'broken-T' adaptor was found in 98,276 of the raw reads (16% of the total), slightly more than the expectation (~10%) based on average read length (232 bp) and assuming an average transcript size of 2,200 bp. This comparison suggests that the 3' ends of transcripts were well-represented in our dataset, confirming the effectiveness of this solution for overcoming the homopolymer problem. Another difficulty in 454 transcriptome sequencing is concatenation of adaptors during cDNA preparation, an issue that is commonly encountered but seldom reported because concatenated adaptor fragments either fail to sequence or are removed by adaptor trimming, decreasing the final sequencing yield. Our library preparation method resolves this difficulty by (1) sequencing all fragments from an internal end, so that any concatenated adaptors encountered will tend to be found at the end of the read; and (2) avoiding the over-amplification that can contribute to adaptor concatenation. The emPCR titration method described in this study reflects an additional improvement on previous methods that further increases the number of usable reads obtained per run.
Comparison of our results with recently published studies that employed 454 sequencing confirms the effectiveness of these improvements. First, 95% of the reads produced in our study passed stringent quality filters (i.e., adaptor trimming and size selection), an improvement over the 85–90% that passed similar quality filters in previous studies [17, 18, 23]. Our assembly incorporated 90% of high-quality reads, improving on the 40–88% reported for assemblies in previous studies [17, 21, 23]. The average contig size, which is strongly affected by read length, was ~90 bp higher in our study than in a recently published study that also used the 454 GS-Flx sequencing platform and so began with reads of comparable length . Our assembly also included a larger proportion of long contigs (27% of contigs were ≥ 500 bp) than the 8% reported in , despite the larger number of reads (>1 million) from multiple sequencing runs in that study. Finally, our library preparation and titration procedures produce 2–5 times more reads per sequencing run than standard methods: ~620,000 in our study versus ~130,000 to ~390,000 in previous reports [17, 21–23]. One possible caveat for these comparisons is that our assembly did include a small number of Sanger ESTs (~11,000), and these might be expected to improve the assembly. However, the effect of these Sanger ESTs on the assembly was actually negligible: a separate assembly that excluded those ESTs produced only 0.4% more contigs, 1% more scaffolds, and 4.8% shorter contigs. Overall, these comparisons with previous studies highlight the potential for our method to improve sequencing depth and efficiency in future transcriptome sequencing efforts.
Deep sequencing of transcriptomes with 454 sequencing typically produces many singleton sequences that fail to assemble, representing 10% of the reads in our study and 13–60% in other recent 454 transcriptome assemblies [17, 19, 23]. Singletons could result from sequencing errors, artifacts of cDNA normalization, genes expressed at low levels in the coral, or contaminants from other sources. For example, our cDNA was derived from larvae grown in non-sterile conditions and exposed to crustose coralline algae (CCA) collected from the reef, suggesting potential sources of contamination. To test for differences between cDNAs that remained as singletons and those that were assembled into contigs, a subset of 10,000 reads was randomly selected from each set. Because this comparison was based on sequencing reads, which did not differ in size between the two sets, this comparison avoided the confounding effect of sequence length. Contig reads and singleton reads were compared separately against nr using blastn (e-value ≤ 10-6), and the taxonomic origin of their top blast matches was recorded. This confirmed that most sequences in both sets best matched metazoan sequence records from other species, although singleton sequences contained a significantly lower proportion of metazoan matches (89%) than contig reads (95%) (Fisher's exact test; P < 0.01). Singletons contained a significantly higher proportion of sequences that best matched proteobacteria (P < 0.01) and amoebozoa (P < 0.05) records than reads that assembled into contigs, suggesting that singletons were enriched in cDNA from biological contaminants. However, the absolute numbers of putative contaminant hits were still very low (3–4 sequences for each non-metazoan group), suggesting that the contamination level was low overall. A notable difference revealed by these comparisons is that a larger proportion of contig reads (10%) had significant matches than singleton reads (2%) (Fisher's exact test; P < 0.001). This suggests that a substantial fraction of singletons (but not all) were artifacts of cDNA synthesis, normalization, or sequencing. Despite the potential artifacts and contaminants detected among singletons, 5171 of the total (62,657) sequences matched annotated protein sequences in the nr database, corresponding to 3989 different genes (blastx, e-value ≤ 10-4). These findings support the inclusion of singletons in our analysis on the basis that many represent coding sequence and so could be biologically interesting despite their low expression.
One of the primary questions for transcriptome sequencing projects is the number of genes expressed, but in organisms lacking a fully-sequenced genome, addressing this question is hampered by several unknowns. First, some contigs are expected to lack matches in public databases and therefore remain as separate sequences after contig-joining. Other genes present in a newly-sequenced transcriptome might be represented by fragments that match conserved regions in known genes, as well as fragments corresponding to poorly-conserved regions (e.g., un-translated regions) that would be expected to lack matches. Finally, some sequence fragments might be too short to allow for statistically meaningful matches, regardless of which transcript they were derived from. At the level of coverage obtained in a single sequencing run (~5-fold), sequence assembly alone is not sufficient to produce complete transcript sequences. Because the contig joining procedure described here relies on protein sequences similarity, it therefore cannot act on sequences without protein matches. No obvious solutions are available for assembling the sequences from a non-model system transcriptome sequencing project into a number of contigs that precisely matches the number of genes expressed, although paired-end library sequencing might improve the situation.
To arrive at an estimate of the number of genes expressed in this library, several alternative strategies are available that take into account the sources of uncertainty described above. (1) Gene number can be estimated based on the number of well-supported, unique sequences in the assembly as follows. The majority of sequenced bases (>70%) assembled into sequences at least 300 bp in length (Fig 2D), excluding essentially all singleton sequences (Fig 2A). The accuracy of these 19,210 assembled sequences is well supported by their high sequencing coverage (24 reads per assembled sequence, on average). Among these high-quality sequences, 16,360 sequence clusters (representing groups of highly similar sequences) were identified, providing one estimate of the number of genes expressed in these libraries. (2) Alternatively, gene number can be estimated based on the assembled sequences that matched known genes, as follows. Among those contigs sufficiently long to allow good matches (≥ 300 bp), 11,901 showed significant similarity to records in the nr protein database. Gene names were assigned to 9,464 of these on the basis that they matched known genes with annotated coding sequences, corresponding to a non-redundant list of 7,815 different gene names. Of course, many cnidarian genes are expected to lack matches in public databases; for example, 19% of predicted proteins from the anemone N. vectensis lack matches to other known genes . Based on the assumption that a similar proportion of coral genes would lack matches to known genes, this suggests an adjusted estimate of 9,591 genes expressed in A. millepora larvae (i.e., 7,815/0.81). (3) Finally, gene number can be estimated based on comparisons with the anemone N. vectensis. Based on blastx comparisons, 20,617 of the assembled coral sequences matched to 11,572 different predicted proteins from the anemone genome. Because more redundancy and gaps are expected among our assembled coral sequences than in the draft anemone genome, this should approximate the number of genes expressed in our libraries. Although no clear solutions are available to conclusively determine gene numbers without a fully-sequenced genome, the three estimates described suggest that ~11,000 different genes are expressed in larvae of A. millepora.
This estimate matches well with previous estimates of the number of genes expressed during metazoan development. About 11–12,000 genes are expressed during sea urchin development based on whole-genome tiling arrays . At least 6000–8000 genes are expressed during each embryonic or larval stage of nematode development, based on SAGE libraries . About 10,000 genes are expressed in fruit fly embryos . Comparison with these previous studies that span different animal phyla and analytical techniques reveals that our estimate of 11,000 genes expressed in coral larvae is within the expected range for metazoan development.
To evaluate whether the sequences annotated in this study includes all genes expressed in these treatments and developmental stages, we searched for a number of genes involved in metabolic pathways and macromolecular complexes involved in essential cellular processes. For these searches, the lists of genes associated with metabolic pathways were based on the BioCyc database of pathways , and gene lists for protein complexes were based on the CORUM database of mammalian protein complexes . The rationale for these searches was that since those essential genes must be expressed to maintain cellular functions, any failure to find these sequences in the transcriptome would reflect either inadequate sequencing depth or ineffective annotation. For the pathways and complexes considered here, essentially all genes (91–100%) were found based on these simple text searches (Table 5). A caveat for these findings is that high levels of expression might be expected for of some essential 'housekeeping' genes, causing them to be well represented in even an incomplete transcriptome sequencing effort. To account for this possibility, we also searched for genes associated with intercellular signaling pathways and transcription factor genes. Because of their more restricted spatial and temporal patterns of expression, these genes are not expected to be as highly expressed in whole-organism libraries as the essential 'housekeeping' genes. The query lists for these searches were based on genes and pathways conserved among primitive metazoans and cnidarians [12, 25, 31]. These searches successfully identified genes from nearly all of the pathways and transcription factor families considered (Tables 6, 7). The few genes that could not be found might result from incomplete annotation, inadequate sampling of the transcriptome, or genes that are truly not expressed. Regardless of the cause, the impact appears to be minimal (i.e., almost all genes from these pathways were found). Overall, these searches support the conclusion that the collection of annotated sequences produced in this study represents a reasonably complete description of the coral larval transcriptome.
One immediate application for the annotated sequence resource developed in this study is the design of microarray probes for gene expression analysis of biological processes such as heat tolerance or innate immunity. To that end, we have selected a set of 11,000 sequences for probe design based on the following criteria. To ensure that all sequences were sufficiently long to allow design of high-quality probes, we selected the subset of sequences ≥ 250 bp in length (n = 40,686). Among these, 10,652 matched annotated genes in public sequence databases, allowing unambiguous identification of the sense (protein-coding) strand. These included 8,523 clusters of similar sequences; within each cluster, sequences were ranked by coverage, and the most highly-covered member of each cluster was selected. Finally, based on gene name annotation, for any cases where two sequences were annotated with the same gene name, the sequence with higher coverage was selected. This produced a set of 7,930 well-annotated sequences corresponding to known genes. These sequences have been adjusted to ensure that the sense strand is represented, based on blast matches. To bring the total to 11,000, the remaining sequences ≥ 250 bp in length without matches in public databases (n = 30,034) were ranked by coverage and the 3,070 most highly covered sequences selected. For those sequences without blast matches, the strand information (sense/antisense) is not known. One important practical note about the use of these sequences for gene expression analysis is that probes or primers designed based on these sequences should be designed so as not to span the junctions produced by contig-joining. The set of sequences selected above provides a valuable resource for designing microarray probes, and the current widespread use of microarrays for gene expression profiling of corals [32, 33] suggests that this resource will be immediately useful.
Although A. millepora is an emerging coral molecular biology model for molecular ecology studies, few genetic markers are currently available for this species . These markers are obviously not enough for genome-wide association studies of coral physiological variation, which is the focus of ongoing projects in our laboratory addressing the molecular and evolutionary mechanisms of coral adaptation to climate change. This study, like others recent published [23, 35], shows a cost-effective way to produce a large number of gene-associated SNPs from transcriptome data obtained by 454 sequencing. QualitySNP is a newly developed program and uses a haplotype-based strategy to detect reliable SNPs without requiring sequence traces, quality scores, or genomic sequence data . In our study, 70% of predicted SNPs were confirmed experimentally. These findings clearly demonstrate the utility of SNP mining from 454 transcriptome sequences, and provide the most extensive genetic marker resource currently available for A. millepora.
In this study, we describe methods that will facilitate transcriptome sequencing for organisms that present important biological questions but lack fully-sequenced genomes. The laboratory procedures outlined here provide a rapid and cost-effective method for deep transcriptome sequencing in a single 454 sequencing run, improving substantially on previous methods. The analytical strategies we describe for adaptor trimming, assembly, joining, and annotation can all be accomplished using publicly available programs and basic scripting tools, and produce useful collections of annotated sequences from short reads. Our findings provide a nearly complete description of the genes expressed in coral larvae, a resource that is expected to be immediately useful for measurements of gene expression in reef-building corals, in addition to a large number of genetic markers for studies of genetic connectivity and structure. The single 454 sequencing run employed in this study produced more than 600,000 sequencing reads, allowing the identification of ~11,000 genes and ~30,000 SNPs. These findings represent a substantial contribution to the existing sequence resources for reef-building corals. Application of these resources will greatly enhance our understanding of the potential for corals to adapt to increasing environmental stress during climate change.
Because adult coral colonies contain algal symbionts that could result in contamination of extracted RNA, our experiments focused on a developmental stage (planulae larvae) that, in this species, lacks those symbionts. Adult colonies of Acropora millepora were collected at Magnetic Island (Townsville, QLD, Australia) and placed separately into individual bins prior to spawning. Corals were allowed to spawn naturally, following the synchronized mass spawning schedule typical for this and many other coral species , and then returned to their original locations on the reef. For each colony, gamete bundles were gently sieved through a 130-μm pore-size nylon mesh to separate sperm from eggs. Eggs from one individual colony were suspended in 1 l of 1-μm filtered seawater (FSW), and fertilized with sperm from a different individual colony. After allowing ~5 h for fertilization, excess sperm was removed by rinsing with FSW and embryos were stocked at 2 per ml in 2-l plastic culture vessels filled with FSW.
To maximize the diversity of expressed genes in our experimental material, we exposed these larvae to a range of different treatments expected to affect gene expression. These treatments included temperature stress and known inducers of settlement and metamorphosis, in order to induce expression of the genes associated with those processes. Three replicate cultures were maintained for 5 d at a standard culturing temperature of ~28°C, and another three replicates at an elevated temperature of ~32°C. Culture water was exchange for fresh FSW regularly throughout development.
Larvae grown at the standard (lower) temperature were pooled between culture replicates at 5 d post-fertilization and those pooled larvae used in short-term treatments expected to affect gene expression. One group was incubated at the higher temperature for 4 h (heat stress). Another group was not given any additional treatment, to serve as a control for any effects of handling during these short-term treatments. A third set of larvae from these same cultures was incubated with a known natural inducer of larval settlement : crustose coralline algae (CCA). Other larvae were incubated with an artificial inducer of metamorphosis, a GLW-amide peptide called EPLPIGLW-amide . Using light microscopy, we confirmed that the CCA treatment induced behavioral and morphological changes associated with settlement, and the GLW-amide treatment induced metamorphosis directly, bypassing those subtle changes. After 4 h incubations in each treatment, the larvae or recruits were preserved in RNALater (Ambion, Austin, TX, USA). Larvae were sampled from the different long-term culture conditions (28 and 32°C) on that same day and stored in the same manner, for a total of six different samples. RNA was extracted from these samples using the Ambion RNAqueous-micro kit (Ambion) according to the manufacturer's instructions.
The methods developed in this study for cDNA synthesis, amplification, and fragment library preparation in preparation for 454 sequencing are illustrated in Fig 1. RNA preparations were purified by precipitation with lithium chloride prior to cDNA synthesis. First-strand cDNA was produced using this purified RNA according to Clontech's SMART cDNA synthesis kit, with the following modifications. The primer used for first strand synthesis was a modified oligo-dT primer (AAGCAGTGGTATCAACGCAGAGTCGCAGTCGGTACTTTTTTCTTTTTTV). The poly-T stretch is broken by the inclusion of an internal C to minimize the potential for 454 sequencing problems in this homopolymer stretch. The remainder of cDNA synthesis protocol was conducted according to the manufacturer's instructions.
cDNA was amplified using the "CAP" primer (AAGCAGTGGTATCAACGCAGAGT) according to the manufacturer's instructions (SMART cDNA synthesis, Clontech), taking advantage of the PCR Suppression effect to preferentially amplify longer molecules and thus enrich for full-length transcripts . Throughout these cDNA amplification procedures, precautions were taken to minimize distortions of the samples based on previously described principles . Following amplification, PCR products were evaluated by gel electrophoresis to confirm that the appropriate range of molecular weights had been produced (0.5–3 kb), and that the libraries were not over-amplified (i.e., more than 200 ng PCR product per 30-μl reaction), which can distort expression profiles and lead to adaptor concatenation. To maximize the amount of fully double-stranded PCR product, the reactions were 'chased' with additional cDNA amplification primer (1 μl of 10 μM stock) and incubated at 78°C for 1 min, 65°C for 1 min, then 68°C for 3.5 min. Multiple PCR reactions were conducted for each library, then pooled and purified using the Qiaquick PCR Purification kit (Qiagen, CA, USA). Finally, equal quantities of cDNA from each of the six libraries were pooled to produce a library expected to include all genes expressed in the different treatments.
To minimize differences among the abundance of different transcripts (i.e., of genes expressed at different levels), 2 μg of the pooled, amplified cDNA was normalized using the Trimmer kit (Evrogen, Moscow, Russia) according to the manufacturer's instructions. This sample contained equal amounts of cDNA from each library (i.e., two different long term culture conditions and four different short term treatments). Following normalization, the cDNA was amplified for 15 cycles as described above. 16 reactions (30 μl each) were amplified, then pooled and purified as described above (Qiaquick PCR purification).
cDNA was sheared by sonication to produce short, random fragments appropriate for 454 sequencing. This was accomplished by first preparing 5 μg of amplified, normalized cDNA in a 100-μl volume. The pooled sample was submerged in an icewater bath and sonicated using a Misonix S3000 (Misonix, NY, USA) in 30 sec bursts followed by 30 sec rests, with power set at 0.5–1 (18–30 W). Fragmented cDNA samples were evaluated by gel electrophoresis to visualize the effectiveness of the process. A range of different sonication durations were tested, and the 3 minute treatment selected because it produced fragments in the appropriate size range (ca. 300–400 bp).
Oligonucleotide adaptors were ligated to the fragmented cDNA to facilitate 454 sequencing. First, the fragmented cDNA was 'polished' using Klenow and T4 DNA polymerases (New England BioLabs, Ipswich, MA, USA) to fill in single-strand overhangs that might remain after sonication. A partially double-stranded adaptor containing the standard 454 sequencing primer 'B' was prepared by combining oligonucleotides 'B+1' (GCCTTGCCAGCCCGCTCAGACGAGCGGCCA) and its partial complement, 'anti-B+1' (TGGCCGCTCGT) at a final concentration of 10 μM each. A second partially double-stranded adaptor containing the standard 'A' sequence was prepared by combining oligonucleotides 'A+' (GCCTCCCTCGCGCCATCAGCCGCGCAGGT) and 'anti-A+' (ACCTGCGCGG). The partially double-stranded structure of these adaptors ensures that they are ligated in the correct orientation and minimizes tandem ligations, since T4 DNA ligase requires a double-stranded DNA template. Adaptors were ligated to fragmented cDNA using T4 DNA ligase (Promega, WI, USA) in an overnight reaction at 12°C. Following ligation, constructs were purified using the Qiaquick PCR Purification kit to remove excess adaptors and reagents (Qiagen, CA, USA).
Ligations were tested by PCR amplification using different combinations of primers to verify that the correct constructs had been produced. Constructs were then amplified through a 'step-out' PCR scheme, using primers 'A' (GCCTCCCTCGCGCCATCAG) and 'B' (GCCTTGCCAGCCCGCTCAG), each at 0.2 μM final concentration, and 'A+CAP' (GCCTCCCTCGCGCCATCAGCCGCGCAGGTAAGCAGTGGTATCAACGCAGAGT) at 0.01 μM. Amplification was carried out for 17 cycles as described above, including chasing with an additional aliquot of primers and column purification following PCR. Finally, the constructs were labeled with biotin by repeating the amplification using a 5'-biotin-labeled 'A' primer. For this final labeling reaction, primers 'biotin-A' and 'B' were each used at 0.2 μM, and the reactions were amplified for 3 cycles. The final product was again column purified, to remove unincorporated oligonucleotides and biotin, and stored at -20°C prior to sequencing. Note that because our library preparation method explicitly controls for adaptor structure (i.e., only the correct constructs are amplified), the biotin labeling steps, which are required in typical 454 library protocols, are not actually required for our method.
A step-by-step protocol for the library preparation method outlined above is provided in the additional files accompanying this article (additional file 4). That protocol includes minor modifications that we have recently developed to reduce the number of manipulations required, improve reproducibility, and to further reduce adaptor concatenation. The improved protocol in that file is recommended for researchers interested in using our methods for 454 sequencing of transcriptomes.
The biotin-labeled library described above was further processed prior to 454 sequencing to enhance yield and sequencing quality. First, the pooled sample was purified using AMPure size exclusion beads (Agencourt, MA, USA). This size-selection is essential for efficient emulsion PCR (emPCR) because small fragments would be preferentially amplified, causing eventual sequence loss by decreasing the overall average read length. Next, the biotin-linked fragments were captured onto streptavidin-coated paramagnetic beads (Invitrogen, CA, USA) and denatured using sodium hydroxide. This procedure was used to purify the fragments containing both adaptors (biotin-A and B). Incorrectly ligated fragments containing B adaptors at both ends failed to bind to beads initially, allowing their removal. After denaturation, those fragments with two biotin-A adaptors remained attached to the beads, leaving the correct constructs (with biotin-A and B) in the supernatant. Note that, because our library was essentially free of incorrect adaptor configurations (e.g., two A adaptors, etc.), this biotin-streptavidin purification step was not actually necessary, but is described here for completeness. Finally, the size distribution of this purified, single-stranded template cDNA (sstcDNA) library was analyzed using an Agilent Bioanalyzer, and the sample was quantified using the Quant-iT OliGreen ssDNA Assay (Invitrogen).
Following cleanup and quantification, the sample was titrated to optimize yield and sequence quality. This procedure was conducted essentially as outlined in the Long Pair-End Library Preparation protocol (Roche, IN, USA) with the following modifications. First, emPCR reactions were performed in triplicate with different amounts of input sstcDNA (0.5, 1, 4, and 16 copies per bead). The emulsions were broken and the DNA Capture beads collected and pooled by sstcDNA concentration. Each pool was then enriched for beads carrying sstcDNA according to the manufacturer's instructions (Roche). The enriched bead samples were then counted using a Z1 Coulter Counter (Beckmann Coulter, CA, USA) to calculate the percent enrichment (i.e., the percent of initial beads that contained sstcDNA). Based on a linear regression of these percent enrichment values against the initial sstcDNA amounts, we calculated the amount of input DNA expected to produce 10–15% enrichment. In our experience, this range of enriched bead recovery maximizes the number of beads containing a single amplified fragment of sstcDNA, while minimizing the number of beads containing multiple fragments. This procedure is essential for maximizing the number of reads produced by 454 sequencing. Based on these analyses, large-scale emPCR was conducted based on the calculated optimum sstcDNA amount, and sequenced using the 454 GS-Flx instrument according to the manufacturers' instructions (Roche).
All sequence analyses were conducted using publicly available software and custom Perl scripts that relied heavily upon BioPerl modules . Those bioinformatics scripts, and a detailed protocol describing their use, are available at the authors' website .
Adaptor sequences that would prevent correct assembly of the raw sequencing reads were removed following a similar strategy to that used in NCBI's VecScreen tool, based on the known adaptor sequences used in library preparation. Known adaptor sequences were compared against the raw reads using standalone blast (blastn) with permissive settings, to allow detection of weak matches as well as perfect matches (-W 4, -F F, -e 6000). Short matches (≥ 8 bp) were recorded if they fell between two other adaptor matches, or within 10 bp of either end of the read. Longer (≥ 14 bp) matches were required for internal adaptor matches. The adaptor matches so identified, and all sequence distal to those matches, were trimmed from each read using custom scripts.
Most of the trimmed reads (92%) were between 100 and 300 bp in length, but a small proportion fell outside this range. On the assumption that those outlier reads might represent rare sequencing artifacts, we size-selected the cleaned reads to eliminate any reads smaller than 60 bp or larger than 340 bp. Those size thresholds were chosen based on analysis of a preliminary assembly in which all reads within those size ranges failed to assemble, supporting the interpretation that they might represent sequencing artifacts.
The trimmed and size-selected reads were assembled using the Newbler assembly program (Roche). To account for adaptor trimming and sequence quality in this assembly, raw sequence files (SFF format) were modified based on the adaptor trimming information described above, and these modified SFF files processed with Newbler. The assembly also included the publicly available EST A. millepora sequences from NCBI (n = 11,829), which had been cleaned based on comparisons with the UniVec database of vectors and adaptor sequences . The overlap settings used for this assembly were 30 bp and 90% identity, with all other parameters set in Newbler at the default values.
The inclusion of singleton sequences in our dataset was supported by bioinformatic analysis and experimental validation. To evaluate whether singletons contained unique sequences not found among contigs, singletons were compared with contigs (blastn) and both sequence sets were also compared with the Swiss-Prot database (blastx) to exclude those sequences lacking matches to known proteins. The results from these searches were compared to characterize the set of identifiable genes found among singletons but not contigs. Additional experimental support for retaining singleton sequences was obtained by designing primers specific to each of ten randomly-selected singletons using Primer3 . Using 1 ng of the original intact cDNA as template, each target was amplified separately (35–40 cycles of the profiles described above), and the PCR products evaluated by gel electrophoresis to confirm the presence of each sequence in the original cDNA. Finally, the presence of these singleton sequences in intact cDNA samples was confirmed by cloning and sequencing the resulting PCR products (DNA Core Facility at UT Austin).
After assembly of short cDNA sequences, gaps can remain due to uneven sampling across the transcript or the presence of regions that are difficult for the 454 sequencing technique to process (e.g., homopolymer stretches: ). To account for any gaps remaining between contigs corresponding to different parts of the same transcript, assembled sequences were joined according to shared similarity with known proteins. Using blastx, the assembled sequences were first compared with the well-annotated Swiss-Prot database, and those that lacked matches were subsequently compared to the larger nr protein database. The best match, with a significance threshold of e ≤ 10-4, was recorded for each query sequence. For each of these top matches (protein sequences), all queries for which that protein was the best match were considered. If all queries matched to different, non-overlapping regions of the same protein, they were considered to represent different portions of a single transcript. In these cases, the queries (assembled sequences) were concatenated together in the proper order and strand to recapitulate the original transcript. Gaps within these constructs, representing regions of that transcript not sampled in our libraries, were denoted by strings of 'X' characters to mark the junction between each pair of joined sequences. For those cases where two or more queries matched to overlapping regions of the same protein, all assembled sequences that matched that protein were considered to be possible paralogs and were retained as individual sequences.
The procedure used to join contigs into scaffolds was validated by PCR using primers corresponding to scaffold sequences. For each of 10 randomly-selected scaffolds, a forward primer was designed based on one of the constituent sequences, and a reverse primer designed based on the other. Amplification with these primer pairs allowed us to test whether, in the intact cDNA, the two primer binding sites were found on the same cDNA molecule, providing experimental validation of the contig joining procedure. PCR was carried out as described above (but for only 20 cycles), and the products were evaluated by gel electrophoresis then cloned and sequenced at the DNA Core Facility at UT Austin.
Assembly of transcriptome sequences can produce a number of contigs that is larger than the number of expressed genes, reflecting redundancy among the assembled sequences (i.e., more than one sequence per gene). Two complementary approaches were employed to characterize the redundancy remaining after assembly and contig joining. First, the sequences with significant protein sequence blast hits were clustered based on sequence similarity shared between those blast hits. Protein sequences were extracted from NCBI protein sequence databases (Swiss-Prot and nr), based on the top hit accession numbers from the blast report, and clustered using the Blastclust 2.2.15 program from NCBI . This approach assembles single-linkage clusters that include contigs corresponding to different domains of the same gene, as well as paralogs of that gene. A disadvantage of this strategy is that because it depends on protein sequence matches, the sequences lacking protein matches cannot be clustered. To reduce redundancy among those non-matching sequences, the complete set of assembled sequences was clustered based on nucleotide sequence similarity in parallel. The precluster and estcluster programs from the ESTate software package were used for this analysis . Finally, the union of these two sets of clusters was obtained by adding all nucleotide sequence relatives of each protein cluster member to that cluster. Individual sequences were annotated with cluster membership information and maintained as separate sequences for further annotation.
Putative gene names, protein domains, and Gene Ontology terms were assigned to all assembled sequences that shared sequence similarity with previously identified genes annotated with those details. Sequence comparisons and annotation were accomplished using standalone blast from NCBI , WU-Blast 2.0 (04-May-2006) , and custom Perl scripts. For gene name annotation, the assembled sequences were compared against Swiss-Prot and nr protein sequence databases using blastx with a significance threshold of e ≤ 10-4. Gene names were assigned to each assembled sequence based on the best blast hit annotated with a gene name, as determined by parsing the GenBank record features 'CDS' and 'Protein' for that blast hit. Blast hits annotated with uninformative terms were encountered (e.g., 'unknown', 'uncharacterized', or 'hypothetical') were skipped in this analysis. For computational speed, the search was limited to the first 10 significant hits for each query.
Different genes can share certain conserved domains, reflecting a level of sequence similarity not accounted for by simple gene name annotation. To identify conserved protein domains in our library, the assembled sequences were compared with the CDD database of conserved domains . These searches were based on reverse position specific blast (e-value threshold: 10-4), a blast-like algorithm for comparing newly-identified sequences to a set of known profiles . Each sequence that matched a conserved protein domain was annotated with that domain identifier and a description of the domain corresponding to the best rps-blast match for that query.
To annotate assembled sequences with Gene Ontology (GO) terms describing biological processes , molecular functions, and cellular components, sequences were compared against the UniProt-TrEMBL database of protein sequences  using WU-blast. This database was chosen because it has been extensively annotated with GO terms . Each query sequence was assigned GO terms based on those associated with the top blast match for that query, using custom Perl scripts to compare the blast results and the table of annotations. Because the Gene Ontology is hierarchical in structure, the GO terms annotated here can be traced back to one or more parent terms at a given level of detail. For example, inflammatory response (a level 5 term) is a 'child' of the more general level 4 term, response to wounding. Custom scripts were used to compare the annotated GO terms with the full set of parent-child relationships to map these annotations to various levels. To demonstrate the utility of these annotations, we used these GO mappings to search for genes associated with processes and molecular functions known to be involved with stress responses in corals.
Finally, the annotated sequences were compared with the RepBase database of transposable elements (TE) . Assembled sequences were queried against this database using tblastx, and those showing significant matches to known TE records were noted. This allows sequences showing similarity to TEs to be excluded from further analysis when required.
The cleaned sequencing reads produced in this study have been deposited in NCBI's SRA database, along with the assembly output (accession: SRA003728). The assembled and annotated sequences have been deposited in the TSA database (accessions: EZ000637–EZ048765). The complete set of sequences is available as a flat file from the authors' website , and can be searched online at SymBioSys, a sequence database that focuses on cnidarians and their algal symbionts .
Because no genome sequence is available for A. millepora or any other scleractinian coral, the sequences generated in the current study were compared with the predicted protein sequences from the most closely related organism with a fully sequence genome: the anemone Nematostella vectensis . The A. millepora sequences were compared with these protein sequences  using blastx with a significance threshold of e ≤ 10-4. A similar comparison between the N. vectensis genome assembly and the A. millepora sequences was conducted using tblastx, to detect any coral sequences with anemone orthologs that had been missed by gene prediction algorithms. Comparisons between the results from these searches and the annotation of the A. millepora sequences as described above were used to identify genes present in the coral but not the anemone. Reciprocal comparisons of anemone protein sequences against assembled coral sequences (tblastn) allowed identification of orthologous genes. To correct for different database sizes in these searches (~100,000 coral sequences, ~27,000 anemone sequences), e-values for reciprocal searches were adjusted to a standard database size (equal to the nr protein database). Pairs of orthologous genes in coral and anemone were identified based on reciprocal best matches with adjusted e-values ≤ 10-4 or better.
Potential SNPs were detected using the QualitySNP program , which employs a haplotype-based strategy to detect reliable SNPs without requiring sequence traces, quality scores, or genomic sequence data. SNP identification was accomplished through a separate procedure from the main annotation pipeline, consisting of the following steps. First, all 634,070 reads were assembled using the CAP3 program . Next, the contig alignments from this assembly were evaluated to select clusters containing at least 5 reads. Within these clusters, SNPs were identified based on the filters described in .
Twenty predicted SNPs corresponding to 18 genes were chosen for a validation procedure involving primer design, PCR amplification and Sanger sequencing. The 18 contigs containing these SNPs were selected based on sequence similarity with genes presumed to be involved in stress responses in corals. Primers were designed based on several principles as described in  so that all PCR amplifications could be accomplished at the same annealing temperature. PCR reactions were each set up in a 10 μl volume containing 5 ng of the amplified intact cDNA (i.e., prior to sonication), 0.1 μM each primer, 2 mM MgCl2, 0.3 mM dNTP, 1× PCR buffer and 0.5 U Taq DNA polymerase (New England BioLabs). The profile consisted of an initial denaturation step at 94°C for 3 min; this was followed by 30 cycles of 94°C for 30 s, 60°C for 30 s, and 72°C for 30 s; and a final extension step at 72°C for 10 min. After evaluating their molecular weight and specificity based on gel electrophoresis, PCR products were ligated to pGEM-T vector (Promega) and transformed into Top10 E. coli competent cells (Invitrogen). Based on single colonies selected from these transformations, cloned inserts were amplified by colony PCR using M13-27 and M13-41 primers. The resulting PCR products were cleaned with ExoSAP-IT (USB, OH, USA) and sequenced at the DNA Core Facility at UT Austin, using M13-27 as the sequencing primer. Two to twenty colonies were sequenced for each predicted SNP, allowing confirmation (if both alleles were observed) or rejection (if only a single allele was found) of those predictions.
We thank Jeong-Hyeon Choi for assistance in the design and implementation of preliminary sequence cleaning and assembly, and Bart Smith for assistance with PCR and Sanger validation of singletons and scaffolds. This research was supported in part by the Indiana METACyt Initiative of Indiana University, funded in part through a major grant from the Lilly Endowment, Inc., by the University Information Technology Services (UITS) and by The Center for Genomics and Bioinformatics computing group.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.