De novo transcriptome assembly
The advent of NGS technologies has had an outstanding impact on many fields of biology, including genetics , functional and comparative genomics [46, 47] and molecular ecology . The remarkable potential range of application of these techniques will likely move the focus of high throughput sequencing in the near future from genome and transcriptome sequencing to the use in clinical medicine and diagnostics [49–51]. Due to its potential application to deep RNA-seq, NGS has been praised as a cost-effective and revolutionary tool for transcriptomics since the very early stages of its development . Although great technical advances have been made in a relatively short lapse of time in the improvement of both sequencing technologies and sequencing data management, significant challenges linked with RNA-seq still remain unsolved. The major computational issues in the management of NGS data is represented by the reliable de novo assembly of transcriptomes [53, 54]. This is a complex task, due to presence of alternatively spliced transcript variants, gene duplications, allelic polymorphisms and noise due to suboptimal sequence quality, which often leads to the generation of a high number of short and poorly assembled contigs .
The massive amount of sequencing reads obtained from L. menadoensis liver and testis allowed us to apply stringent filtering criteria, both in the processing of raw sequencing reads and in the filtering of assembled contigs, in order to achieve a final set of high quality transcripts and to overcome the most common pitfalls of NGS assemblies. We chose to use the Trinity assembler, able to efficiently recover full length transcripts across a broad range of expression levels but somewhat redundant because of the inclusion of alternatively spliced variants . The Trinity assembly was used as a reference sequence set to be appropriately refined and enriched, whenever possible, by a second de novo assembly performed with the assembler included in the CLC Genomic Workbench. The choice of integrating the Trinity output with the CLC assembly was made because of the empirical observation of a more effective reconstruction of full length transcripts and because of the operational speed of its assembly algorithm, based on de Bruijn graph. As this method, although extremely fast, is known to produce assemblies which are quite fragmented in comparison with other assemblers , only a selected set of assembled contigs was used to improve the Trinity assembly, with a particular emphasis on protein-coding transcripts.
De novo assembly quality assessment
One of the problems most commonly arising from the de novo assembly of RNA-seq data is represented by sequence fragmentation . In order to minimize this problem, as described in the methods section, all the contigs with an average coverage lower than 5 were removed prior to further analysis, reducing the number of contigs from 105,653 to a final set of 66,308 high quality sequences, reducing the fraction of short sequences with a proportional enrichment in longer transcripts (Figure 3). Furthermore, the contig processing strategy we used, graphically summarized in Figure 1, contributed to significantly reduce the sequence redundancy of the assembly (which was calculated to be 17.39%), in respect with the Trinity output (Figure 2). Although several factors can negatively influence the outcome of a de novo transcriptome assembly, affecting the reconstruction of full length sequences, the ortholog hit ratio analysis highlighted good mean and median ratio values and a high proportion of transcripts assembled to their full length (Additional file 1: Figure S1e). Therefore, despite the inevitable presence of broken transcripts, the results of the de novo assembly were extremely satisfying, highlighting that about half of the sequences, contained in the final set of transcripts, was assembled to the full length or very close to it and that just about a quarter of the contigs were resulting from highly fragmented transcripts.
The analysis of the top hit species distribution resulting from BLAST (Figure 5) reveals Gallus gallus as the first species, followed by Xenopus tropicalis. The first teleost fish of the list, Danio rerio, ranked at the sixth place of the list, after the mammal Monodelphis domestica. These results are clearly biased towards organisms whose genome has been largely and deeply studied and annotated, mainly because of the higher quality of genome assemblies, of the more accurate gene predictions and of the higher number of protein sequences deposited in public sequence databases. Nevertheless, the absence of a prominent species with extended sequence homologies to L. menadoensis, neither in fishes nor in tetrapods, is consistent with the phylogenetic placement of lobe-finned fishes. However, for an in-depth analysis of the phylogenetic relationship between coelacanth and these two major vertebrate groups, and for an extended discussion on the implications on tetrapod evolution we refer to the whole-genome scale analysis reported by Amemiya and colleagues .
Compared to those having a positive BLAST result, a higher number of contigs (42,667) were annotated by InterProScan. Since the presence of InterPro domains is a strong indication of coding sequences, these data point out that 64.35% of the coelacanth de novo assembled contigs are coding for proteins characterized by known InterPro domains.
Divergence between the two coelacanth species
The evolutionary divergence between the two species of coelacanth has been a subject of debate for a long time. Although the complete sequencing of mitochondrial DNA highlighted a sequence identity of 96%, variable divergence times have been proposed, ranging from 6 to 40 Mya [7, 10]. The sequencing of the genome of L. chalumnae permitted to extend the comparison to large genomic regions with the available BACs of L. menadoensis, evidencing an identity of 98.7%. Our transcriptomic data offered the opportunity to assess the sequence identity within the coding regions, which resulted to be surprisingly high, standing at 99.73% (see Additional file 1 for methodological details). Nevertheless, while the massive amount of information gathered permits a rather easy calculation of divergence rates, the estimate of divergence time is not such a trivial task, given the uncertainties related to the calibration of a molecular clock : the slow generation time, the absence of other closely related living species, and the allegedly low rate of molecular evolution of coelacanths.
Although both the genic and genomic divergence between the two species are similar to those observed between human and chimp , which diverged 6–8 Mya, we also performed a phylogenomic comparison between coelacanths and T. rubripes/T. nigroviridis, two organisms with a completely sequenced genome, which evolved in an aquatic environment and were subject to somewhat similar selective pressure and whose divergence, based on paleontological evidence, is estimated between 32.25 and 56 Mya . Based on the alignment of approximately 40 Kb of ortholog transcribed sequences in Latimeria, we estimated the substitution rate to be 0,49/100 bp, whereas the substitution rate in the same set of selected genes in the Takifugu/Tetraodon pair was approximately 16 times higher (8,25/100 bp). A simple molecular clock correlation would indicate that the dating of divergence between African and Indonesian coelacanth should be placed between 1.9 and 3.3 Mya. Nevertheless the slower rate of molecular evolution, as well as the considerably longer generation time have to be taken into account, likely moving the divergence time much back in time to a date close to the lower end of the estimates based on mitochondrial DNA.
In metazoans repeat elements cover a considerable part of genomes. Moreover, the transcriptome analysis allowed the evaluation of the transcriptional activity of transposable elements (TEs) which play a key role in gene evolution and genome plasticity. TEs are divided in two classes: Class I is composed of Long Terminal Repeat retrotransposons (LTRs) and Non-LTRs (subdivided in LINEs and SINEs); Class II is composed of DNA transposons.
The RepeatMasker analysis revealed that 11.17% of contigs harbored a repeat and the most represented elements belong to SINE families. The latter result is in line with the studies performed in the Indonesian coelacanth genome [31–33], in which the activity of SINE elements in Latimeria was inferred. The identification of LF-SINEs and DeuSINEs in L. menadoensis transcriptome might confirm that these elements are actually active. Moreover, since their conservation in higher vertebrates, this movement might predate the common ancestor of Crossopterygians, for more than 400 Myr. On the other hand the occurrence of complete SINEs in contigs bearing protein-coding sequence might reveal the gain of new functional roles (exaptation) , as previously described in tetrapod genomes.
Concerning the activity of LINEs, the second most represented interspersed elements, the InterProScan analysis identified amino acidic domains linked to these autonomous retrotransposons. Chicken Repeat 1 (CR1) elements are the most abundant among LINEs. In contrast to the G. gallus genome where these elements are predominant but, with very few exceptions, nonfunctional , in Latimeria they seem to be active. Fragmented LTRs and ERVs (Endogenous RetroViruses) were identified in the transcriptome. This result is in agreement with the analyses on Foamy-like retroviral elements recently discovered in L. chalumnae genome by Han and Worobey  showing many frame-shifts and stop codons. The abundance of the Harbinger DNA transposons in L. menadoensis genome  suggests that Class II elements represent a remarkable fraction of the coelacanth TEs, however our analysis indicates that few DNA elements are expressed. This discordance may be related to the lack of coelacanth specific sequences belonging to this class in the RM database or to their propagation mode. The identification of mobile elements in transcriptomes sheds light on an unexpected genome dynamicity in an organism considered to be a living fossil even from a molecular point of view [21, 25].
RNA-seq mapping on the African coelacanth genome
More than half of the sequence data generated by the RNA-seq of L. menadoensis liver and testis mapped on the genes annotated by Ensembl on the L. chalumnae genome (Table 4), revealing an overall good annotation of the African coelacanth transcripts, even though in some cases the RNA-seq data produced in this study could provide some evidence of additional exons and alternative splicing, given that the 6.97% of the reads corresponded to regions annotated as introns.
Nevertheless, a rather high proportion of reads, close to 40%, could not be mapped on the genes annotated by Ensembl, consistently with the strategy adopted by Ensembl for the annotation pipeline, which is automated and mainly focused on protein-coding gene models. In fact, almost the 35% of the sequencing reads could map on the assembled genomic scaffolds outside from the annotated gene boundaries, revealing that a relevant portion of the transcripts expressed in the Indonesian coelacanth liver and testis might correspond to genes which were not annotated by the Ensembl RNA-seq annotation pipeline (Table 4). Therefore, the deep RNA-seq of liver and testis can be considered as a fundamental tool for the discovery of novel genes, and in particular, of many not yet annotated non-coding transcripts. As a matter of fact, the NGS sequence data will certainly provide a fundamental source of information for the study of atypical transcripts originated by trans- and circular splicing events, a topic which is currently under investigation (Stadler, personal communication).
Slightly more than 3 million reads did not map on the genomic scaffolds. These sequence data could either correspond to mitochondrial RNA (which was esteemed to account for 3.03% and 2.08% of the reads in liver and testis, respectively) or to coding genes harbored in L. chalumnae genomic regions which were not successfully assembled.
Liver and testis transcriptomes comparison
The expression profile of the two organs analyzed was expected to be quite different, considering the largely different tasks they perform and the highly specialized cellular types involved. This difference was immediately evident by the graphical representation of the expression scatter plot (Figure 6). Among the 20 most expressed transcripts in liver, a large fraction is constituted by plasma proteins, whose synthesis is carried out by this organ (such as the three chains constituting fibrinogen, α-2 macroglobulins, apolipoproteins, hemopexin, vitronectin, lipocalin, serum amyloid P and serum albumin) and which constitute the core of the highly expressed genes in this tissue (Table 6). On the other hand testis invests a significant portion of transcription in genes involved in chromatin and cytoskeletal rearrangements. In particular, a testis-specific histone results to be expressed almost 25 times more than the second most expressed gene, prostaglandin H2D-isomerase, and accounts for about 18% of the global testis transcription. A significant amount of the total gene expression is derived from the synthesis of messengers of protamines, used for the replacement of histones and the effective packaging of DNA in the sperm acrosome . The expression of genes involved in chromatin rearrangement is strictly regulated, as testis-specific histones are transiently and selectively expressed only during specific phases of spermiogenesis . In fact, also sperm nuclear basic protein PL-I and histone H1x-like figure among the most representative testis genes. Furthermore a relevant number of other testis-specific genes can be linked to the meiotic process carried out in the testicular germinal cells and to the cytoskeletal rearrangements consequently required (tubulin α chain testis-specific, tubulin β 2-C and centrin-1). Moreover, specific types of microtubules are required for the correct assembly of mitotic and meiotic spindles and of the flagellum axoneme of spermatozoa [64, 65]. The tubulin genes highly expressed in testis are likely linked to these functions.
Although the expression of a large fraction of genes was clearly strictly tissue-specific, thanks to the sequencing depth applied, a relevant overlap between the two transcriptomes (77.37%) was observed. The issue of transcriptome richness was addressed by analyzing the relative contributions of the expression of each contig to the total observed transcription in the two tissues, and in RNA-seq data of L. chalumnae muscle (Figure 7). Highly specialized tissues are expected to invest the most gene expression in a selected set of genes, thus being transcriptionally poor, whereas tissues involved in many different biological functions, characterized by the coexistence of many different cell types are expected to be transcriptionally rich, as they express a broader range of transcripts. Within this picture, muscle is a classic example of a highly specialized tissue, expressing at particularly high levels a limited set of genes involved in the structural organization of muscle fibers and responsible of contraction. Testis expresses a broader range of transcripts, which is in agreement with the assumption that cells in this organ are characterized by drastic morphological and functional changes linked to gamete generation: in this scenario chromatin structure is constantly rearranged and gene expression may therefore be substantially variable during the different stages of spermatogenesis . Despite being transcriptionally poorer than testis, the RNA-seq of liver likely brought a remarkable amount of additional data as pointed out by the extent of the overlap between the two transcriptomes (Figure 8). Therefore, although the RNA-seq of two different organs like testis and liver was particularly effective to approach the coverage of a complete transcriptome, the incomplete overlap observed between the two L. menadoensis transcriptomes and the L. chalumnae muscle suggests that the sequencing of RNAs obtained from additional samples would be useful in order to effectively describe the complete transcriptome of this organism.