Freshwater planarians are an attractive model for regeneration and stem cell research and have become a promising tool in the field of regenerative medicine. With the availability of a sequenced planarian genome, the recent application of modern genetic and high-throughput tools has resulted in revitalized interest in these animals, long known for their amazing regenerative capabilities, which enable them to regrow even a new head after decapitation. However, a detailed description of the planarian transcriptome is essential for future investigation into regenerative processes using planarians as a model system.
In order to complement and improve existing gene annotations, we used a 454 pyrosequencing approach to analyze the transcriptome of the planarian species Schmidtea mediterranea. Altogether, 598,435 454-sequencing reads, with an average length of 327 bp, were assembled together with the ~10,000 sequences of the S. mediterranea UniGene set using different similarity cutoffs. The assembly was then mapped onto the current genome data. Remarkably, our Smed454 dataset contains more than 3 million novel transcribed nucleotides sequenced for the first time. A descriptive analysis of planarian splice sites was conducted on those Smed454 contigs that mapped unambiguously to the current genome assembly. Sequence analysis allowed us to identify genes encoding putative proteins with defined structural properties, such as transmembrane domains. Moreover, we annotated the Smed454 dataset using Gene Ontology, and identified putative homologues of several gene families that may play a key role during regeneration, such as neurotransmitter and hormone receptors, homeobox-containing genes, and genes related to eye function.
We report the first planarian transcript dataset, Smed454, as an open resource tool that can be accessed via a web interface. Smed454 contains significant novel sequence information about most expressed genes of S. mediterranea. Analysis of the annotated data promises to contribute to identification of gene families poorly characterized at a functional level. The Smed454 transcriptome data will assist in the molecular characterization of S. mediterranea as a model organism, which will be useful to a broad scientific community.
One of the challenges that medical research must address in the near future is to understand why some animals are able to regenerate complex structures, including eyes and even whole bodies, from small body fragments, while others are not. With the recent emergence of the field of regenerative medicine, the future biomedical ramifications of the study of animal regeneration are obvious.
Freshwater planarians are a classic model for studying the fascinating process of regeneration [1–4] because they are capable of re-building a complete organism from almost any small body fragment. This is made possible by a unique population of adult somatic stem cells called neoblasts. During regeneration and constant homeostatic cell turnover, neoblasts differentiate into all cell types, including germ cells in sexual species [5, 6]. In recent years, several studies have begun to unravel the mechanisms by which regeneration is regulated at the molecular level. For example, different genes have been shown to play pivotal roles in axon guidance and neurogenesis, the regulation of neoblast proliferation and differentiation [8, 9], and the re-establishment and maintenance of the anteroposterior (AP) and dorsoventral (DV) body axes. Schmidtea mediterranea and Dugesia japonica are the two planarian species most often used in regeneration studies. There are about 78,000 ESTs (Expressed Sequence Tags) for S. mediterranea in NCBI, generated in different projects [11, 12]. Those sequences were clustered to produce a set of 10,000 putative mRNAs, which are available from the NCBI UniGene database. The S. mediterranea genome has also been sequenced and assembled at the Genome Sequencing Center at Washington University in St. Louis (WUSL, USA) after approval of a white paper. However, because of this genome's internal complexity (67% A+T) and the lack of a BAC library, its completeness and assembly still need improvement. A step towards this end was taken when the S. mediterranea genome and EST information were integrated and approximately 30,000 genes were predicted using an annotation pipeline called MAKER. Those gene models, together with ~9,000 mRNAs generated using next-generation sequencing technology, were mapped on the planarian genome and used to improve the assembly. The current assembly contains 43,673 contigs.
These are accessible, together with the MAKER annotation data, in the S. mediterranea genome database (SmedGD).
In order to expand our knowledge of the planarian transcriptome and to provide a new tool that can be used to improve the S. mediterranea genome annotation, we generated a new transcriptome dataset using 454 pyrosequencing technology. The Smed454 dataset can be freely accessed via a website, from which the complete sequence data can also be downloaded. Mapping of the Smed454 ESTs onto the genome scaffolds shows that the Smed454 dataset contains more than 3 million nucleotides sequenced de novo. In addition, this mapping extends and connects currently fragmented genomic contigs. Finally, GO annotation of the Smed454 dataset assigns candidate functions to those sequences and facilitates their grouping into distinct gene families. In this way, whole gene families can be analyzed for putative roles in planarian regeneration. Thus we are confident that the Smed454 dataset will improve our understanding of how planarian regeneration works at the molecular level.
Results and Discussion
Construction and sequencing of the Smed454 dataset
In order to obtain the most representative set of planarian genes expressed under different physiological conditions, total RNA was isolated from a mixture of non-irradiated and irradiated intact and regenerating planarians (see Methods). We used planarians regenerating both head and tail in order to capture genes expressed in a tissue-specific manner. Similarly, planarians at different stages of regeneration were used in order to isolate genes with different temporal expression profiles. Irradiation destroys planarian neoblasts within 1-2 days, and the animals die within a few weeks because they cannot sustain normal cell turnover. By including irradiated animals, we expect transcripts specifically expressed under those conditions to be represented in the 454 dataset.
Using 454 pyrosequencing, 601,439 sequencing reads with an average length of 327 bp were obtained. After sequence cleaning to remove vector contamination, the remaining 598,435 sequences were assembled using different cut-off values for sequence similarity (90%, 95% and 98%). In addition, our 454 sequence reads were assembled together with the ~10,000 sequences of the S. mediterranea UniGene set available at NCBI, using the 90% similarity criterion. This last set, which was used in most of the analyses reported, is referred to as the 90e set. Table 1 summarizes the number of contigs and singletons obtained in each of those assemblies. The similarities between the three assemblies (90, 98 and 90e) are illustrated in Figure 1, a Venn diagram showing that 72.68% of the raw sequencing reads were integrated into contigs common to all three assemblies, and that 20.51% of the sequencing reads make up a shared pool of single sequencing reads (singletons). Therefore, differences between the assemblies can be explained by the differential inclusion of the remaining 6.81% of the sequencing reads.
Average GC content and sequence length and their respective distributions were similar for all three assemblies (Table 1 and Figure 2). GC content is distributed around 35%, the expected value for coding sequences in this species. The 90e length distribution was slightly shifted towards longer sequences. This shift was mainly due to a set of long sequences (> 800 bp) from the NCBI UniGene ESTs included in this assembly. This causal relationship was evident in the comparison of the following four subsets of sequences from the 90e set (light-blue violin plots in the right panel of Figure 2): singletons (136,271), contigs that do not contain UniGene ESTs (46,958), contigs including UniGene ESTs (6,909), and finally, UniGene ESTs not assembled into a contig (2,495).
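As a minimal illustration of the quantities discussed above, the following Python sketch computes the GC percentage of a sequence and adds up the four 90e subsets. The function name `gc_content` and the dictionary keys are hypothetical and are not taken from the original analysis scripts; the subset sizes are the ones reported in the text, and their sum matches the 192,633 sequences reported for the 90e assembly.

```python
def gc_content(seq: str) -> float:
    """Return the GC percentage of a nucleotide sequence.

    Only unambiguous bases (A, C, G, T) are counted; anything else,
    such as N, is ignored. This is an illustrative choice, not
    necessarily the convention used for Table 1.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


# Sizes of the four 90e subsets as reported in the text.
subsets_90e = {
    "singletons": 136_271,
    "contigs_without_unigene_ests": 46_958,
    "contigs_with_unigene_ests": 6_909,
    "unigene_ests_not_assembled": 2_495,
}
total_90e = sum(subsets_90e.values())
```

Applied to every sequence of an assembly, `gc_content` would give the kind of distribution summarized in Figure 2.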
Mapping the 90e assembly onto the genome
The 90e assembly (192,633 sequences, 70,274,612 bases in total, average length of 365 bp per sequence) was aligned to scaffolds from the S. mediterranea WUSL genome assembly, version 3.1 (43,294 sequences, 901,626,601 bases in total, average length of 20.8 kilobases per scaffold). Figure 3 shows all possible high-scoring segment pair (HSP) relationships between those two sequence sets. From almost 30 million initial HSPs, around 7 million were selected using a combination of thresholds, as described in the Methods section. Discarding singleton sequences in a second round of filtering further reduced the number of HSPs to 5 million, and HSP coverage dropped from 25.36% and 77.24%, for scaffolds and 90e respectively, to 10.57% and 37.93%. However, when the total nucleotide length was considered only for the contigs (56,363 sequences, 32,518,399 bases in total, with an average of 577 bp per sequence), HSP coverage for 90e rose to 81.97%. This means that most of the significant HSP hits are retained after the second round of filtering. In total, 8,831 contigs from 90e did not map to the genomic contigs (3,242,054 bp that are completely novel and also transcribed, see column A in Figure 3). Conversely, 5,138 genomic contigs did not match a sequence from 90e (column B). Of the 90e contigs, 322 extended a genomic sequence from the left (column C) and 3,051 from the right (column J). The largest intergenic distance was 42,209 bp, with an average value of 1,102 bp (column H). The largest intron was estimated to be about 9,300 bp, the average length being 238 bp (column E). Finally, there were 20,504 HSPs connecting different genomic sequences via 8,604 different 90e contigs (column I). Of the 8,831 90e contigs not found on the genome, 3,480 had a BLAST hit to the NCBI NR protein database (39.41%), and, of those, 2,401 had a hit to a protein with GO annotation (27.19%).
After discarding abundant actin-like sequences (1,503), ATP/ADP transporter proteins (722) and sequences matching bacterial, protozoan or fungal genes (1,234), 71 90e contigs remained as new sequences not mapping on the genome (see Additional File 1).
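The summary figures quoted in this section follow directly from the reported totals. The short Python sketch below recomputes them; the helper names `mean_length` and `percent` are hypothetical, and only numbers stated in the text are used.

```python
def mean_length(total_bases: int, n_sequences: int) -> float:
    """Average sequence length in bases."""
    return total_bases / n_sequences


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# Totals as reported in the text.
mean_90e = mean_length(70_274_612, 192_633)          # 90e assembly
mean_scaffold = mean_length(901_626_601, 43_294)     # genome scaffolds
mean_contigs = mean_length(32_518_399, 56_363)       # 90e contigs only

# Unmapped 90e contigs with a BLAST hit, and with a GO-annotated hit,
# both expressed relative to the 8,831 unmapped contigs.
pct_blast_hit = percent(3_480, 8_831)
pct_go_hit = percent(2_401, 8_831)
```

Note that both percentages are taken relative to the 8,831 unmapped contigs, which is the denominator that reproduces the values given above.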
In order to validate exonic structures, 6,226 90e contigs mapping 1-to-1 over genome sequences were selected. After re-aligning the 90e/genomic sequence pairs, 4,739 contained at least one putative intron (see the corresponding splice site boundaries in Additional File 2). In total, 8,609 introns were retrieved from the genomic contigs. Figure 4 shows the number of introns per 90e contig, as well as the length distribution for those introns. Pictograms summarize the nucleotide frequencies for the donor and acceptor splice sites, both for the U2 (canonical) and U12 (non-canonical) introns. The splice site patterns resemble those from other metazoans, taking into account that the genome of S. mediterranea is A/T-rich.
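The nucleotide frequencies behind such pictograms can be obtained by counting bases position by position over a set of aligned splice-site sequences. The following Python sketch shows the idea under simple assumptions: the function name `position_frequencies` is hypothetical, the four donor sequences are invented toy data rather than planarian splice sites, and no pseudocounts or background correction are applied.

```python
from collections import Counter


def position_frequencies(sites):
    """Per-position nucleotide frequencies for equal-length site sequences.

    Returns a list with one dict per position, mapping A, C, G and T
    to their relative frequency at that position.
    """
    if not sites:
        return []
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all site sequences must have the same length")
    matrix = []
    for i in range(length):
        counts = Counter(s[i].upper() for s in sites)
        matrix.append({base: counts.get(base, 0) / len(sites) for base in "ACGT"})
    return matrix


# Toy donor sites (hypothetical sequences, for illustration only).
donors = ["GTAAGT", "GTAAGA", "GTATGT", "GTAAAT"]
pfm = position_frequencies(donors)
```

Each row of `pfm` corresponds to one column of a pictogram, with letter heights proportional to the stored frequencies.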
Also, 50 randomly picked 90e contigs that either mapped or did not map to the genome were validated by RT-PCR (see Additional File 3 containing a list of the selected 90e contigs, as well as information on the primers used to amplify them). Additionally, 20 out of those 50 genes were further validated by sequencing. Finally, to further confirm the quality and coverage of the sequences from the 90e dataset, the S. mediterranea genes already annotated in NCBI GenBank were compared with those sequences. After discarding 18S and 28S ribosomal RNA genes and alpha-tubulins, 124 known genes were aligned to the 90e sequences. In total, 108 of these genes had at least one significant similarity hit with one 90e sequence, and two matched 5 sequences from 90e. On average, the known genes had co-linear similarity hits against 1.32 different Smed454 sequences. Minimum and average similarities were 8.35% and 85.34% respectively, and 71 sequences had more than 95% similarity. Mean coverage dropped to 77.63% when each hit was considered separately. A summary of these similarity analyses is shown in Additional File 4.
Browsing the Smed454 dataset
In order to make the Smed454 dataset useful and accessible to the planarian and non-planarian communities, a public database is available via the web. The web site allows users to view contig assemblies along with their read alignments, and to perform BLAST searches against assembled sequences. The BLAST option in the home page menu (1 in Figure 5) allows the user to BLAST sequences of interest against the 90, 98, and 90e databases (1.2 in Figure 5). Both nucleotide (BLASTN) and protein (BLASTP) searches can be performed (1.1 in Figure 5). Clicking on the Search button (1.3 in Figure 5) brings up a new window displaying a list of hits. When a score value is selected (1.4 in Figure 5), the alignment between the query sequence and the Smed454 hit is shown. The site also offers the option of downloading Smed454 sequences of interest (1.5 in Figure 5). The contig or singleton accession number can be browsed directly from the main home page (2 in Figure 5). When the user searches for a specific contig, a new window appears showing the alignment of all the sequencing reads assembled in that contig. At the bottom of that window, the result of a pre-computed BLAST on the contig consensus sequence is displayed. When a contig, singleton or read name is selected (2.1 in Figure 5), a new window will display the requested sequence. All raw and assembled sequence data are also available from that web site.
Functional annotation of 90e sequences
In order to characterize the gene families that can be found in Smed454, we annotated the three datasets; we will focus on the 90e dataset here. In total, 42.42% of the sequences had a similarity hit with at least one protein sequence in the NCBI NR protein database. Of these, almost two-thirds had 250 or more hits (see Figure 6), but the BLASTX output was limited to a maximum of 250 hits per 90e sequence owing to the large number of HSPs reported by BLAST for some of them. The Gene Ontology (GO) database was used to computationally annotate all the sequences (see Additional File 5 for 90, 98, and 90e datasets) by mapping onto them the functional codes already assigned to known proteins from NCBI NR. Many of these sequence hits matched a short ATP-binding domain, in most cases corresponding to proteins of the actin family. Consequently, that functional class, which was also anomalously over-represented, was discarded from the total number of annotations for the 90e set, as shown in Table 2.
Among the most abundant GO annotations at the biological process level, leaving aside metabolism-related features, 'response to stress' was found for 1,070 sequences (6.58%). This finding was expected because the original biological sample was a mixture of intact and regenerating planarians, both normal and irradiated. 'Regulation of biological process' was in the same range, with 1,012 sequences (6.22%). At the GO molecular function level, 'binding' was the most common annotation, although where possible a more specific annotation was provided by drilling down to the 2nd level child annotations on the GO graph. It is interesting to find, among others, 3 'selenium binding' activities, since it has been reported that selenium may play an important role in cancer prevention, immune system function, male fertility, cardiovascular and muscle disorders, and prevention and control of the ageing process. Finding selenium-binding proteins would be evidence of the presence of selenoproteins, which are thought to be responsible for most of the biomedical effects of selenium across eukaryota. When looking at the cellular component level and discarding many of the 'intracellular organelles' due to their co-occurrence with 'nucleotide binding', there is a notably large number of 'protein complexes', 2,918 sequences (40.79%). With 819 sequences (11.45%), another important term on this level is 'vesicle', which correlates with secretory functions, apoptosis, and autophagy.
To demonstrate the usefulness of the Smed454 dataset, we performed several searches on specific groups and gene families for which only scant data have been reported to date in planarians. Planarians are mainly known for their remarkable regenerative capabilities, which depend upon the presence of stem cells named neoblasts. Because of the unique properties of these cells, some studies have used a microarray-based strategy to detect neoblast-specific genes [25, 26]. In our Smed454 dataset we were able to identify, in addition to known neoblast markers such as piwis, histones, bruli, vasa or tudor, several other genes annotated as involved in cell cycle or DNA damage and repair (Additional File 6). Within this gene set we find many cyclins and cell division-related genes, but also genes related to replication and chromosome maintenance. Finally, genes related to stress response and DNA damage were also identified, probably owing to the use of irradiated animals in the generation of the Smed454 dataset. In addition to these neoblast-related genes, we were able to identify large collections of families that are much less well characterized in planarians, such as neurotransmitter, peptide and hormone receptors, homeobox domain-containing genes, and genes related to eye function in other animals.
Prediction of planarian transmembrane proteins
Transmembrane (TM) proteins regulate a number of biological processes ranging from catalytic processes in intracellular and extracellular transport to cell-to-cell communication. TM proteins have become particularly interesting as many of them are key initiators of signal transduction pathways, and they can be easily manipulated by small molecule- or antibody-based drugs. To identify putative TM proteins from the planarian transcriptome, we mined the 454 dataset for putative TM protein-encoding messages (see Methods). Keeping only the proteins predicted by at least two applications to contain one or more transmembrane domains yielded a list of 8,597 predicted transmembrane proteins (see Figure 7a), which represents 15.3% of the complete protein database. Protein-BLAST searches were then used to align sequences to each other, and redundant sequences were removed from the predicted transmembrane set. The resulting database contained 4,663 sequences. Functional categorization using the UFO web-server allowed us to assign PFAM protein families to 1,474 of the sequences and gene ontology classifications to 2,464. The top ten PFAM domains (~33% of all assignments) included, for example, the classifications for 'major facilitator superfamily' (a ubiquitous transporter family), '7 transmembrane receptor (rhodopsin family)' and 'ion transport protein' (see Figure 7b). The top ten gene ontology profiles (~49% of all assignments) included 'membrane' (cellular component), 'transport', and 'G-protein coupled receptor protein signalling pathway' (both biological processes, see Figure 7c). The enrichment of our database with proteins that have a predicted function in transport and receptor signalling supports the reliability of our approach. A complete list of the 4,663 predicted transmembrane proteins, the number of predicted transmembrane domains, predicted topology, and functional categorizations (PFAM and GO) are shown in Additional File 7.
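The consensus rule used here (a protein is kept when at least two of the three predictors report one or more transmembrane domains) can be sketched in a few lines of Python. The function name `consensus_tm`, the input layout and the toy predictions are hypothetical; the sketch does not run Phobius, TMHMM2.0 or SOSUI, it only combines helix counts that are assumed to have been parsed from their outputs.

```python
def consensus_tm(predictions, min_votes=2):
    """Return IDs of proteins called transmembrane by at least `min_votes` tools.

    `predictions` maps a protein ID to a dict of
    {tool name: number of predicted transmembrane helices}.
    """
    selected = []
    for protein_id, calls in predictions.items():
        votes = sum(1 for n_helices in calls.values() if n_helices >= 1)
        if votes >= min_votes:
            selected.append(protein_id)
    return selected


# Toy input (invented helix counts, for illustration only).
toy_predictions = {
    "prot_a": {"Phobius": 7, "TMHMM2.0": 7, "SOSUI": 6},
    "prot_b": {"Phobius": 1, "TMHMM2.0": 0, "SOSUI": 0},
    "prot_c": {"Phobius": 0, "TMHMM2.0": 2, "SOSUI": 1},
}
toy_selected = consensus_tm(toy_predictions)

# Share of the protein database (56,362 sequences, see Methods)
# represented by the 8,597 consensus predictions.
tm_fraction = 100.0 * 8_597 / 56_362
```

Requiring two votes rather than three trades some specificity for sensitivity, since a single tool missing a helix does not discard the protein.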
Neurotransmitter and hormone receptors in Schmidtea mediterranea
Despite our growing knowledge about how planarian neoblasts are regulated at the molecular level [9, 25, 26, 28–31], we are still far from characterizing the complete repertoire of factors that control neoblast biology. Receptors for neurotransmitters, peptides and hormones are among the candidates for a role in the regulation of neoblast proliferation, differentiation and migration. In planarians, some of the data suggest that molecules such as dopamine [32, 33], serotonin, substance P, somatostatin and FMRFamide can accelerate or delay the regeneration rate, probably by regulating neoblast proliferation and/or differentiation. A model has been proposed in which neoblasts express receptors for some of these factors, which in turn regulate the fate of these cells. We found 288 contigs and singletons in the annotated Smed454 dataset with significant homology to neurotransmitter and hormone receptors (Table 3 and Additional File 8), providing a list of potentially interesting candidates.
Homeobox-containing sequences in Schmidtea mediterranea
Since the first homeobox-containing genes were characterized in planarians, a large number of Hox and ParaHox genes that could be accommodated into the classical series of paralogous groups from Plhox1 to Plohox-9 and Xlox to cad/Cdx [39, 40] have been described. Some of them show differential, axially nested expression, while others are ubiquitously expressed [41–43]. Most of this work has been done in the planarians Girardia tigrina and Dugesia japonica. Recently, the first expression of an S. mediterranea Hox gene has been reported. We identified 50 contigs and singletons with significant sequence similarity to homeobox gene sequences in the annotated Smed454 dataset (Table 4), including Hox genes and homeobox-containing genes, some already characterized in other planarian species.
Eye genes in Schmidtea mediterranea
The structural simplicity of the planarian eye in conjunction with the regenerative abilities of these organisms provides a unique system for dissecting the genetic mechanisms that allow a simple visual structure to be built [45, 46]. Despite great morphological differences, there is evidence that the early morphogenesis of animal eyes requires the regulatory activity of Pax6, Sine oculis (Six), Eyes absent (Eya) and Dachshund (Dach), a gene network known as the retinal determination gene network (RDGN) [47–50]. Most of the genetic elements of the RDGN have been characterized in planarians [51–54]. In addition, the following planarian genes have been identified as being involved in eye regeneration: Djeye53, Dj1020HH; Smed-netR, Smed-netrin2; Gt/Smed/Dj ops [46, 57]; Djsnap-25; and Smednos. In order to characterize new S. mediterranea eye network genes, we analyzed the Smed454 annotated dataset and found a collection of genes, ranging from transcription factors to eye-realizator genes, which have been implicated in eye development in other systems. These are good candidates for expanding our knowledge about the genetic network responsible for planarian eye regeneration (Table 5 and Additional File 9).
The inherent complexity of the planarian genome and methodological difficulties initially prevented the complete genome assembly of S. mediterranea. High-throughput sequencing technologies are now well established and help molecular biologists to unravel the molecular components of organisms. We present a 454 sequencing dataset that can be used to decipher the transcriptome of the planarian S. mediterranea, an organism that has great potential for the study of regeneration processes.
We obtained more than half a million sequencing reads and assembled them into different datasets using a number of different similarity thresholds. The complete dataset has been made publicly available via the web. About 50,000 contigs in one of those sets (90e) were mapped against the most up-to-date genome scaffolds and to the set of known proteins from NCBI NR. Interestingly, we found a large number of transcribed sequences not covered by the genome sequence (more than 3 Mbp). The novel 454 contigs will allow us to extend current genomic sequences and connect up to 8,000 pairs of genome scaffolds. Furthermore, a preliminary analysis of the planarian splice sites was made on a collection of 454 contigs mapped unambiguously to the genome. Annotation of the sequences yielded a number of gene candidates in different functional categories that will be useful for further experimental studies. However, many of the novel contigs have no similarity to known proteins and will require further validation if we want to understand the transcriptional inventory of the planarian at a functional level. We also provided a preliminary gene annotation for S. mediterranea, focusing our rankings on four different gene families; these serve as applied examples of the usefulness of this new sequence resource.
Animals and RNA isolation
Schmidtea mediterranea from the BCN-10 clonal line were used. Animals were starved for one week prior to experiments and irradiated at a lethal dose of 100 Gy. Total RNA was isolated from a mixed sample of planarians that contained non-irradiated intact and regenerating planarians (1, 3, 5 and 7 days of regeneration) as well as irradiated intact and regenerating animals (1, 3, 5 and 7 days of regeneration). RNA was extracted with TRIzol® (Invitrogen) following the manufacturer's instructions.
cDNA library construction and 454 sequencing
First, 5 μg of total RNA was used to construct a cDNA library. RNA quality was assessed in a Bioanalyzer 2100 (Agilent-Bonsai Technologies). 5 μg of full-length double-stranded cDNA was then processed by the standard Genome Sequencer library-preparation method using the 454 DNA Library Preparation Kit (Titanium chemistry) to generate single-stranded DNA ready for emulsion PCR (emPCR™). The cDNA library was then nebulized according to the fragmentation process used in the standard Genome Sequencer shotgun library preparation procedure. The cDNA library was sequenced according to GS FLX technology (454/Roche). Reads were assembled by MIRA version 3 using enhanced 454 parameters.
Mapping to the genome and functional annotation
BLAT was used with default parameters to map the Smed454 90e dataset on the S. mediterranea draft genome assembly v3.1, since the 454 sequences should be very similar to the corresponding genomic sequences, except for the lack of introns. Perl scripts were developed to classify all HSPs into the categories shown in Figure 3. 90e contigs having two or more collinear HSPs covering more than 100 bp of the contig, and for which each HSP had more than 90% identity to the genomic contig and a length greater than 50 bp, were chosen as 1-to-1 matches to the genome. Once the sequences of the 90e/genomic contig pairs were retrieved, exonerate was used to refine the alignments over the splice sites (using as parameters model = est2genome and bestn = 10). Perl scripts were used to retrieve the splice site coordinates from the exonerate output, as well as the sequences from genomic contigs. After clipping the donor and acceptor splice sites for each intron, nucleotide frequencies were computed and the corresponding position weight matrices for U2/U12 sites were drawn as pictograms using compi. Known S. mediterranea genes were compared with contigs from 90e using BLASTN with the following cut-offs: e-value = 0.001, identity score > 80%, HSP length > 50 bp.
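The selection rule for 1-to-1 matches can be sketched as follows in Python. This is an illustration under stated assumptions, not the Perl code used in the study: the `HSP` record and its field names are hypothetical, all HSPs passed to `is_one_to_one` are assumed to link one 90e contig to one genomic contig, and collinearity is taken to mean a single strand with genomic coordinates that are monotonic along the contig.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    q_start: int      # start on the 90e contig (0-based)
    q_end: int        # end on the 90e contig (exclusive)
    t_start: int      # start on the genomic contig
    t_end: int        # end on the genomic contig
    identity: float   # percent identity
    strand: str       # "+" or "-"


def passes_thresholds(h: HSP) -> bool:
    """More than 90% identity and an HSP longer than 50 bp."""
    return h.identity > 90.0 and (h.q_end - h.q_start) > 50


def is_collinear(hsps) -> bool:
    """Same strand, and genomic order consistent with contig order."""
    if len({h.strand for h in hsps}) != 1:
        return False
    ordered = sorted(hsps, key=lambda h: h.q_start)
    t_starts = [h.t_start for h in ordered]
    pairs = list(zip(t_starts, t_starts[1:]))
    if ordered[0].strand == "+":
        return all(a < b for a, b in pairs)
    return all(a > b for a, b in pairs)


def covered_bases(hsps) -> int:
    """Number of contig bases covered by the union of the HSPs."""
    total, last_end = 0, 0
    for start, end in sorted((h.q_start, h.q_end) for h in hsps):
        start = max(start, last_end)
        if end > start:
            total += end - start
            last_end = end
    return total


def is_one_to_one(hsps) -> bool:
    """Two or more collinear, high-identity HSPs covering more than 100 bp."""
    kept = [h for h in hsps if passes_thresholds(h)]
    return len(kept) >= 2 and is_collinear(kept) and covered_bases(kept) > 100
```

In this sketch the gap between two consecutive HSPs on the genomic contig corresponds to a candidate intron, which is the region later refined with exonerate.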
GO functional annotation was computed on the BLASTX results of the three assembly datasets (90, 98, and 90e) against all proteins from NCBI NR. BLASTX parameters were set to e-value = 10e-25 and maximum number of descriptions and alignments to report = 250, which produced around 26 million HSPs for each set. After that, only HSPs with a minimum length of 80 bp and a similarity score of at least 80% were considered. GO annotation was performed on those HSPs using the e-value selection criteria and supporting sequences described for Blast2GO. Further Perl scripts were used to summarize the data shown in Table 2 and Additional File 3.
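The post-BLASTX filtering step can be illustrated with a small Python sketch. The function name `filter_hsps` and the tuple layout are hypothetical, and the toy records are invented. The e-value constant is written exactly as in the text; in floating-point notation `10e-25` is numerically equal to `1e-24`, and whether `1e-25` was intended is not stated in the source.

```python
from collections import defaultdict

# E-value cut-off as written in the text; numerically this is 1e-24.
EVALUE_CUTOFF = float("10e-25")


def filter_hsps(hsps, min_length=80, min_similarity=80.0, max_evalue=EVALUE_CUTOFF):
    """Keep HSPs that satisfy the length, similarity and e-value thresholds.

    `hsps` is an iterable of (query_id, subject_id, length_bp,
    similarity_percent, evalue) tuples. Returns a dict mapping each
    query ID to the list of subject IDs that passed.
    """
    kept = defaultdict(list)
    for query_id, subject_id, length, similarity, evalue in hsps:
        if length >= min_length and similarity >= min_similarity and evalue <= max_evalue:
            kept[query_id].append(subject_id)
    return dict(kept)


# Toy HSPs (invented values, for illustration only).
toy_hsps = [
    ("contig_1", "prot_1", 120, 92.0, 1e-40),
    ("contig_1", "prot_2", 60, 99.0, 1e-30),    # too short
    ("contig_2", "prot_3", 200, 75.0, 1e-50),   # similarity too low
    ("contig_1", "prot_4", 80, 80.0, 1e-26),
    ("contig_3", "prot_5", 150, 95.0, 1e-10),   # e-value too high
]
toy_kept = filter_hsps(toy_hsps)
```

Only the subjects that survive this filter would then contribute GO terms to the corresponding query sequence.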
In order to validate the expression of a random subset of novel 454 transcripts, RT-PCRs were performed on planarian cDNA generated with Superscript III (Invitrogen) following the manufacturer's instructions. Additional File 3 includes a list of the contigs validated and the primers used for each of them.
Prediction of transmembrane proteins from ESTs
A total of 53,867 assembled ESTs (90e database) and 2,495 additional mRNAs were translated into all six reading frames using the 'transeq' program from the EMBOSS package. The longest open reading frame for each EST/mRNA was then extracted and used as a protein database (containing 56,362 protein sequences overall) for the prediction of membrane-spanning proteins. We followed an approach described by Almen et al., basing our analysis on consensus predictions of alpha-helices and using three applications: Phobius, TMHMM2.0, and SOSUI. Phobius and TMHMM2.0 both use hidden Markov models based on different training sets to predict membrane topology. SOSUI evaluates proteins for their hydrophobic and amphiphilic properties to make its predictions. The use of all three programs should improve prediction accuracy. We first ran Phobius, which can predict both transmembrane helices and signal peptides. Signal peptide sequences are similar to transmembrane segments owing to their hydrophobic nature. To avoid false positive predictions, we excluded signal peptides before running TMHMM2.0 and SOSUI.
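The six-frame translation and longest-ORF extraction can be sketched in plain Python as shown below. This is not the EMBOSS 'transeq' implementation: the function names are hypothetical, the standard genetic code is assumed, and an open reading frame is taken here to be the longest stretch between stop codons in any frame, without requiring a start codon (the text does not specify which definition was used, and ESTs are often partial).

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str):
    """Three forward frames followed by three reverse-complement frames."""
    seq = seq.upper()
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


def longest_orf(seq: str) -> str:
    """Longest stop-free peptide found in any of the six frames."""
    best = ""
    for protein in six_frame_translations(seq):
        for segment in protein.split("*"):
            if len(segment) > len(best):
                best = segment
    return best
```

For the toy sequence `ATGGCCTGA`, the forward frame reads `MA*`, while the longest stop-free stretch is found on the reverse strand, which illustrates why all six frames are inspected when the transcript orientation is unknown.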
Fernández-Taboada E, Moritz S, Zeuschner D, Stehling M, Scholer HR, Saló E, Gentile L: Smed-SmB, a member of the LSm protein superfamily, is essential for chromatoid body organization and planarian stem cell proliferation. Development. 2010, 137 (7): 1055-1065.
Sánchez-Alvarado A, Newmark PA, Robb SM, Juste R: The Schmidtea mediterranea database as a molecular resource for studying platyhelminthes, stem cells and regeneration. Development. 2002, 129 (24): 5659-5665.
Zayas RM, Hernandez A, Habermann B, Wang Y, Stary JM, Newmark PA: The planarian Schmidtea mediterranea as a model for epigenetic germ cell specification: analysis of ESTs from the hermaphroditic strain. Proc Natl Acad Sci USA. 2005, 102 (51): 18491-18496. 10.1073/pnas.0509507102.
Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez-Alvarado A, Yandell M: MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008, 18 (1): 188-196. 10.1101/gr.6743907.
Castellano S, Novoselov SV, Kryukov GV, Lescure A, Blanco E, Krol A, Gladyshev VN, Guigo R: Reconsidering the evolution of eukaryotic selenoproteins: a novel nonmammalian family with scattered phylogenetic distribution. EMBO Rep. 2004, 5 (1): 71-77. 10.1038/sj.embor.7400036.
Eisenhoffer GT, Kang H, Sánchez-Alvarado A: Molecular analysis of stem cells and their descendants during cell turnover and regeneration in the planarian Schmidtea mediterranea. Cell Stem Cell. 2008, 3 (3): 327-339. 10.1016/j.stem.2008.07.002.
Rossi L, Salvetti A, Marincola FM, Lena A, Deri P, Mannini L, Batistoni R, Wang E, Gremigni V: Deciphering the molecular machinery of stem cells: a look at the neoblast gene expression profile. Genome Biol. 2007, 8 (4): R62. 10.1186/gb-2007-8-4-r62.
Kreshchenko ND: Functions of flatworm neuropeptides NPF, GYIRF and FMRF in course of pharyngeal regeneration of anterior body fragments of planarian, Girardia tigrina. Acta Biol Hung. 2008, 59 (Suppl): 199-207. 10.1556/ABiol.59.2008.Suppl.29.
Bayascas JR, Castillo E, Saló E: Platyhelminthes have a hox code differentially activated during regeneration, with genes closely related to those of spiralian protostomes. Dev Genes Evol. 1998, 208 (8): 467-473. 10.1007/s004270050204.
Saló E, Pineda D, Marsal M, González J, Gremigni V, Batistoni R: Genetic network of the eye in Platyhelminthes: expression and functional analysis of some players during planarian regeneration. Gene. 2002, 287: 67-74.
Inoue T, Kumamoto H, Okamoto K, Umesono Y, Sakai M, Sánchez-Alvarado A, Agata K: Morphological and functional recovery of the planarian photosensing system during head regeneration. Zoolog Sci. 2004, 21 (3): 275-283. 10.2108/zsj.21.275.
Cebrià F, Newmark PA: Planarian homologs of netrin and netrin receptor are required for proper regeneration of the central nervous system and the maintenance of nervous system architecture. Development. 2005, 132 (16): 3691-3703.
Handberg-Thorsager M, Saló E: The planarian nanos-like gene Smed nos is expressed in germline and eye precursor cells during development and regeneration. Dev Genes Evol. 2007, 217 (5): 403-411. 10.1007/s00427-007-0146-3.
Almen MS, Nordstrom KJ, Fredriksson R, Schioth HB: Mapping the human membrane proteome: a majority of the human membrane proteins can be classified according to function and evolutionary origin. BMC Biol. 2009, 7: 50. 10.1186/1741-7007-7-50.
Kall L, Krogh A, Sonnhammer EL: Advantages of combined transmembrane topology and signal peptide prediction--the Phobius web server. Nucleic Acids Res. 2007, 35 (Web Server issue): W429-432. 10.1093/nar/gkm256.
Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305 (3): 567-580. 10.1006/jmbi.2000.4315.
Genomic sequence data was produced by the Washington University Genome Sequencing Center in St. Louis. DNA library construction, 454 pyrosequencing and the contig browser were generated by Skuldtech (Montpellier, France). We thank Ann King for language revision and editorial advice, and Franziska Konert for technical assistance. SF and GRE are supported by FPI fellowships, and BC is supported by an FPU fellowship (both MICINN, Spain). TH is supported by a fellowship of the Studienstiftung des deutschen Volkes. FC is a Ramón y Cajal researcher (MICINN, Spain). This work was funded by grants BFU2008-01544 (MICINN) to ES, BFU2008-00710 (MICINN) to FC, BFU2009-09102 (MICINN) to JFA, and 2009SGR1018 (AGAUR) to ES and FC. Finally, KB was funded by a FRONTIER grant from the University of Heidelberg DFG Excellence Initiative.
Authors and Affiliations
Departament de Genètica, Facultat de Biología, Universitat de Barcelona (UB), Av. Diagonal 645, edifici annex, planta 1, 08028, Barcelona, Catalunya, Spain
JFA performed the computational analyses on the assemblies, the GO characterization, the mapping onto the genome and the analysis of splice sites, and prepared all the corresponding figures and tables. GRE analyzed the coverage of known annotated genes and generated the corresponding table. TH performed the sequence analysis of planarian transmembrane proteins, generated the corresponding figure and table and designed primers for RT-PCRs. GRE, SF, FC and KB designed the primers and performed the RT-PCRs. FC and SF analyzed the annotated data to characterize the neurotransmitter, peptide and hormone receptors and prepared the corresponding tables. ES and BC analyzed the annotated data to characterize homeobox-containing and eye-related genes and prepared the corresponding tables. JFA, KB, FC and ES conceived of the study, participated in its design and coordination, and helped draft the manuscript. All authors read and approved the final manuscript.
Josep F Abril and Francesc Cebrià contributed equally to this work.
Additional file 1: GO annotation for 90e contigs not mapping onto the WUSL 3.1 genome assembly. 8,831 90e contigs were not found in the genome. 3,480 had a BLASTX hit to a sequence of NCBI NRprot; yet only 2,401 had a hit to a protein functionally annotated in the GO database. This file contains the description of the best HSP for 71 of those annotated contigs, after filtering out as described above. (Header: CONTIG ID = Smed454 sequence identifier, E-VALUE = BLASTX HSP E-value, ALN_SCORE = HSP alignment score, IDENTITIES = number of identical amino acids, POSITIVES = number of similar amino acids, SEQUENCE ID = Protein sequence identifier, ACCESSION NUMBER = Protein sequence full accession number, SEQUENCE DESCRIPTION = Full protein GenBank description). (XLS 30 KB)
Additional file 2: Splice sites for a subset of Smed454 sequences mapped onto the Schmidtea mediterranea genome. (Header: GID = Genomic contig IDentifier from WUSLv3.1 genome assembly--including the start and end nucleotide coordinates for the complete match--, CIG = 90e contig IDentifier, INTNUM = Intron number within the 90e contig, EXO = splice signals found by exonerate, ORI = sequence orientation--here -1 means that the match was found on the reverse strand of the genomic contig--, CEXO = corrected splice site signals after reverse complementing the genomic sequence when required, ILEN = Intron length in bp, IORI = Intron start--relative to the match coordinates--, IEND = Intron end--relative to the match coordinates--, STRAND, SSSEQ = Splice site sequences--where a point separates three nucleotides from the 5' and 3' exons, and the three dots in the middle denote intron sequence not shown for clarity--). (TBL 1 MB)
Additional file 3: List of 90e transcripts validated by RT-PCR. (Header: # = Number, CONTIG = 90e contig ID, PRIMER_FORWARD = 5' to 3' sequence of the forward primer used, REVERSE_FORWARD = 5' to 3' sequence of the reverse primer used, AMPLICON SIZE = Size amplified in bp, SET = refers to the subset of origin of the 90e contig: no hit genome, hit genome, -blast (no BLASTX hit), +blast (BLASTX hit)). (XLS 37 KB)
Additional file 4: Smed454 sequences matching known Schmidtea mediterranea genes. (Header: ACCESSION NUMBER = Known gene sequence identifier as target, NAME = Description for that sequence, LENGTH = Nucleotide length for that sequence, A&T CONTENT = Sequence composition, 454 90e CONTIG/SINGLETON = Smed454 sequence identifier as query, LENGTH = Nucleotide sequence length for this sequence, ALIGNMENT LENGTH = HSP length, START = Start nucleotide of alignment on target, END = Final nucleotide of alignment on target, IDENTITY = Identity score, BITSCORE = Alignment bit score, E-VALUE = HSP BLAST e-value, HIT LENGTH = Un-gapped length of the alignment on the target, %COVERAGE = Sum of co-linear HSPs on target coordinates divided by the total length of the target, #SEQs = Number of co-linear HSPs considered, avg%COV = The coverage divided by the number of co-linear HSPs). (XLS 324 KB)
Additional file 5: Gene Ontology for all three Smed454 sets: 90, 98 and 90e. Level one and two GO codes are shown in order to simplify the listings. Although there are small changes in GO frequencies, annotation is consistent throughout all three sets. (Header: GO = Gene Ontology unique identifier, Count = Number of sequences with a given GO annotation, Freq% = Frequencies for every GO annotation. The total shown does not include the un-annotated and over-represented features, that is, the first two rows on each table). (XLS 55 KB)
Additional file 6: List of cell cycle, cell division, DNA repair or DNA damage candidates. Short list of candidates annotated as genes involved in cell cycle, cell division, DNA repair or DNA damage. (Header: ID = Smed454 sequence identifier, BLASTX HIT = Description of the best sequence hit, ACCESSION NUMBER = Sequence identifier of the best sequence hit, E-VALUE = BLASTX e-value for that sequence hit). (XLS 55 KB)
Additional file 7: Summary report for the consensus set of 4,663 predicted transmembrane proteins including functional annotations. (Header: Sequence_ID = Protein sequence identifier, Sequence_AA = Amino acid sequence, Length[aa] = Length of amino acid sequence, Phobius_TM = Phobius prediction of number of transmembrane domains, Phobius_SP = Phobius prediction of signal peptide, Phobius_Top = Phobius prediction of membrane topology, TMHMM_TM = TMHMM v2.0 prediction of number of transmembrane domains, TMHMM_Top = TMHMM v2.0 prediction of membrane topology, SOSUI_TM = SOSUI prediction of number of transmembrane domains, SOSUI_Top = SOSUI prediction of membrane topology, UFO_PFAM = UFO annotation of Pfam protein families, UFO_GO = UFO annotation of gene ontologies). (XLS 2 MB)
Additional file 8: List of neurotransmitter, peptide and hormone receptor sequence candidates. Complete complement of Smed454 dataset contigs and singletons showing homology to neurotransmitter and hormone receptors, totalling 287 sequences. (Header: ID = Smed454 sequence identifier, BLASTX HIT = Description of the best sequence hit, ACCESSION NUMBER = Sequence identifier of the best sequence hit, E-VALUE = BLASTX e-value for that sequence hit). (XLS 58 KB)
Additional file 9: List of eye-related gene sequence candidates. Complete complement of Smed454 dataset contigs and singletons showing homology to eye-related genes, totalling 95 sequences. (Header: ID = Smed454 sequence identifier, BLASTX HIT = Description of the best sequence hit, ACCESSION NUMBER = Sequence identifier of the best sequence hit, E-VALUE = BLASTX e-value for that sequence hit). (DOC 80 KB)
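The %COVERAGE, #SEQs and avg%COV columns described for Additional file 4 can be illustrated with a minimal Python sketch. It is not the authors' script: the function names are hypothetical, HSP coordinates are assumed to be 1-based and inclusive on the target, and overlapping HSPs are assumed to be merged so that shared bases are counted once (the file description does not say how overlaps were handled).

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the target, 1-based inclusive (assumption)


def merge_intervals(hsps: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent HSP intervals."""
    merged: List[Interval] = []
    for start, end in sorted(hsps):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous interval instead of double counting bases.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_stats(hsps: List[Interval], target_length: int) -> Tuple[float, int, float]:
    """Return (%COVERAGE, #SEQs, avg%COV) for one target sequence."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if not hsps:
        return 0.0, 0, 0.0
    covered = sum(end - start + 1 for start, end in merge_intervals(hsps))
    pct_coverage = 100.0 * covered / target_length
    n_hsps = len(hsps)
    # avg%COV: the coverage divided by the number of co-linear HSPs.
    return pct_coverage, n_hsps, pct_coverage / n_hsps


if __name__ == "__main__":
    # Two non-overlapping HSPs of 100 nt each on a 400 nt target.
    print(coverage_stats([(1, 100), (151, 250)], 400))
```

Merging before summing keeps %COVERAGE at or below 100 even when HSPs overlap; if HSP lengths were instead summed directly, overlapping hits would inflate the value.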
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.