Expansion of tandem repeats in sea anemone Nematostella vectensis proteome: A source for gene novelty?

Background The complete proteome of the starlet sea anemone, Nematostella vectensis, provides insights into gene invention dating back to the Cnidarian-Bilaterian ancestor. With the addition of the complete proteomes of Hydra magnipapillata and Monosiga brevicollis, the investigation of proteins having unique features in early metazoan life has become practical. We focused on the properties and the evolutionary trends of tandem repeat (TR) sequences in Cnidaria proteomes. Results We found that 11-16% of N. vectensis proteins contain tandem repeats. Most TRs cover 150 amino acid segments that are comprised of basic units of 5-20 amino acids. In total, the N. Vectensis proteome has about 3300 unique TR-units, but only a small fraction of them are shared with H. magnipapillata, M. brevicollis, or mammalian proteomes. The overall abundance of these TRs stands out relative to that of 14 proteomes representing the diversity among eukaryotes and within the metazoan world. TR-units are characterized by a unique composition of amino acids, with cysteine and histidine being over-represented. Structurally, most TR-segments are associated with coiled and disordered regions. Interestingly, 80% of the TR-segments can be read in more than one open reading frame. For over 100 of them, translation of the alternative frames would result in long proteins. Most domain families that are characterized as repeats in eukaryotes are found in the TR-proteomes from Nematostella and Hydra. Conclusions While most TR-proteins have originated from prediction tools and are still awaiting experimental validations, supportive evidence exists for hundreds of TR-units in Nematostella. The existence of TR-proteins in early metazoan life may have served as a robust mode for novel genes with previously overlooked structural and functional characteristics.


Background
Repeated domains in proteins are associated with a broad spectrum of functions. They have originated from repetitive DNA elements, including transposons, retroviruses, and other parasitic elements that impact genome dynamics. While most of these events are deleterious, some may lead to new proteins or protein domains, as was shown for Alu repeats in primates [1]. Repeated sequences may account for the majority of the genetic material in many eukaryotic genomes, among them the human genome [2]. Some repeated sequences may result in open reading frames (ORFs) of substantial length. If these sequences are integrated into an actively transcribed region, subsequent events such as recombination and gene conversion can spread this repeated sequence, resulting in fast amplification of the original ORF [3].
Tandem repeats (TR) in proteins are of various types. At one extreme are those based on a single amino acid repeat. Slippage during DNA replication could yield a repeated unit of exactly 3 nucleotides, resulting in the gain of a single amino acid. It has also been found that most mono-amino acid repeats in primates are encoded by GCrich isochores [4], but there exists a strong purifying selection against such short repeats at the DNA level [5]. The over-represented mono-amino acid repeats in eukaryotes are glutamine (Q), proline (P), glutamic acid (E), asparagine (N), aspartic acid (D), and serine (S). Mono-amino acid repeats appear mostly in transcription factors and protein kinases [6,7], where they occupy flexible regions and loops. Interestingly, the number of repeated glutamines (Q) in key proteins correlates with the severity of Huntington disease and Fragile X Syndrome [8].
A TR protein may contain short repeat units (3-4 amino acids), which occur tens and hundreds of times in tandem. In such cases, the TR units typically form an extended structure, with this phenomenon quite common in structural proteins such as collagens, keratins, antifreeze proteins, and spider silks. Such structural proteins and others, like the FG-rich proteins of the nuclear pore [9], possess unique biophysical and thermodynamic properties.
About 250 InterPro entries [10] are characterized as 'repeats'. These are protein regions that do not fold into a globular domain on their own [11]. Often, several of these repeats are needed to form a globular domain. Known examples for repeated domains that provide the scaffold of the protein are ankyrin (33 amino acids), keltch (50 amino acids), TRP (34 amino acids) and armadillo (40 amino acids) repeats. The stable 3D structures of these domain families rely on the rearrangement of their repeated units; for example, the repeat unit of WD40 represents a four-stranded β-sheet arrangement. In fact, many of the repeated domains are rich in β-sheets, where these proteins typically participate in a network of interactions such as in cell-cell contact, pathogen adhesion [12], and cell signaling.
Several methods and tools for detecting repeats in large protein databases are available [13][14][15][16]. TRs that are characterized by a predefined periodicity are categorized in the TRIPS database [14]. Tools such as HHRep [17] and PEP-PER [18] efficiently address genome-scale search for multiple repeats in proteins. The latter is suitable for analyzing sequences of fibrous proteins. TRPpred is a tool that was optimized for identifying repeated units in several protein families including TRP [19]. The Xstream server [20] is a sensitive tool for detecting TRs on a large-scale. Furthermore, it permits flexibility in defining TRs, by fine-tuning the parameters to successfully detect TRs in DNA or protein sequences. The Xstream server does not tolerate nonconserved spacers between the TR units. Therefore, proteins having multiple repeated domains (e.g., WD40, Kelch) do not necessarily fulfill the Xstream definition of TR-proteins.
Currently, most multicellular model organisms that have been studied come from Bilateria. An important step toward the study of metazoan evolution comes from the eumetazoan phylum Cnidaria, a lineage that includes corals, sea anemones, jellyfish, and hydroids. A genomic perspective on the metazoan origin comes from the recently sequenced genomes of the choanoflagellate Monosiga brevicollis [21], the starlet sea anemone Nematostella vectensis, and Hydra magnipapillata [22]. The addition of genomic information from several Porifera (sponges) contributed to the reconstruction of the evolutionary position of Metazoa with respect to Fungi [23]. Surprisingly, many genes that were initially regarded as vertebrate-specific have in fact been identified in the N. vectensis genome [22]. These genomes allowed researchers to estimate the branch linking Cnidaria and the Bilateria to have occurred 650-750 million years ago [24].
Herein, we focus on the prevalence of TRs in the sea anemone proteome. When comparing it to other genomes, a surprising enrichment in the size and complexity of its TRs is evident. The diversity of the TR sequences from N. vectensis is considerably higher than that identified in any other multicellular eukaryotic genome. The properties of the TRs in N. vectensis are discussed in view of their evolutionary potential as a source for gene novelty. We discuss the advantages of expansion and increased diversity of TRproteins as a robust mode for constructing new genes with unique functional properties.

Scarlet sea anemone proteome is exceptionally rich in tandem repeats
An increased number of sequenced genomes led to a revised view of the origin of metazoan life. The eumetazoan clade contains taxa with nervous systems and muscle cells: Cnidaria, Ctenophora, and Bilateria [25]. In addition, the complete proteome of the starlet sea anemone, Nematostella vectensis, provides insight into gene invention dating hundreds of millions of years ago back to the last common Cnidarian-Bilaterian ancestor [26]. The N. vectensis genome allows us to examine the evolution of an ancestral genome that is rich in TRs. Examples of TR-proteins that fulfill this definition are shown in Figure 1A.
Proteins A7SW76 and A7S5V7 (UniProt IDs) share a repeated unit of 49 amino acids (yellow box) with a copy number of 3.5 and 4. In addition, two other repeated units are found in A7SW76, with a copy number of 6 (7 amino acids unit) and 3 (21 amino acids unit). The remaining portions of the proteins are not identified as TRs. TR-units are defined by at least 3 repeated units, each with a minimal length of 3 amino acids, with no intervening sequences. These are non-overlapping sequences that are identified according to the Xstream tool [20] (see Methods). The number of unique TRs may differ from that of TR-proteins as the same TR-unit might occur in numerous proteins and the same TR-unit often appears in several segments on a particular protein ( Figure 1).
We compared the occurrence of protein TR sequences in N. vectensis in view of other complete proteomes, including fly (D. melanogaster), human, mouse, frog, and more. We restricted our analysis to those TRs for which all repeats share >80% sequence identity with the respective consensus repeat sequence (see Methods).
The complete proteome of N. vectensis includes 24,906 predicted proteins, where half of these are marked as fragments. The number of proteins identified with TRs was as high as 3875, of which 3315 are unique TR-units. When comparing the TRs in N. vectensis and other proteomes, the most striking difference is in the number of unique TRs. In N. vectensis, ~16% of all protein sequences are TRcontaining proteins. When employing the same parameters, TR containing sequences account for only 3% of the Drosophila melanogaster proteome ( Figure 1B).
An exhaustive comparison of the TR properties for representative proteomes, spanning an evolutionary range from N. vectensis to human, was performed. The selected organisms represent major evolutionary branches, covering worm (C. elegans), insects (fly, bee and beetle), plant TR-containing proteins from N. vectensis Figure 1 TR-containing proteins from N. vectensis. Graphical representation of two TR-containing proteins: A7SW76 and A7S5V7 (UniProt). Proteins contain TR with a unit length of 49 amino acids (yellow, TR-unit). This TR-unit appears twice on A7SW76, while the other two TR-units are unique to A7SW76. These TR-units are of length 7 and 21 amino acids (blue and green), respectively. For consistency, we count the number of unique TRs in a genome (i.e. unique non-overlapping sequences that fulfill the TR definition). TR-proteins sum all proteins that contain at least one TR-segment (TR-segment, TR-seg). The minimal TR-segment is composed of at least 3 successive TR-units. (B) Statistical comparison of N. vectensis and D. melanogaster TRproteomes.
When the fraction of TR containing proteins (TR-proteins) relative to the total Nematostella proteome is considered (Figure 2A), we noted a 2-5 fold enrichment in the fraction of such proteins relative to the other organisms tested.

Most N. vectensis tandem repeat units are unique
While the fraction of TR-proteins in N. vectensis is higher than in any other tested organism, the usage of any particular TR is typically restricted. Most TR-units are used only once in N. vectensis ( Figure 2B). In mouse and human, each TR appears on average in 2.7 and 2.1 proteins, respectively. For example, in the mouse proteome, there are ~3000 TR-proteins, yet they are composed of onlỹ 1100 unique TRs. While the genomes of human and mouse are quite active in reusing their repeated sequences [27], no clear view was presented for the dynamics of TRs in N. vectensis [22]. Moreover, the fraction occupied by a TR within the TR-protein (i.e., TR-coverage) is substantially higher in N. vectensis ( Figure 2C). This extreme TRcoverage (50%) reflects the fact that almost half of N. vectensis proteins are marked as fragments due to missing exons (and failures of genome prediction tools). The 25% coverage measured for human ( Figure 2C) and mouse (not shown) is still higher than the coverage determined for the other organisms tested, mainly insects and frog.

Characterizing the properties of TR-containing proteins from N. vectensis
The >3300 TRs of N. vectensis were statistically analyzed to determine the most prevalent repeat properties in terms of number of TR units, their length, and the overall regions they occupy. Figure 3 shows the abundance of these TRunits. It is evident that a unit length of 10-12 amino acids is most prevalent ( Figure 3A). Furthermore, an inverse correlation is observed between the number of TR-units within a TR-segment (defined in Figure 1) and the length of that basic unit (tested for a range of 6-20 amino acids). Thus, in most cases, the total region that is occupied by TRs in N. vectensis remains within the range of 120 to 180 amino acids ( Figure 3B, pink). The average length of a TRsegment in a protein is 153 amino acids. Figure 3C shows the accumulated level of variations relative to the TR consensus, where TRs with deviations ranging from 0 to accumulated changes in 20% of its amino acids (marked 0.2) are analyzed. The TRs from human and N. vectensis demonstrate clear differences. While most TRs in human have diverged substantially, this is not the case for N. vectensis (peaks at 0.06). It is possible that TRs in human and N. vectensis differ in their tendency to accumulate and main-tain variations. A substantial number of TRs in human (229) and in N. vectensis (235) shows no variations (marked 0) but most of them (80%) are short TRs (<6 amino acids). An attractive idea suggests that the higher variation observed in mouse (not shown) and human ( Figure 3C) coincide with the high usability of each TRunit in human and mouse but not in Nematostella ( Figure  2B). A similar trend was proposed for the accelerated evolution of duplicated paralogous genes [28]. A similar analysis performed for the ~1200 TR-segments of Hydra indicated an intermediate rate of variation (Additional file 1).
We expanded the comparative study to proteomes that are evolutionarily closer to N. vectensis, such as Hydra.
Recently, a Hydra genome (H. magnipapillata) was sequenced and a predicted protein set of 17,586 models was presented (Craig Venter Institute, ~6X sequence coverage, 1.3 Gb genome). The fraction of unique TR-units within the Hydra complete proteome is ~5% ( Table 1).
The difference in the absolute number of proteins and the fraction of proteins marked as 'fragments' (Table 1) is a proxy for the knowledge and annotation quality (exceptions are missing annotations for L. major and Hydra). The high fraction of partial proteins (i.e., 47% 'fragments' for N. vectensis) is consistent with their shorter average length (Table 1). We included in the analysis Leishmania major, a protozoan parasite that underwent rapid evolution and for which many proteins are known to contain TRs. Table  1 shows that the exceptionally high proportion of proteins with TRs in N. vectensis is a unique property of this organism and has not been reported to such an extent even in protozoan parasites.
To ensure that the trends seen in Table 1 do not solely reflect the poor quality of the assembly and protein annotations reported [29], we repeated the analysis but eliminated all sequences that include undefined nucleotides (indicated by nucleotide 'x') or sequences that could not match exactly their transcripts (see Methods). Even after such filtration, the fraction of TRs for N. vectensis still remains exceptionally high (11% of the entire proteome).

Amino acid composition of N. vectensis tandem repeats
We analyzed the over-representation and under-representation of amino acids in TR-proteins, in comparison with all other N. vectensis proteins. We found that several amino acids, most notably histidine (H), cysteine (C), tyrosine (Y), proline (P), and threonine (T) were enriched, with C and H being the most significantly so ( Figure 4). On the other hand, we observed that polar and charged amino acids are strongly depleted in the TR sequences. Specifically, the most depleted amino acids are phenylalanine (F), glutamic acid (E), and lysine (K). Similar N. vectensis TRs relative to representative proteomes along the metazoan evolutionary tree The relationship between TR-unit length, total length of the TR-segment (in pink) and the average number of repeats (copy number, blue). As the length of the TR-unit increases, the copy number of the repeats decreases.
(C) Comparison of the variation rate within TR-segments from N. vectensis (blue) and human (pink). Variation rate is measured up to a 20% cumulative difference in the sequence of the TR-segment relative to the consensus TR-unit. For details, see Additional file 1. C Frequency A enrichment and depletion in amino acid preference were evident when compared to the SwissProt database (~400,000 sequences). In addition to the distinct biophysical properties of these under-represented amino acids, it is of interest that their codons tends to be AT rich and they are mostly coded by a limited number of codons (2 codons each).
A significant difference in amino acid composition exists between N. vectensis TR and non-TR proteins ( Figure 4A). We thus tested whether the amino acid composition in TRs of N. vectensis is similar to that of other organisms. Specifically, we compared the N. vectensis TRs to the ~970 unique TRs from human. When comparing the TR sequences ( Figure 4B), a relative enrichment in tyrosine (Y) is evident, and to a lesser extent C, I, L, and V. Interestingly, relative to the human repeats, the TRs from N. vectensis are more enriched with amino acids that tend to form ordered structures ( Figure 4B, blue), suggesting that these repeats may be better suited to form structural units. Similar trends were evident in comparisons with other vertebrate TR-proteins. The most dominant predicted secondary structure associated with the TR repeats is the βsheet (not shown).

Sequence robustness of tandem repeats revealed by multiple open reading frames
Among the 24,906 ORFs predicted for N. vectensis [22], 3875 TR-proteins were selected (Table 1). We tested the extent of valid alternative reading frames among this large set of TR-proteins. In fact, due to the shortage in experimental evidence (ESTs and cDNAs), and the absence of any direct protein information, the reading frame may not be correct. Typically, computational inference for a spe-cific frame is based on an appropriate Kozak sequence near the initiating methionine, on the knowledge of codon usage bias, and on conservation criteria from homology and paralogy. At present, ~50% of the TR-proteins lack an initiating methionine and a similar fraction (45%) holds for the rest of the predicted proteins.
We analyzed the properties of the alternative reading frames for TR-proteins. For this analysis, the set of all TRsegments from N. Vectensis was compiled (4437 segments, JGI proteome). For each such TR-segment, we inspected all alternative reading frames (ARFs), a total of 22,185 potential ORFs. Averaging over all 5 frame shifts, 37.5% of these ARFs do not contain a stop codon ( Figure 5A), i.e., they adhere to the minimal definition of being valid ORFs. To put this result within a statistical context, we performed a simulation for 4437 random ORF sequences, generated based on the nucleotide composition of the coding sequences of the N. vectensis TR-proteins. In this random simulation, only 4.3% of the ARFs did not contain a stop codon, thus hinting at the possible significance of this beneficial use of repeated sequences to yield valid ARFs. Somewhat surprisingly, the high fraction of valid ARFs also carried over for those of the reverse complement strand ( Figure 5A). In our analysis, we found that onlỹ 600 TR-proteins (14%) can be read exclusively in the annotated reading frame. On the other hand, for most of the sequences, there are at least 2-3 additional ORFs (Figure 5B), many of which are potentially long. For example, when limiting the analysis to ARFs (ORF +2 or ORF +3) with length >500 nucleotides, 107 such instances were found (Additional file 2). Amino acid composition in TR-segments

$ 5 ' 9 3 * 7 6 4 * 3 5 6
An analysis of the TR-proteins showed that the overall amino acid composition is rather similar for all 3 reading frames on the coding strand, when the alternative frames are open; ( Figure 5C, p-value for paired t-test <0.007). Note that in cases when the reverse complement frames are also valid, the translated sequences are also often similar in composition to the original frame (see Discussion).
It is important to note that there is a unique scenario, for which repeats of a nucleotide unit will necessarily result in the same amino acid sequence for all 3 reading frames on a strand, with cyclic shifts only in the start and end of the sequence. Figure 5D demonstrates an example where a repeat of 13 nucleotides leads to a TR-unit of 13 amino acids. This phenomenon is actually a byproduct of the basic repeat unit (at the DNA level) having any length that is not a multiple of 3. We found that only for 18% of all TR-proteins (806 instances) was this phenomenon present. Clearly, in such cases, the amino acid composition will remain fixed. Nonetheless, we found the composition to be fixed for almost all TR-proteins ( Figure 5C).

Limited conservation of N. vectensis TR-units along the phylogenetic tree
We tested to what extent the 3300 TRs from N. vectensis is evolutionarily conserved. We focused on representative proteomes for comparison with N. vectensis. Figure 6A shows a tree view of the major branches from the metazoan-fungi separation and within the metazoa kingdom. The Venn diagrams of TR-proteins indicate the number of TR-proteins that are shared among human, mouse, and Nematostella ( Figure 6B). A broader evolutionary perspective is presented by comparing Nematostella with Hydra (Cnidaria) and Monosiga ( Figure 6C). Shared TRproteins are defined by the identity of their TR-units, as calculated by Xstream [20]. We noted that even human and mouse share only 343 TR-proteins, while the overlap between N. vectensis and human and N. vectensis and mouse is even lower (160 and 112, respectively). The analysis of unique TRs shows that 6% of the unique TRs from N. vectensis are also found in human and mouse  ( Figure 6B). Among these, only 64 TR-proteins are shared among all 3 proteomes. While some of the TR-units that are shared among the tested proteomes are rather long, 50-60% of them are shorter than 7 amino acids. A similar comparison relative to the Hydra proteome indicated that 10% and 7.4% of the TR-proteins are shared with Hydra and Monosiga, respectively ( Figure 6C). Only 20 TR-proteins are conserved among all 5 tested species. We concluded that a limited number of TR-units are shared in evolution and the expansion of the TR-proteome is indicative of Nematostella and to a much lesser degree of Hydra. For more details, see Additional file 3. Table 2 summarizes instances of TRs that are shared among N. vectensis, human, and mouse and additional organisms. Among the longest TRs that were identified in human, mouse, and N. vectensis (with almost no variability in their sequence) is a repeat of 76 amino acids that represents multi-ubiquitin proteins. For example, D. mel-anogaster has 4 proteins with such TR-units, with copy number (n) of 4 (Q8MT02), 7 (Q9W418) 10 (A4V1F9) and 14 (Q8MSM5). In N. vectensis, this TR-unit is detected in 2 independent sequences with a copy number of 3 and 7 units (A7SV54 and A7SUP6, respectively). This ubiquitin domain of 76 amino acids appears within a wide spectrum of the taxonomical tree. Representative proteins containing this TR, with n = 3 to 17, are shown in Table 2. Hs1-Cortactin is a TR of 37 amino acids that appears with n = 3 to 7. The 114 amino acid repeat of Calx-beta appears in 7 protein sequences in N. vectensis. It is widely spread throughout the taxa, ranging in copy number from 3 to 41. The Calx-beta motif is present in the cytoplasmic domains of Na-Ca exchangers and in integrin-β4, which mediates signaling across the plasma membrane (see Additional file 4). In all these examples, the TR units are those reported by the Pfam family collection [30]. All N. vectensis examples listed in Table 2 are supported experimentally and confirmed by EST expressions from embryo, larva, and unfertilized egg (not shown).

A functional perspective of N. vectensis repeats
The fraction of Pfam and InterPro entries that are represented among all N. vectensis proteins is slightly lower than that for well-studied genomes (68% compared to 77% for all proteins in UniProt). When the set of TR-proteins is considered, only 31% appear with a Pfam entry that is repeated at least twice. Thus, most of the TR-proteins are undefined by Pfam and InterPro databases. Of course, the strict requirement of tandemness in the TRs excludes proteins with non-conserved linkers between domains (e.g., Annexin repeats) from the analysis.
We thus set out to test the appearance of repeats in the N. vectensis proteome in view of the 252 repeat types that are reported by InterPro (Additional file 5 humans, no marked difference in copy number is shown for most of these repeats (Figure 7). For 2 such repeat domains, N. vectensis exhibits a moderate preference for a higher copy number. For almost half of the instances, the opposite tendency is detected. Since many of the N. vectensis proteins are incomplete (i.e., annotated as fragments), this analysis may underestimate the actual copy number of the TRs in the full sequences. We conclude that, overall, the copy numbers for the Pfam repeats that are well-represented in N. vectensis are rather similar to the copy numbers of these repeats in human (Figure 7).

Evolutionary divergence rate for tandem repeat units
The large number of TR-proteins in N. vectensis and the observation that they are mostly uniquely used in its proteins raises the question: which evolutionary forces act on such repeats? We set out to study whether the evolution of these tandem repeats has been subject to neutral, purifying, or positive selection. We thus applied an analysis Pfam repeated domains in N. vectensis   based on the ratio of asynonymous to synonymous substitutions (Ka/Ks ratio). Due to the shortage of experimental evidence for many of the predicted proteins from N. vectensis, we limited this analysis to the TR-proteins that are supported by ESTs (obtained from JGI genome center). For the 94 sequences analyzed, ~40% of them exhibit a ratio of Ka/Ks > 1.0, while ~50% show the opposite trend. Among the instances of TR-proteins with Ka/ Ks>1, several of them seem to exhibit extremely high ratios, which strongly supports the notion of positive selection on these tandem repeats.

N. vectensis proteome quality and missing annotations
All analyses we performed for the N. vectensis and H. magnipapillata TR-proteins are susceptible to the quality of assembly from the shotgun libraries [29]. The genome of N. vectensis is composed of a large number of disconnected contigs. Transposable elements constitute ~26% of the assembled genome, and they are associated with ~500 families that are considered evolutionary young. The number of DNA transposon types found in the sea anemone is the highest among eukaryotic species [31].
In N. vectensis, only 146,000 ESTs were used to support the proteome annotation. These ESTs led to the validation of ~8000 protein sequences. In addition, ~15,000 paired clone end sequences were generated, resulting in 27,273 predicted gene models. It has been estimated that about a third of the predicted proteins may be subjected to future annotation efforts [22]. We conducted our analysis on 24,906 proteins that are supported by UniProtKB (~91% of JGI predicted proteins). About 2000 TR-proteins are reported following an analysis using stricter criteria (for sequences that do not contain any undefined nucleotides or gaps in the TR consensus). Currently, collections from Cnidaria ESTs (sea fans and soft coral) are available: Anemonia viridis (~40,000), Metridium senile (~30,000), and Acropora palmate (~14,000); and Porites astreoides, Aiptasia pallida, Acropora millepora with ~10,000 ESTs each. We expect that the coverage of TRs in N. vectensis ( Figure 2C) will be revised upon improved annotations and additional experimental data. We are currently testing the expression of predicted TR-proteins from N. vectensis.

DNA repeats as a source of novel coding sequences
Most multicellular eukaryotic genomes contain repetitive sequences. These sequences provide the basis for the study of genome dynamics, recombination, and gene conversion [32]. In general, DNA repeats are believed to serve as a principal source of gene novelty. Repetitive nucleotide sequences such as micro-and mini-satellites have been subjected to rapid gains and losses throughout evolution [15].
A survey on internal repeats in protein sequence databases [33] indicated that eukaryotes have a 3-4 fold higher incidence of internal repeats, as compared to prokaryotes or archaea [34]. Furthermore, long proteins often resulted from the concatenation of internal TRs. We propose that the N. vectensis TR basic architectures are attractive for creating new proteins through recombination and copy number adjustment. For a simple sequence (devoid of TRs) above a certain length, it is rare to detect valid ARFs. Usually, only one ORF (the correct one) can be found, while frame shifts of this ORF are unavoidably punctuated by stop codons. However, we noted that for ~800 TR instances, both ARF frame +2 and frame +3 are valid as ORFs.
It was recently suggested that ~40 long overlooked ARFs are expressed in humans [35], in addition to several previously studied ARFs (GNAS1, XBP1, and INK4). When we applied similar criteria to identify ARF candidates among N. vectensis TR-proteins, 107 such instances were found (Additional file 2). The average fraction of valid ARFs for the TR-proteins in N. vectensis is disproportionately high (37.5%), when considering only the size distribution of these proteins. The analysis of all TR-proteins shows that all 3 frames provide a rather similar overall picture (pvalue < 0.007, Figure 5C) of amino acid composition, even when this is not mathematically guaranteed.
The probability for a stop codon (in alternative frames and antisense strand) decreases as GC content increases [36]. In Nematostella TR-proteins, the strong bias towards multiple ORFs was not associated with high GC content. The GC content for the TR-units is similar to that of the proteome as a whole (47.4%). For humans, the GC content of the entire proteome is higher (52.7%) and similar to that of its TR units (51.9%). We conclude that the length of TRs and their complexity is not a reflection of the GC content.

Tandem repeats with structural and functional characteristics
The energetics and kinetics of TR-containing proteins provide new insights into folding rates and protein stability [37]. In N. vectensis, the TR-units are of an average length of 10-12 amino acids ( Figure 3A), but a TR-segment is of 150 amino acids on average ( Figure 3B). This is in the range of a foldable unit and the size of most structural global units in proteins [38]. For most TRs, there is no functional or structural information. In fact, the annotation of 'predicted protein' covers 96% of the N. vectensis proteins and 99.9% of the TR-proteins.
We tested the TR segments of N. vectensis for their secondary structure propensities using the Jpred3 server [39]. For this analysis, we used all valid ARF sequences that were at least 100 amino acids long (296 such sequences in total). Since Jpred, as other prediction methods, strongly rely on homology through PSI-BLAST, it was not surprising that for most ARFs, no globular homolog was found. On average, 83% of the positions were predicted as disordered coils (C), 7% of the positions predicted as beta sheet (E), and 10% as alpha helix (H). Nonetheless, some of these ARFs (20 of them) were predicted to have between 30% and 74% beta sheet, and others (25 ARFs) 52%-97% alpha helix. Interestingly, most InterPro-defined repeat domains are >30 amino acids long and form non-globular elongated structures. Structure-based functionality of these repeated domains has been illustrated for tetratricopeptide (TPR) [40] and spectrin [41]. Specifically, some repeats serve as building blocks in superstructure and thus steric constraints limit the expansion of the repeated units [11].
The parameters used in this study for identification of TRproteins in N. vectensis excluded most of the Pfam domains [42]. However, overall, the Pfam-defined repeated domains from N. vectensis exhibit a copy number similar to their mammalian counterparts (Figure 7). At present, the validated number of repeated TR elements in N. vectensis cannot be determined. We expect a fraction of them to be eliminated following an exhaustive transcriptomic study. On the other hand, TRs with low sequence identity (<80% amino acid identity with the consensus) were not considered in our study.
Currently, functional information for N. vectensis (i.e., GO annotations [43]) is very limited. Annotations that appears in >40 proteins are sparse and include 'Zinc ion binding' (46 proteins), 'G-protein coupled receptor protein signaling' (42 proteins), and 'Calcium binding' (45 proteins). While the overlap of N. vectensis TR-units with TR-units from Hydra is rather limited ( Figure 6A), about 50 of the TR-units are shared only among Cnidaria representatives and are not detected in human or mouse (Additional file 3).

Evolution dynamics -expansion and deletion of repeated units
Representative genomes from Porifera (sponges) have provided molecular information on the evolutionarily oldest, yet still extant, phylum. Analysis of the genomes of Demospongiae (Suberites domuncula and Geodia cydonium), Calcarea (Sycon raphanus), and Hexactinellida (Aphrocallistes vastus) revealed an increase in gene number due to a burst of gene duplication events. This process was accompanied by efficient domain shuffling. This gave rise to the evolution of new domains (i.e., adhesion molecules) and elements of adaptive immunity [23].
We have shown that the expansion and contraction of repeats carries the potential for an increase in functional variability. The large numbers of TR-segments and additional repeated units in N. vectensis proteome may have an impact on the fitness of Nematostella to cope with the environment. We noted a wide variation in the extent of repeats throughout the evolutionary tree (Table 2). Repeat expansion has shaped many families of proteins (e.g., leucine-rich repeats [33]). The mechanism that underlies the gain and loss of TRs (of the type and size discussed in this study) is most likely a non-reciprocal mode of recombination and gene conversion. Over evolutionary time, gene conversion within repeated sequences has been an important mode in shaping eukaryotic genomes [44]. It is likely that most TR-units in the representative organisms studied ( Figure 6) have evolved independently. The TR-proteins that are shared between human, mouse, hydra, sponge, choanoflagellate, and sea anemone genomes serve as a basis to study the process of their evolution. Exciting examples for the importance of the actual numbers of TRunits within coding sequences are known from yeast to human [45,46].

Conclusions
The proteome of Nematostella vectensis is uniquely characterized in terms of its tandemly repeated amino acid sequences. About 3000 TR-units from N. vectensis were identified, a number that exceeds all other multicellular organisms tested. Comparing TR-proteins to additional representatives of metazoan origin (Hydra and Monosiga)substantiated the uniqueness of the phenomenon in N. vectensis but also confirmed that TR-units are in general less conserved even among closely related organisms (human and mouse). We suggest that the TR-proteins present in early metazoan life may have served as a robust source for novel genes with previously overlooked characteristics. These TR-proteins have unique properties in term of amino acid composition, structural properties, and evolutionary dynamics. As such, they may have unexpectedly functioned as a distinctive and rich origin for gene novelty within the framework of animal evolution.

Data Sources
For the complete Nematostella proteome, we used Uni-ProtKB [47] covering 24,906 proteins (UniProt version 14.0). For each predicted protein, the translated cDNA sequence was extracted from the complete genome of N. vectensis from the Joint Genome Institute (JGI). The cDNA sequences of all other organisms analyzed were extracted from Ensembl using BioMart and its script suite [48].
The data used for assessment of ORF properties and for measuring the DNA sequence divergence rate (Ka/Ks) excluded those proteins for which the amino acid sequence and its corresponding cDNA conflicted. 19,283 sequences from the JGI sea anemone genome project version 1.0 http://genome.jgi-psf.org/ are associated with their protein sequence. Each of the ORF sequences was matched with the cDNA sequence and genome positioning (when available). Analyses were performed on this proteome-genome matched set. The sequences of the TRsegments were extracted to create a TR-only collection. This set covers 40% (of the amino acids) of the original data presented by the N. vectensis JGI complete genome.

Bioinformatics resources
For sequence comparison, we applied NCBI resources and tools including Gene [49], NCBI-BLAST2 ftp:// ftp.ncbi.nlm.nih.gov/blast/ and the JGI genome browser http://genome.jgi-psf.org. Functional annotations are based on InterPro and Pfam assignments. Pfam annotations from StellaBase [42] include 7868 entries for the entire proteome. Using annotations from the UniProtKB cross-reference indicates that 68% of all predicted proteins from N. vectensis have at least one PfamA entry. Inter-Pro version 20.0 [10] lists 18,469 entries, 252 of them assigned as 'repeats'. 49 InterPro entries have no Pfam support and were not further analyzed. The rest of the repeats (203 entries, covering 175,000 UniProt proteins) were analyzed according to InterPro, GO 'molecular function' and 'biological process' terms, and Enzyme Classification (EC), as obtained from the UniProtKB resource.

Prediction methods and tools Tandem repeats
The locations of TRs in the proteins and the transcripts were determined using the Xstream web tool [20]. The parameters of Xstream for protein are (i) TRs are >70% identical (ii) The minimal length of the repeated unit is 3 amino acids (iii) The minimal domain length (defined as the total length of the repeated units) is 10 (iv) the repeated unit appears at least 3 times (v) each repeat unit shares >80% identity to the consensus sequence; (vi) there are at most 3 gaps in the repeats. For nucleic acids, the parameters that define TRs are (i) TRs are >70% identical (ii) The unit is of a minimal length of 10 nucleotides (iii) the minimal domain length is 30 nucleotides. Parameters iv-vi are identical for both proteins and nucleic acids.

Divergence rate
To calculate synonymous (Ks) and asynonymous (Ka) substitution rates, we used the Ka/Ks Calculator developed by the Beijing Genomics Institute [50]. The Ka/Ks ratio compares the rates of asynonymous and synonymous substitutions. A ratio that is <1 indicates that asynonymous mutations tend to be removed from the population. Positive selection (i.e., adaptive evolution) generates a ratio >1. For Ka/Ks measurement, we used only the ~400 TR-proteins that have EST evidence. For each protein, the Ka/Ks ratio was calculated relative to the first repeat unit, which was used as a reference. TR-proteins that show a variation rate <0.05 and TRs that are very short were excluded from the analysis, leaving 94 sequences.
Amino Acid properties Composition Profiler [51] was applied for the analysis of amino acid properties for the TR-proteins, where the basic concept is to compare the amino acid probabilities estimated from a protein sample relative to those estimated from a background sample. P-values are calculated based on the hypothesis that the amino acid composition of the samples comes from the same underlying distribution (corrected for multiple tests). We used a set of 500 randomly chosen non-repeat proteins from the N. vectensis proteome as a background for the composition profile of ORF +1, and another such set for the similarities between the compositions for all 3 ORFs. Note that the composition for ORF +1 was nearly identical as compared to either background.

Statistical estimation and measurement
In order to estimate the statistical significance of the number of valid (with open reading frame) ARFs per protein, we used simulated protein data. To construct such a protein in silico, we created a randomized set of protein sequences and rejected sequences containing stop codons in the first frame, thus exclusively generating proteins for which ORF+1 is valid. Based on the length distribution of the TR segments of the 4337 TR-proteins taken from the actual data, we created 4337 such simulated proteins. For each such protein, we evaluated all possible reading frames, and tallied the appearance of stop codons.