Transcriptome of Sphaerospora molnari (Cnidaria, Myxosporea) blood stages provides proteolytic arsenal as potential therapeutic targets against sphaerosporosis in common carp

Background Parasites employ proteases to evade host immune systems, feed and replicate and are often the target of anti-parasite strategies to disrupt these interactions. Myxozoans are obligate cnidarian parasites, alternating between invertebrate and fish hosts. Their genes are highly divergent from other metazoans, and available genomic and transcriptomic datasets are limited. Some myxozoans are important aquaculture pathogens such as Sphaerospora molnari replicating in the blood of farmed carp before reaching the gills for sporogenesis and transmission. Proliferative stages cause a massive systemic lymphocyte response and the disruption of the gill epithelia by spore-forming stages leads to respiratory problems and mortalities. In the absence of a S. molnari genome, we utilized a de novo approach to assemble the first transcriptome of proliferative myxozoan stages to identify S. molnari proteases that are upregulated during the first stages of infection when the parasite multiplies massively, rather than in late spore-forming plasmodia. Furthermore, a subset of orthologs was used to characterize 3D structures and putative druggable targets. Results An assembled and host filtered transcriptome containing 9436 proteins, mapping to 29,560 contigs was mined for protease virulence factors and revealed that cysteine proteases were most common (38%), at a higher percentage than other myxozoans or cnidarians (25–30%). Two cathepsin Ls that were found upregulated in spore-forming stages with a presenilin like aspartic protease and a dipeptidyl peptidase. We also identified downregulated proteases in the spore-forming development when compared with proliferative stages including an astacin metallopeptidase and lipases (qPCR). In total, 235 transcripts were identified as putative proteases using a MEROPS database. In silico analysis of highly transcribed cathepsins revealed potential drug targets within this data set that should be prioritised for development. Conclusions In silico surveys for proteins are essential in drug discovery and understanding host-parasite interactions in non-model systems. The present study of S. molnari’s protease arsenal reveals previously unknown proteases potentially used for host exploitation and immune evasion. The pioneering dataset serves as a model for myxozoan virulence research, which is of particular importance as myxozoan diseases have recently been shown to emerge and expand geographically, due to climate change.


Background
The relationship between parasites and their hosts is under constant pressure; parasites must invade, replicate and feed whilst avoiding the host immune system. Proteases are the weapon of choice for parasites to overcome these challenges within the host, and can be specifically adapted for cleaving host proteins or modifying their own proteins for immune system avoidance [1][2][3]. Proteases are often high priority proteins for investigation as they have essential roles in development, invasion or feeding [4]. However, proteases are involved in other cellular functions e.g. transport and activation of other peptidases, and it can often be unclear which peptidases are essential to parasite survival or success [5,6]. Drug or interference targets can be difficult to identify in a wash of uncharacterized proteins, however proteases linked to an essential cellular pathway or localised to a particular organelle e.g. lysosome can be considered useful targets for life cycle or development disruption [5,7].
Anti-parasite drugs currently available have been identified by screening sets of compounds in vitro culture systems and by borrowing compounds that have worked in other pathogens and applying those to a new parasite model [8]. Firstly, this limits progress to organisms and life stages that can be isolated and cultured; secondly it relies on applicable compounds having been found in related organisms; and thirdly it limits discovery as it looks at one target at a time for feasibility. In silico drug target discovery in contrast has the attractive attributes of speed, low cost and no requirement for living parasites. In the case of non-model organisms this is likely the first step before prioritising any protein for further experimentation with the aim of anti-parasite treatment development.
Myxozoans are parasitic cnidarians that are important pathogens to both wild and cultured fish populations and yet there are no drug targets specified for this group and limited proteolytic studies to examine activity or function of selected proteins [9,10]. Myxozoans are suggested to have reduced genomes compared to their free living cnidarian relatives [11,12] which could have an impact on the range and diversity of the peptidases expressed. Many aspects of myxozoan biology are still unknown or inferred by comparison with other parasites to infer biology such as their metabolism (Thelohanellus kitauei - [12]), their replication [13] or proteins interacting with the host immune system (reviewed in [14]). Myxozoans are entirely parasitic in their life cycle, they alternate between a vertebrate and an invertebrate host with two entirely different types of transmissible spores in each developmental phase [15][16][17][18]. Myxospores are often hardy stages that are capable of being exposed to the environment for long periods of time waiting for uptake by their invertebrate hosts. The actinospores are generally more fragile and only viable for a limited period of time as they are released into the water column to encounter a suitable vertebrate host [19]. There are two main sources of material for genomic and transcriptomic analysis, plasmodia or cysts of developing myxospores from the vertebrate [11,12] or actinospores released from their invertebrate host [11]. Spore development represents the final step prior to transmission with the genetic arsenal related to their production of durable spores often expressed in cysts, separated from the host immune response by connective tissue, while actinospores are collected from the environment, prior to infecting their vertebrate hosts. Therefore, they do not provide many insights into what proteins are helping the parasite feed or replicate or evade immune detection.
Sphaerosporids are a major clade of the Myxosporea, with a large proportion found in bony and cartilaginous fish, and amphibians [20][21][22][23][24]. A specific trait that has only been identified in this clade is the presence of large, extracellular stages circulating in the blood stream of their fish hosts [25][26][27]. The parasites not only use the blood for transport to their target organ but proliferate within it and are present almost all year round (Fig. 1, [26,28,30]). Sphaerospora molnari is a parasite of the common carp in Central Europe with motile blood stages that provoke a strong immune response [29] and are a likely co-factor for developing Swim Bladder Inflammation [30]. S. molnari blood stages (SMBS) are prime targets for parasite intervention therapy, as they are 1) responsible for massive proliferation in the earliest stages of infection of fish, 2) freely circulating in the blood and any drug targeting the SMBS would not need to be applied to host tissue or taken up by host cells; 3) they are circulating in the blood for an extended period and therefore there is a longer window for application of anti-parasite therapies. In addition, preliminary protein studies on SMBS show a high level of sequence divergence even in highly conserved proteins such as actin [28] and therefore SMBS could potentially have proteases that are highly divergent from their hosts as well as other cnidarians which would aid protein target assay development. This study examines protease families and groups present in the transcriptome of SMBS to investigate their diversity and divergence. We compare key protease groups with examples known from other parasites that have been successfully flagged as drug or anti-parasite targets. In addition, we provide gene expression data for selected candidates with the goal of identifying stage-specific proteases of interest for future functional studies.

Results
This transcriptome is the first next generation sequencing dataset from any sphaerosporid species, and also the first dataset from a highly proliferative, extrasporogonic developmental myxozoan stage. Pooled host (Cyprinus carpio) blood cells and Sphaerospora molnari blood stages (SMBS) from 8 infected fish were used for this transcriptome. Illumina HiSeq sequencing yielded a total of 52,040,447 clean paired reads, mapping these to the gene models of C. carpio removed 14,849,448 reads (BioProject PRJNA522909). A trinity assembly of these remaining reads gave 157,506 transcripts, (mean length 766 nt, 39.75% GC content). 127, 741 of these transcripts (81.1%) were found in the carp genome with an e-value of 1e − 05 or more and a percentage identity of > 75%. The remaining 29,765 transcripts were therefore assumed to be S. molnari, and were translated into 29,588 proteins (Table 1). To examine the presence of potential chimeric sequences, we amplified a substantial number (n = 15) of our flagged proteases and ribosomal DNA to verify their sequence according to our assembly.
Sanger sequencing of the complete ribosomal DNA yielded 13,486 bp (Genbank Acc. Nr MK533682), and large fragments of all flagged proteases were also verified from cDNA (Supplementary Data).
Screening with BUSCO, identified that S. molnari retained and is expressing at least half of the 978 benchmark metazoan genes [31] (53% of the single copy metazoan genes). We completed the same analysis for myxozoans Myxobolus cerebralis and Kudoa iwatai, and non-myxozoans Polypodium hydriforme, Edwardsiella lineata, Nematostella vectensis (Table 2). S. molnari had the highest number of complete BUSCOs of all the myxozoan datasets, whereas N. vectensis had the highest overall (905/978, 93%), similar results were found for the single copy BUSCOs. S. molnari had the lowest number of missing genes (374/978, 38.2%) within the myxozoans, in comparison N. vectensis was only missing 30 genes (30/978, 3%) ( Table 2).  1 Developmental cycle of Sphaerospora molnari within its host Cyprinus carpio. Sphaerospora molnari blood stages and infected gill images by A.S. Holzer, (blood stages modified from [28,29] We queried the transcriptomic dataset of SMBS for proteases using representative protease sequences downloaded from the MEROPS database. Less than 1% of all transcripts had a strong sequence match. There were 235 homologs identified in SMBS representing 45 peptidase families, the majority of the proteases were cysteine (38%), followed by metalloproteases (31%), serine (15%), threonine (14%) and aspartic groups (2%) (Fig. 2). Families that were highly represented in SMBS (Table 3) were C01 Papain-like proteases, C12 and C19ubiquitinyl hydrolases, M24aminopeptidases and dipeptidyl peptidases, M67ubiquitin releasing proteases often associated with proteasomal degradation and T01proteasome proteases. There were many families that were absent in all examined cnidarians, and even more, missing from only the myxozoans compared to the free living species e.g. S28 lysosomal carboxypeptidase.
To more closely examine the proteases present in SMBS, we looked at enzymes that were in highly represented families, and were transcripts with a high number of reads mapping to them (TPM) or had high similarity to proteases known from other parasite species. In particular, we examined cysteine proteases in the MEROPS family C01cathepsin L, aspartic proteases in the family A22 -Signal peptide and Presenilin-like proteases, metallopeptidases in the M12the metzincins, and S09 the Prolyl oligopeptidase family.
Cathepsins: S. molnari's transcriptome revealed eight cathepsin-L like sequences by sequence homology, however, five were excluded from our further analysis due to 1) incomplete transcript and 2) missing or uncharacterised active sites at either substrate S1 or S2 sites, or 3) a sequence homology that appeared to be closest to cathepsin L but in fact had chymotrypsin-like folds (Ser-His-Asn) or Gly-His-Asn catalytic triads. Three cathepsin Ls analysed (CathL1-3) were all propeptides with signal peptides which may indicate later activation within the cell (e.g. lysosome) or extracellularly [32]. All had conserved a glutamine for the oxyanion hole known for cysteine peptidases, however, Sm_CL3 did not retain       a stabilising asparagine close to the His active site, this was replaced by a negatively charged aspartic acid. We aligned the predicted tertiary structures of S. molnari cathepsin Ls to a procathepsin L from a metazoan parasite with x-ray crystallography evidence (PDB: 2O6X, [33]) ( Fig. 3a-c). Two of S. molnari's cathepsin Ls, Sm_CL 1 and 2 were able to be aligned to the crystal structure with high confidence, however, there was a marked difference in the distribution of hydrophobic resides ( Fig.  3b-c). In particular, there were higher numbers of hydrophilic residues at the active site compared to Fasciola hepatica ( Fig. 3d-f). The number of charged residues were similar overall, however the distribution of positive and negatively charged amino acids was different between all three proteins ( Fig. 3g-i). Aspartic: We identified two aspartic proteases in the transcriptome of SMBS, one presenilin-like and a signal peptide peptidase-like sequence (Sm_SPPL). The presenilin-like protease had a lower TPM value than Sm_ SPPL (12.66272 compared to 146.829). We therefore focused our analysis on Sm_SPPL, which clustered with other invertebrates (Fig. 4a) and conformed to the structure of a signal peptide peptidase including active residues in transmembrane domains 6, 7 and 9 in the predicted model (Fig. 4b, c, d). Sm_SPPL has the appropriate active site residues and was able to align the predicted tertiary structure of a SPPL. Sm_SPPL was distinguished from a presenilin by the presence of QPALLY motif in its last transmembrane domain, like others in this group [34].
Metallopeptidases: Seventy two metallopeptidases in 16 of 29 families were flagged in the S. molnari proliferative stage peptidases. All of the S. molnari metallopeptidase families were shared by one or both of the other myxozoan species (M. cerebralis or K. iwatai) examined as well as two or three of the free living species (Table  3). The highest proportion of SMBS metallopeptidases were in the M12 family (adamalysins and reprolysins). We examine five of these metallopeptidases in more detail (Sm_MP1 TPM = 14.5; Sm_ MP2 TPM = 16.17; Sm_MP3 TPM = 22.07; Sm_MP4 TPM = 2.76; Sm_MP5 TPM = 0.79). All except one of the five have the methionine turn and zinc binding motif HEXXHXXGXXH with either a serine or an aspargine binding residue. Two of the target metallopeptidases had signal peptides, and almost all had reprolysin and disintegrin domains. Sm_MP4 has a potential C-terminal transmembrane tail that may anchor it similarly to other well-known ADAM proteins.
Serine: Multiple serine proteases were identified within SMBS transcriptome, the largest was MEROPS family S09. The transcriptome of SMBS yielded a serine protease with sequence and structural homology to dipeptidyl peptidase 4 (Fig. 5b, c). SMBS_DPP contained the catalytic triad (Ser-633, Asp-711, His-743), a predicted transmembrane domain at the N-terminus. Phyre2 was able to model the sequence with 100% confidence according to the models of other dipeptidyl proteases (DPPIV and VIIII). The sequence clustered to Thelohanellus kitauei DPPIV and Hydra vulgaris POP isoforms (Fig. 5a) however, both of the H. vulgaris isoforms contained a transmembrane domain and a high homology to DPPIVs rather than POPs.
We then examined the expression of eight key proteases in blood stages compared to spore forming gill stages by qPCR and also in silico expression (TPM). Three cathepsins Sm_CL1, 2 and 3; a presenilin like aspartic protease (Sm_SP1); a dipeptidyl peptidase (Sm_DDPIV); a metallopeptidase (Sm_MP1) and two lipases (Sm_Lip1 and 2) were used as target proteases. Half of our candidates were upregulated in blood stages rather than gill stages (Fig. 6). Expression of Sm_CL3 was over 500x higher in sporogonic gill stages than presporogonic SMBS (Fig. 6a). Sm_ CL3 was predicted to contain a transmembrane domain and had a relatively low TPM value according to our assembly (2.3503) compared to the other two cathepsins Sm_CL1 and 2 (TPM = 782.87 and 77.28, respectively), both of which had signal peptides and were transcribed as procathepsins. Both Sm_CL1 and 2 showed similar expression in the blood and the gills. An astacin metallopeptidase and two lipases were also upregulated in blood compared to gill samples (Fig. 6b). In contrast, both the aspartic and serine proteases, Sm_SP1 and Sm_DPPIV respectively were significantly upregulated in spore forming stages in the gills. Sm_DPPIV did not have a high relative expression within our transcriptome (TPM = 3.99), however when we compared its expression in blood stages vs. gill stages, we found it was more highly expressed in sporogonic gill stages by almost 100 fold. Sm_SP1 had a higher TPM value than Sm_DPPIV (146.829) and was 42 times more highly expressed in the gills than blood.

Discussion
This transcriptome analysis of S. molnari is the first from this entire clade of sphaerosporid myxozoans, this is the only group for which an invertebrate host is unknown and is the only dataset from an extrasporogonic stage of development. Therefore, it offers a unique insight into the mechanisms of myxozoan development and host interactions. Our focus on proteases, was primarily in the pursuit of identifying proteins that would be worthy of further investigation to understand their role in the host-parasite relationship.
The transcriptome of S. molnari blood stages (SMBS) appears to follow a similar trend as other parasitic and free living cnidarians in its heavy AT richness for many of its genes [35] and its overall reduction in genes [11,12]. Gene divergence and AT richness of the SMBS dataset aids distinction of host and parasite genes in many cases and could also aid targeted gene therapy as seen in other anti-parasite drug design e.g. Plasmodium falciparum [36]. The number of proteins identified could have been limited due to the extensive host filtering we conducted on the dataset, divergent genes and/or the nature of the particular life stage of SMBS. Separation of the SMBS from the host immune cells would significantly improve further transcriptomic analysis to combine physical and bioinformatic filtering. More than half of the known BUSCO metazoan genes were identified in this dataset, similarly to M. cerebralis triactinomyxon stages and sporogonic K. iwatai transcriptomes. Although using a different comparative dataset (CEGMA, previous benchmarking metazoan dataset) Chang et al. [11] also noted a reduced number of core eukaryotic genes in the transcriptomes and genomes of K. iwatai (70% of CEGS, compared to our 56% of BUSCOs) and M. cerebralis (39% of CEGS compared to 55% of BUS-COs) whilst P. hydriforme consistently had more than its myxozoan relatives and was closer in numbers to the free living species (90% of CEGs and 89% of BUSCOs). BUSCO genes are also used to estimate the completeness of a genomic or transcriptomic dataset, whilst it is unlikely that we sequenced the extent of the expressed transcripts of S. molnari due to overwhelming host contamination, it is interesting to note an overall reduction in this gene set for myxozoans in general. For comparison, other early branching metazoans and cnidarians have retained a high percentage of these conserved proteins, e.g. Placozoa species Hoilungia hongkongensis had 90-95.3% [37]; sponge Halisarca caerulea had 92.6% [38]. Free living cnidarians have variable accounts of Fig. 6 Real-time PCR of selected proteases in S. molnari blood stages (non-sporogonic) and gill stages (sporogonic). Relative abundance to two housekeeping genes (Elongation Factor 2 and Glyceraldehyde-3-phosphate dehydrogenase) in cDNA samples of circulating blood stages (n = 3) and spore forming gill stages in common carp (n = 3) including 95% confidence intervals and average for each marker. a Cathepsins (Sm_CL1, 2 and 3) in blood stages (red) and sporogonic gill stages (blue); b Other proteases (A22 -Signal peptide and Presenilin like peptidases (Sm_SP1); M12 metallopeptidase (Sm_MP1); S09 -Prolyl oligopeptidase (Sm_DPPIV 1) and Lipases (Sm_Lipase 1 and 2) in blood stages (red) and sporogonic gill stages (blue) their retained complete BUSCO genes, e.g. Acropora digitifera 69% (675/978) [39]; Exaiptasia pallida 84.5% (826/978) [39]; Scleractinian coral Porites lutea 56.4% [40]. Other metazoan parasites have also retained a higher level of genes e.g. the trematode Microphallus sp. maintained close to 70% of genes from the same metazoan dataset [41]. The transcriptomes of parasitic nematode Teladorsagia circumcincta were found to have only 23-38% of the conserved gene set yet appeared to have good matches to the representative genome [42]. Furthermore, cestode parasites have a range of conserved eukaryotic genes from Taenia multiceps (81.8%) to Echinococcus multilocularis (91.6%) [43]. Additionally, despite removing protein sequences that were more than 90% similar to each other in the S. molnari dataset, there were a high number of duplicated single copy BUSCO genes ( Table 2). This may be caused by the pooling of wild-type individuals sampled from different locations and populations. De novo assemblers such as Trinity, designed to generate alternative transcripts, are more sensitive to sequencing errors and/or highly heterozygous datasets, and thus multiple loci may have been assembled into 'isoforms' of the same parent transcript increasing the amount of duplication [44,45].
We examined some specific examples of proteases of interest within the S. molnari transcriptome that were homologous to other known parasitic proteases involved in important host-parasite functions. There are examples of parasitic proteases that have been exploited as drug targets in all different groups and protease families. Two of the aspartic peptidases we identified in the SMBS transcriptome were also in groups that had been identified as potential drug targets. A presenilin-like and a signal peptide peptidase-like protease (Sm_SPPL) were found in SMBS both are in the MEROPS protease family A22. The proteases are closely related, SPPLs are transmembrane proteases with their catalytic sites buried in bordering transmembrane domains, and have varied functions in eukaryotes depending on their localization within the cell [46,47]. In humans, they are involved in the processing of peptides for the MHC I epitopes and their central role in processing signal peptides indicates an important role in signaling and protein modification as part of the endoplasmic reticulum [34]. In contrast to SPPLs, presenilins require a co-factor and multiple subunits to form a catalytic γ -secretase complex [46]. Both aspartic proteases have been identified as potential drug targets in Plasmodium falciparum [48], and can potentially play a role in the cleavage of peptides for antigen presentation on a cell surface [49]. SPPL blockers have been shown experimentally in parasite-derived homologs e.g. LY411,575 (reduced Plasmodium berghei in mice and humans [50]), and bioinformatically as suggested in protozoan species Toxoplasma gondii, Leishmania infantum, and Trypanosoma cruzi [51]. Both S. molnari proteases are divergent from their hosts and in the case of Sm_SPPL have been shown to be highly expressed during its sporogenesis development in the host's gills. This localization may be useful for targeting future therapeutic assays at such an aspartic protease in this system.
Another group of proteases that are commonly targeted for anti-parasite therapy are the metallopeptidases. These have been associated with many functions of parasitic development, they are linked to digestion e.g. falcilysin in Plasmodium falciparum [52], host immune system disruption e.g. matrix and secreted metalloproteases [1,53]; or host immune system evasion e.g. Leishmania GP63 [2]. SMBS had a large variety of metallopeptidases, it is also important to note families that are missing in the myxozoan species, for example we did not identify any M08 homologues which have been originally characterised in Leishmania to aid immune system avoidance in promastigotes [54]. Nor did we identify any matrix metallopeptidases (M10 family) which have also been previously identified as protozoan parasite drug targets [55,56]. One of the largest groups we identified in SMBS was M12, the metzincins. These have been found in other parasitic organisms feeding in their host's blood and are considered important for digestion or nutrient uptake which may be their role in SMBS as they are circulating in the host's blood. The diversity of SMBS metzincins also indicates varied roles for these metallopeptidases in these developmental stages. Although the function of each specific protease are not known, and they were not considered to be highly expressed (TPM values range from 0.7-22.07), they share some valuable attributes that make them strong candidates for future research. Primarily, they share a key domain and binding site and none of the potential target sequences (Sm_MP1-5) had "cysteine rich" domains identified in their sequence which has been discussed as a possible hindrance to recombinant protein production [57]. Recombinant protein application or other interference with parasite metzincins has been shown to be a successful anti-tick treatment [58] and may potentially be employed with any of these SMBS metallopeptidases. A recombinant protein with epitopes of multiple SMBS metzincins, considering they are so divergent from each other as well as any recognisable fish host homolog, could hold potential for a therapeutic experiment.
In the serine proteases, the largest family found in SMBS was S09. This family has multiple groups of which only some have been identified as putative drug targets. One such group is the dipeptidyl peptidases (DPPs) and prolyl oligopeptidases (POPs), which specialize in the cleavage of proline residues (DPPs cleave at N-terminus, POPs cleave at C-terminal). The S09 protease we looked at as a potential drug target in SMBS had similarities to both groups however we determined it to be a DPP according to its sequence, phylogenetic position and its structural homology to DPPIV (Fig. 5). DPPIV has also been identified in bee and snake venoms and possibly influences vasodilation and constriction, something that may be useful to SMBS in host blood vessels. DPPIV exists as a homodimer (potentially tetramer) targets proline residues and has suggested roles in antigen presentation on the cell surface, cell adhesion and collagen binding activity [59]. In protozoans, DPPIVs have been shown to play a role in the encystment of Giardia, a process that can be blocked by the application of inhibitors [60]. DDPIVs have been associated and inhibited in other blood-feeding parasites including Haemonchus contortus where it is suspected to play a role in fibrogen breakdown and coagulation in the blood [61]. Due to the choice and variety of DPPIV inhibitors that are available for both human and veterinary medical applications, we propose that SMBS DPPIV could be a good target for assessing the impact of its inhibition on the proliferation and metabolism of S. molnari in its host. This is based partially on its unique sequence as evidenced from its relation to other homologs, its predicted structure which could aid the application of known inhibitors, and its expression in a developmental stage that could be feeding on blood.
One of the most well researched protease groups in parasites are cysteine proteases, commonly investigated for their potential as therapeutic targets in parasites as they are known to be involved in a variety of pathways as well as being associated with parasitic development and proliferation [62]. In myxozoans, cysteine proteases have been suggested to be involved in the proteolytic destruction of host tissue based on activity and substrate assays [10,63,64]. They are also associated with host hemoglobin degradation in blood feeding parasites, e.g. many parasitic helminths [65] and maybe the case for S. molnari which proliferates in host blood and interacts with erythrocytes [29]. The replacement of a stabilising asparagine close to the His active site of Sm_CL3 was also seen in a cathepsin L of the myxozoan Kudoa thyrsites. K. thrysites is suggested to use its cathepsins for the degradation of host tissue, i.e. "Milky Flesh Disease" [9]. S. molnari cathepsins appear to have all of their disulfide bridges in the three isoforms we examined in contrast to the cathepsin Ls reported from K. thyrsites [9]. Sm_CL1, 2 and 3 had close homology to Fasciola hepatica cathepsins, these proteases are expressed throughout F. hepatica's development with distinct roles in feeding and invasion based on their binding sites [62,66]. All Sm_CLs had higher numbers of hydrophilic residues at the active site compared to F. hepatica (Fig. 3a-c). The number of charged residues were similar to each other overall, however the distribution of positive and negatively charged amino acid was different between all three proteins (Fig.  3). These changes could potentially impact the activity and substrate affinity for the proteases. Cysteine proteases are particularly studied in parasites due to their roles in moulting, encysting and digestion across parasite taxa, and in particular they are often targeted for anti-parasite therapies or inhibition [62,65,[67][68][69]. Here, we present the first expression analysis of cathepsins in SMBS that are potentially important in the feeding, immune evasion or tissue penetration of this parasite in its host. Characterisation of these proteases (substrate, inhibitors, abundance in the proteome) will advance our knowledge of the roles these cathepsins play in the development of this parasite and further inform our prioritization of protease targets for intervention and control assays.
Cathepsin Ls were the most highly expressed (using TPM as indication) proteases identified within our transcriptome, that combined with the large number of in vitro and in vivo experiments on other parasite cathepsins and their effect on parasite survival make them prime candidates for further development. Some examples use peptides mimicking compounds such as aziridine-2,3dicarboxylate-based inhibitors to inhibit cathepsin L activity e.g. Trypansoma brucei rhodesiense [4] or non-peptidic inhibitors e.g. chalcones in F. hepatica [70]. S. molnari is unusual in the number of cathepsin L's expressed in its blood stages, not just isoforms but distinct proteases. Until in vitro assays are able to better pinpoint which cathepsins are linked to SMBS survival, we rely on sequence similarity and predictive modeling to infer homology to other cathepsins including those that have been inhibited successfully. Sm_CL1 and 2 as described above, had signal peptides for potential secretion and they differed in amino acid composition (37.0% identity). Sm_CL2 had more hydrophobic residues at the S1 active site and Sm_CL1 was more highly expressed in the blood stage than Sm_ CL2. Both had predicted structures that aligned to FhCL1 F. hepatica's cathepsin's structure which has been inhibited as well as another of its cathepsins FhCL3 with flavonoid compounds both experimentally and with computational docking [70]. Both F. hepatica cathepsins exhibited hydrophobic interactions with compound C34, and were successfully inhibited although at different rates depending on the concentration of the compound. Experimental evidence would be needed, however we propose Sm_CL1 and 2 as good candidates for inhibition with flavonoids such as C34 in future in vitro assays. The divergence in sequence identity from their host, their expression in both blood and gill stages and their similarity to proteases that have been successfully blocked in such assays are all good evidence that these could be the first drug targets for S. molnari.
Comparison of the expression of some of the proteases in non-spore forming blood stages and gill sporogonic stages gave further insights into the function of some groups. The cathepsin Ls that we examined each had a different expression profile during SMBS development, with Sm_CL3 being the main cathepsin expressed in the gills. Localisation and expression would influence which cathepsin to target for future work, and may indicate roles in sporogenesis for Sm_CL3 compared to CL1 and CL2 in the blood. CL1 and 2 may act as other cathepsins in blood feeding parasites and break down hemoglobin for sustenance. Lipases and metallopeptidases are also associated with feeding in parasites and were comparatively higher in SMBS than in the gills. Astacin metallopeptidases have also been shown to break down complex proteins such as hemoglobin or migrate through tissues [4,71,72], and its upregulation in the proliferative developmental stage rather than spore forming could indicate a role in nutrient acquisition or penetration of target tissues (Fig. 1) in SMBS. Lipases have been suggested to play role in parasite cell surface protein presentation as well as digestion of host protein [73,74], the two candidates we show are upregulated in SMBS should be further investigated for their substrate and localisation within the SMBS to further characterise if they are integral to parasite survival. Disruption of such enzymes could be of interest to uncover the proteomic arsenal SMBS use to digest and navigate its host for survival. Conversely, the aspartic (Sm_SP1) and serine protease (Sm_DPPIV) would be more informative to investigate where they appear to be highly expressedthe gills. These two proteases have been associated with antigen presentation and host immune system avoidance, it could be significant that they are relatively abundant when the parasite is bound within host target tissue, rather than its extracellular blood stage, where it would be directly in conflict with the host immune system. Localisation and further characterisation will be vital to learning more about the role these proteases play in the developmental stages we identified them in and determine if they are interesting targets with anti-parasitic potential.

Conclusions
We produced and explored the first transcriptomic dataset of early proliferative myxozoan stages to date and identified family expansions in cysteine, metallo and threonine clans. We did not identify any myxozoanspecific radiations in particular groups, however, all of the myxozoan proteases we examined were highly divergent from each other as well as from other cnidarians. Vaccine development against a number of metazoan parasites is based on proteases as antigens of interest [75][76][77]. However, with regard to myxozoans, the function and involvement of these enzymes in host-parasite interaction first need to be elucidated as a major lack of knowledge exists with regard to metabolomics and the molecular means of host interaction. The general strategy for therapeutically targeting proteases is to identify a specific inhibitorgenerally a small moleculethat blocks the active site. Discovery efforts for new inhibitors have typically been based on the structure of known protease substrates, presenting a substantial challenge for the development of peptidomimetic compounds that have the pharmacokinetic characteristics needed to be suitable as a drug.
This study advances our knowledge of myxozoan protease sequence, predicted structure and in some cases hydrophobicity and amino acid changes. This information furthers investigation into the potential role these proteases play in the development, sustenance and host immune evasion of these important parasite stages. Vaccination plays an important role in commercial largescale fish farming and is a key reason for the success of salmon aquaculture, however, available vaccines are aimed at bacterial and viral pathogens, while parasite vaccines for fish are still inexistent [78], likely due to an insufficient knowledge of potential parasite target molecules of fish parasites, when compared with parasites of human or veterinary importance. It is now more than timely to explore new genomic data for such targets as epidemical models predict major emerging disease outbreaks and an increased geographic range of myxozoan species such as T. bryosalmonae, in relation to temperature increase [79,80], with new records from northern European territories [81][82][83] and recent reports of massive fish killings from the Yellowstone river [84,85], Hutchins et al. 2017). Exploring the enzymes expressed during early establishment and proliferation of myxozoan infections is essential to finding putatively relevant vaccine targets that can inhibit rapid multiplication of cryptic parasite stages in fish, long before the onset of disease. The proteases we discuss here are putative targets for further research, confirmation of their expression in different stages of S. molnari's life cycle (SMBS vs. gill sporogonic stages vs. extracellular secretion) would be an invaluable method of testing their activity and function and therefore their use in anti-parasitic development.

Animal and sample collection for transcriptome analysis
Common Carp (Cyprinus carpio) were obtained from two localities, Štrmilov in Czech Republic (49.1644°N, 15.2031°E) and Hortobágy in Hungary (47.3542°N, 21.0000°E) during 2013-2015. All fish were obtained commercially, were less than 2 years old and were transported live to the laboratory. Fish were anaesthetized with clove oil and blood was taken, all animal procedures were performed in accordance with Czech legislation (section 29 of Protection of Animals Against Cruelty Act No. 246/1992) and approved by the Czech Ministry of Agriculture. 8 fish were found to be infected with Sphaerospora molnari blood stages (SMBS), and whole blood was centrifuged for 5 min at 3500 rpm in heparinized hematocrit tubes to isolate host white blood cells mixed with SMBS.

For qPCR analysis
Presporogonic S. molnari blood stages and sporogonic gill stages were isolated from experimentally infected fish (n = 3) in the Czech Republic or recirculation system in Hungary (n = 3). Specific parasite free fish tissues were selected from laboratory cultures within the Czech Republic (n = 3). Fish were euthanized according to the ethics license and methods above. with bowtie2, using -very-sensitive parameter. All remaining reads were then assembled with Trinity v2.4.0 and the transcripts were compared with the carp genome again for further host transcript identification (BLAST parameters: tblastx, 1e − 05 ), those with a percentage identity of > 75% were removed to create a "non-host" dataset which was translated to protein using OrfPredictor [86]. Redundancy of the translated non-host dataset was removed with CD-Hit using 0.9 cutoff [87] and blastp (1e − 05 ) annotated with NCBI nr database.

Comparative assemblies
Transcriptomic sequences for other cnidarian species were downloaded from NCBI: Myxozoans -Myxobolus cerebralis triactinomyxon stages from Tubifex host (PRJNA258474), Kudoa iwatai myxospores from cysts of Sparus aurata tissue (PRJNA248713). Non-myxozoan cnidarians with endoparasitic life stages -Polypodium hydriforme non-parasitic stolons (PRJNA251648), and Edwardsiella lineata (mix of parasitic and free living life stages) (downloaded from EdwardBase site when was active in 2014 [89]. Finally, to compare with a completely free living relative, genome derived RefSeq mRNA sequences of the anthozoan Nematostella vectensis PRJNA19965 were downloaded from NCBI. Transcripts were translated into peptide sequences by OrfFinder [86] and searched for proteases as above, the redundancy was removed (0.9 cutoff) and the non redundant dataset was screened for BUSCO genes as above.
Target protease groups: Four groups of proteases were more closely examined (A22 -Signal peptide peptidases and presenilins, C01-Cathepsin Ls, M12metalloendopeptidases, S09 -Prolyl oligopeptidases and dipeptidyl peptidases). Representative S. molnari sequences with other sequences from Genbank and Uniprot (including fish sequences) were analysed phylogenetically with fish and other cnidarian and parasitic-derived sequences using RaxML (L + G + I). Tertiary structures were predicted for key proteases using the Phyre2 server [90] and models were compared and manipulated in PyMol ver.

(The PyMOL Molecular Graphic System).
Sanger sequencing of key predicted proteases and rDNA RNA was extracted from 3 biological replicates of S. molnari proliferative blood stages (pooled samples from several individuals) and spore-forming stages (individual fish). Total host+parasite RNA was isolated using the Nucleospin RNA Kit (Machery-Nagel) including a DNase treatment step. RNA concentration and purity was checked using a Nanodrop (ND-1000) Spectrophotometer (NanoDrop Technologies) and cDNA was synthesised using the Transcriptor High Fidelity cDNA synthesis Kit (Roche). Primers were designed to amplify full length sequences of selected proteases and ribosomal DNA as single amplicons or with long overlaps between individual sections, to confirm the assembled transcriptome sequences.

Quantification of stage-specific expression of candidate proteases
Gene-specific primers were designed to amplify short, 70-150 bp regions suitable for qPCR (Supplementary Data). All primers were tested for functionality and specificity using conventional PCR prior to performing qPCR. qPCR was performed using the FastStart Universal Sybr Green Master Mix (Rox) on LightCycler® 480 Real-Time PCR System (Roche). Reactions contained 12.5 ul of FastStart Universal SYBR Green PCR Master Mix, Roche, Germany (2X conc.), 1 μl of each forward and reverse primer (10 μM conc.), 5.5 μl of PCR grade water, and approx. 500 ng of cDNA, resulting in a final volume of 25 ul. Cycling conditions were as follows: Denaturation at 95°C for 5 min, followed by 50 cycles of 95°C for 10 s, and 58°C for 10 s and 72°C for 10 s. Melting curve analysis were performed after each qRT-PCR to ensure primer specificity. The relative expression ratio of each sample was calculated according to Pfaffl [91], based on the take-off deviation of sample versus controls at each time point and normalized relative to Elongation Factor 2 and Glyceraldehyde-3-phosphate dehydrogenase (housekeeping genes, [29]). Confidence intervals, and box plots made in R.