funRNA: a fungi-centered genomics platform for genes encoding key components of RNAi

Background RNA interference (RNAi) is involved in genome defense as well as diverse cellular, developmental, and physiological processes. Key components of RNAi are Argonaute, Dicer, and RNA-dependent RNA polymerase (RdRP), which have been functionally characterized mainly in model organisms. The key components are believed to exist throughout eukaryotes; however, there is no systematic platform for archiving and dissecting these important gene families. In addition, few fungi have been studied to date, limiting our understanding of RNAi in fungi. Here we present funRNA http://funrna.riceblast.snu.ac.kr/, a fungal kingdom-wide comparative genomics platform for putative genes encoding Argonaute, Dicer, and RdRP. Description To identify and archive genes encoding the abovementioned key components, protein domain profiles were determined from reference sequences obtained from UniProtKB/SwissProt. The domain profiles were searched using fungal, metazoan, and plant genomes, as well as bacterial and archaeal genomes. 1,163, 442, and 678 genes encoding Argonaute, Dicer, and RdRP, respectively, were predicted. Based on the identification results, active site variation of Argonaute, diversification of Dicer, and sequence analysis of RdRP were discussed in a fungus-oriented manner. funRNA provides results from diverse bioinformatics programs and job submission forms for BLAST, BLASTMatrix, and ClustalW. Furthermore, sequence collections created in funRNA are synced with several gene family analysis portals and databases, offering further analysis opportunities. Conclusions funRNA provides identification results from a broad taxonomic range and diverse analysis functions, and could be used in diverse comparative and evolutionary studies. It could serve as a versatile genomics workbench for key components of RNAi.


Background
RNA interference (RNAi), a term first coined in research on Caenorhabditis elegans, was originally thought to be a host defense mechanism against invasion of viral genomes or transposable elements [1]. However, several molecular studies revealed that it is also involved in diverse cellular, developmental, and physiological processes [2][3][4][5]. Gene silencing by RNAi begins with recognition of aberrant RNA (aRNA) or introduction of double-stranded RNA (dsRNA), such as viral genomes. RNA-dependent RNA polymerase (RdRP) is responsible for the generation of dsRNA from aRNA. Dicer slices dsRNA into small (21-25 nt) pieces. Argonaute then acts on these fragments by forming an RNA-induced silencing complex (RISC), which is subsequently guided to target mRNAs, resulting in gene silencing.
Despite the importance of this universal machinery in eukaryotes, many studies on RNAi focused only on functional, physiological, and molecular aspects, rather than comparative genomics. It is known that particular fungal taxa, for example budding yeasts, do not have the key components of RNAi [16]. Hence a systematic, extensive identification and evolution analysis are needed to determine the clear distribution of the genes and to trace their evolutionary histories. Furthermore, considering that Argonaute-encoding genes were found in a few noneukaryotic species [17], the taxonomic distribution and phyletic trajectory of these important genes could tell us more about their ancestral origin. As a solution, we developed funRNA http://funrna.riceblast.snu.ac.kr/ to provide a gene catalogue based on 1,440 genomes and a comparative, evolutionary genomics platform for the genes encoding Argonaute, Dicer, and RdRP.
In this paper, we discuss the following: i) the taxonomic distribution of the key components of RNAi; ii) sequence analysis of predicted RdRPs by multiple sequence alignment; iii) auxiliary domain variation in Dicers; iv) evolutionary analysis of the putative genes encoding Argonautes by gene duplication and loss; and v) database and web functionalities available on the funRNA website.

Results and discussion
Content and distribution of the identified genes encoding Argonaute, Dicer, and RdRP In order to predict putative genes, 1,440 genomes were searched using protein domain profiles (Figure 1; see Methods for details). 1,163 Argonaute-encoding genes, 442 Dicer-encoding genes, and 678 RdRP-encoding genes were predicted (Additional file 1). In order to evaluate the accuracy of the pipeline, a test set was prepared by retrieving sequences annotated as Argonaute, Dicer, and RdRP from UniProtKB/TrEMBL [18]. Assuming that the annotation provided by UniProtKB/TrEMBL is correct, the funRNA pipeline correctly captured 93.50% of the test set.
This result supports the accuracy and robustness of the funRNA gene identification pipeline. According to the prediction results, the average numbers of genes encoding Figure 1 Identification pipeline for funRNA. The identification pipeline for funRNA consists of two steps: i) defining domain profiles from protein sequences encoded by the reference sequences; and ii) scanning 1,440 proteomes with domain profiles for Argonaute, Dicer, and RdRP. In "Domain analysis", colored boxes indicate essential domains: blue, IPR003100 (Argonaute/Dicer protein, PAZ); red, IPR003165 (Stem cell self-renewal protein Piwi); purple, IPR005034 (Dicer double-stranded RNA-binding fold); green, IPR000999 (Ribonuclease III); orange, IPR001159 (Double-stranded RNA-binding); and gray, IPR007855 (RNA-dependent RNA polymerase, eukaryotic type). In addition, sequences collected from funRNA can be subjected to bioinformatics analysis on the funRNA website as well as in CFGP 2.0 by data exchange through the Favorite Browser.
Argonaute and Dicer were significantly higher in Metazoa and Viridiplantae than in Fungi (t-test: P ≤ 2.53e -6 for Argonaute and P ≤ 7.38e -5 for Dicer). In the case of RdRP, the Viridiplantae kingdom presented the largest average number of genes (5.09), followed by Fungi (2.04) and Metazoa (1.27) (t-test: P ≤ 2.45e -4 ). No genes encoding Argonaute, Dicer, or RdRP were detected in 1,059 and 51 proteomes of bacteria and archaea, respectively. However, two archaeal species (Methanocaldococcus jannaschii DSM 2661 and Pyrococcus furiosus DSM 3638) and one bacterial species (Aquifex aeolicus VF5) were predicted to have an Argonaute-encoding gene. In fact, PfAgo, an Argonaute found in P. furiosus DSM 3638, has been structurally characterized by X-ray crystallography [17], and was correctly captured by the pipeline.
In fungi, species belonging to the subphylum Agaricomycotina showed a higher number of genes in all gene families than any other fungal subphylum, with 5.68 Argonaute, 2.46 Dicer, and 6.93 RdRP genes on average (Figure 2 andTable 1). Putative genes were not predicted in the species belonging to the subphylum Ustilaginomycotina, in agreement with previous reports [19,20]. In the phylum Ascomycota, the majority of genes were found in species belonging to the subphylum Pezizomycotina. Species belonging to the subphylum Saccharomycotina, including five genomes of Candida spp., had only a few genes encoding Argonaute and no genes for Dicer and RdRP. Although a recent paper reported the presence of RNAi in Saccharomyces castellii and C. albicans, these species use non-canonical Dicers to generate small interfering RNAs [21]. Meanwhile, Schizosaccharomyces spp., belong to the subphylum Taphrinomycotina, had one gene for each of three gene families (Additional file 1).
In Plasmodium spp., no genes were predicted to encode the key players of RNAi. This is in accordance with previous studies performed with protozoan parasites including Trypanosoma cruzi, Leishmania major, L. donovani, and Plasmodium spp. [22,23]. The only exception is the RNA-mediated gene silencing in P. falciparum [24,25]. Accordingly, it was speculated that Plasmodium spp. might possess a non-canonical RNAi pathway that is yet to be characterized [26].
While the three genes were frequently predicted in two fungal subphyla, Pezizomycotina and Agaricomycotina, the number of genes in the latter subphylum varied more than that in the former. The greatest variance was found in the genes encoding RdRP. The number of RdRP genes in the Pezizomycotina species ranged from two to five, with a standard deviation of 0.64; for Agaricomycotina the range was zero to 14, with a standard deviation of 3.63. Interestingly, no putative genes encoding RdRP were found in two metazoan phyla, Arthropoda (except for in Ixodes scapularis, the blacklegged tick) and Chordata, to which fruit flies and humans belong, respectively. Considering the possibility that virus-encoded RdRPs may play a role in RNAi-like antiviral activity in plants [27], we speculated that the same could be happening in Drosophila spp. and mammals. This was supported by the fact that mouse oocytes with a horizontally transferred RdRP from a virus exhibited RNAi [28]. By contrast, worm species belonging to the phylum Nematoda had multiple genes with copy numbers ranging from three to eight. Meanwhile, in the Viridiplantae kingdom, a clear distinction in the number of genes was found between the Chlorophyta and Streptophyta phyla. Streptophyta species had higher average numbers of the three genes (13.55 for Argonaute, 4.86 for Dicer, and 7.27 for RdRP) than Chlorophyta (0.90, 0.40, and 0.30, respectively) (t-test: P = 1.60e -10 , 2.13e -08 , and 1.69e -10 , respectively). In Chlamydomonas reinhardtii, which belongs to the Chlorophyta, it was presumed that the absence of an RdRP gene is a reflection of its minimalistic genomic nature, and that it thus only exhibits the essential RNAi phenomenon [29]. It was also speculated that C. reinhardtii only recognizes dsRNA to trigger RNAi, since the transformation of single-stranded RNA favors non-homologous recombination [30]. Notably, Lotus japonicus was predicted to have no genes encoding Dicer. In fact, two genes encoding Dicer-like proteins were predicted to have RNase III and dsRNA-binding domains. They had only one RNase III domain, while canonical Dicers are known to contain two separate RNase III domains. Because experimental evidence shows that functional RNAi is present in L. japonicus [31,32], the two genes may encode real Dicers with a simpler domain structure. In addition, monocot plants tend to have more Argonaute genes than dicots. Actually, orthologs of Dicerlike genes in Oryza sativa were found in other monocot plants, such as Zea mays and Saccharum officinarum, but not in Arabidopsis thaliana, supporting gene duplication after monocot and dicot divergence [33].

Evolutionary history of gene duplication/loss and active sites residues in Argonautes
The identification results showed that Argonaute-encoding genes were found in many fungal species in the subphyla Pezizomycotina and Agaricomycotina, as well as in plants and animals (Additional file 1). In order to elucidate evolutionary footprints, reconciliation analysis was performed with a species tree and an Argonaute gene tree. A total of 34 species predicted to have Argonaute-encoding genes were subjected to the analysis, which covered species belonging to multiple kingdoms, including Fungi, Viridiplantae, Metazoa, Bacteria, and Archaea (Table 2). Massive gene duplication events were found in the animal and plant species. In fungi, however, duplications were found only in basidiomycetes and two ascomycetes.
Argonaute sequences found in Homo sapiens, A. thaliana, D. melanogaster, S. pombe, and C. elegans were analyzed for catalytic residues, or the DDH motif [36]. In addition, functional variants of the DDH motif were experimentally identified, giving a relaxed motif definition, DD[HDEK] [34,37]. Archaea, as the most divergent species evolutionarily, have a totally different composition of active site amino acids compared to other species (Figure 3). The DD [HDEK] motif was found in Argonaute sequences from a bacterium (Aquifex aeolicus), fungi, plants, and animals. In fungi, the first two residues were well conserved. The third residue was variable, but was predominantly aspartic acid ( Figure 3 and Additional file 2). Dicers in C. elegans, H. sapiens, and O. sativa have been through more specieslevel duplications, which possibly resulted in greater residue variation in the catalytic motif. The DDH triad was the most frequent motif in Argonautes of animals and plants; most fungus Argonautes had a DD[DEK] motif. "The Others", sequences showing aligned residues other than DD[HDEK], were much more common in animal species, especially C. elegans, possibly due to gene diversification resulting from a number of species-level duplication events.

Differential distribution of accessory domains in putative Dicers
In plants, the copy numbers of genes encoding Dicer-like proteins increased during their evolution, with diversification occurring by duplication and transposition [38]. In fact, our data also show diversification of Dicer-like genes  in plants, including Glycine max, Z. mays, O. sativa, Brachypodium distachyon, and A. thaliana. Diversification of Dicer often presented as alternatively spliced genes that produce multiple products (Additional file 3). Interestingly, this was true for species belonging to the phylum Streptophyta, but not for Chlorophyta. Only 3 of 10 Chlorophyta species were predicted to have one or two Dicerencoding genes, while the other seven were not predicted to have any Dicer genes (Additional file 1). This confirms the previous finding that diversification of Dicers in plants occurred before the divergence of monocot and dicot plants, but after the divergence of green algae and plants [38]. Even though a few fungal genomes provided alternatively spliced transcript information, fungal Dicers showed no evidence of diversification by alternative splicing. In N. crassa, Magnaporthe oryzae 70-15, and Cryptococcus neoformans var. grubii H99, for example, no alternatively spliced form was found for the putative Dicer-encoding genes.
Besides the essential RNase III and dsRNA-binding domains (Table 3), the other 50 additional domains were not universally present in the Dicers in plants, animals, and fungi (Additional file 4). On average, each had 5.28 additional domains other than the essential ones. The genes of the Viridiplantae species showed the highest number of additional domains (6.41), followed by those of Metazoa (5.12) and Fungi (4.87). The top six most frequent additional domains were IPR001650 (Helicase, Cterminal), IPR014001 (DEAD-like helicase), IPR011545 (DNA/RNA helicase, DEAD/DEAH box type, N-terminal), IPR003100 (Argonaute/Dicer protein, PAZ), IPR006935 (UvrABC complex, subunit B), and IPR014720 (Double-  [39][40][41], its function is not clear, although it has been speculated that it may mediate the formation of complexes between proteins of the Piwi and Dicer families by heterodimerization [42]. Future research may focus on the functionality of Dicers without the PAZ domain to demonstrate the essentiality of the domain in fungi. Differences in domain composition were also reflected in a phylogenetic tree that was constructed using the 442 Dicer sequences (Additional file 3). It is noteworthy that the tree was taxonomically divided into four clades: two Metazoa-dominant clades, one plant-dominant clade, and one fungus-dominant clade. In plant species, isoform products were grouped together closely, supporting the diversification reported previously [38]. Interestingly, the putative Dicers from metazoan species formed two distinct clades, one containing minimal domains and the other containing multiple additional domains (Additional files 3 Figure 3 Duplication and loss of Argonaute genes and variation of the catalytic motif. Gene duplication and loss events were estimated by reconciliation analysis. Red and blues dots are shown at internal nodes if duplication and loss were predicted, respectively. Black dots indicate nodes where both duplication and loss were discovered. Numbers of species-level duplication and loss events, and the number of putative genes encoding Argonaute, are shown between the tree and species name. In the rightmost column, amino acid variation of the DDH motif is shown with symbols: i) filled squares indicate that all the genes in the corresponding species had the conserved reference residue; ii) shaded squares indicate existence of the conserved residue and variants; iii) empty squares indicate variants without the conserved residue; and iv) single-letter amino acid codes indicate conserved residues, but not the reference amino acid. For the complete list of partial alignments near each amino acid, see Additional file 2. Pie charts shown in the internal nodes display the distribution of DDH motif variants for each taxon specified. The proportion of genes containing the conserved DDH motif is shown in green; H substituted by D, E, or K is shown in red; H substituted by another amino acid is shown in blue; and other variants are shown in orange. and 4). The two Metazoa-dominant clades suggest that minimal Dicer could be the ancestral form, which acquired additional domains during the evolution of individual organisms. Most of the fungal and plant Dicers possessed multiple additional domains.

Structural conservation analysis of residues in catalytic regions of RdRPs in fungi
In N. crassa, a 2.3-Å-resolution crystal structure of an RdRP (QDE-1) was characterized [43]. QDE-1 was structurally aligned with the protein sequences of bacterial and yeast polymerases. Structurally conserved catalytic motifs, including double-psi β-barrels (DPBB1 and 2), with multiple invariant residues were found. To test the conservation of such amino acid residues, a multiple sequence alignment was performed with 84 putative RdRP sequences from selected fungal proteomes (Table 2), including QDE-1. When counting the residues with 70% or higher conservation, 13 and 22 residues were found to be conserved in DPBB1 and DPBB2 based on the positions in QDE-1, respectively. Some were also reported to be conserved in a previous study. For example, three aspartic acid residues (D) located at positions 1,007, 1,009, and 1,011 in QDE-1 were conserved in 84.52% of the sequences analyzed. The high conservation of these three aspartic acid residues reflects their importance in binding Mg 2+ as a cofactor. Double glycine (G at positions 1,005 and 1,006 in QDE-1) was found in 83.33% of analyzed sequences, although not in bacterial and yeast polymerases [43] (Additional file 5).

Web utility
To provide a user-friendly interface, funRNA adopted the Data-driven User Interface powered by the Comparative Fungal Genomics Platform 2.0 (CFGP 2.0; http://cfgp.snu. ac.kr/) [44]. The genes identified by the pipeline can be browsed by species or gene family. The reference sequences used in pipeline construction are also available. The detail information page for each gene shows the gene structure, sequence information, domain structure, GO terms, information on similarity to the reference sequences, and results from seven additional bioinformatics programs. The statistics page of "Species Browser" provides a kingdom-/subphylum-level summary, giving a glimpse into the macro-taxonomic distribution. funRNA also provides analysis functions, including: i) sequence similarity searches (BLAST [45] and BLASTMatrix [44]); ii) multiple sequence alignment (ClustalW [46]) with fulllength or domain regions; and iii) protein domain analysis and download functions (Figure 4).
Extended analyses through sister web-based systems funRNA supports "Favorite Browser", a personalized virtual storage and analysis hub that was originally developed in CFGP 2.0 [44]. Sequences archived in funRNA have the same identifiers as those used in CFGP 2.0, enabling flexible data exchange with CFGP 2.0, as well as with sister databases [47][48][49][50]. In Favorite Browser in CFGP 2.0, 27 bioinformatics tools are currently available, providing broader analysis options for the sequences collected in funRNA. For example, a sequence collection created in funRNA could be analyzed in Favorite Browser in CFGP 2.0 to find conserved motifs by using the MEME program [51].

Conclusions
funRNA is a web-based workbench that provides an analysis environment for the key components of RNAi. funRNA provides: i) a putative gene archive from 1,440 proteomes over a wide taxonomic range; ii) graphs and summary tables for an overview of the target gene families; iii) detailed information about the predicted genes; iv) job submission forms for bioinformatics tools; and v) Favorite synchronization with CFGP 2.0 and sister databases to offer further analysis. In addition, diverse comparative analyses can be conducted, such as the analyses shown in this paper. In summary, funRNA is a useful resource for comparative and evolutionary genomics analyses of Argonaute, Dicer, and RdRP genes.

Establishment of protein domain profiles
In order to determine protein domain profiles for the genes encoding Argonaute, Dicer, and RdRP, annotated protein sequences were retrieved from the UniProtKB/ SwissProt database [18]. In total, 50, 44, and 13 sequences belonging to the respective gene families were subjected to domain profiling using InterPro scan (version 4.8) [52]. For each gene family, commonly shared domains were determined and used for prediction of putative genes ( Table 3). The domain profiles acquired from the reference sequences were consistent with previous findings [53,54]. According to previous studies, Argonautes contain PIWI and PAZ domains [36], and fungal Dicer and Dicerlike proteins have RNase III and double-stranded RNA binding domains [53]. For Dicer identification, sequences with only one RNase III domain were discarded from the final prediction to improve the results. All sequences used to construct the pipeline are available from "Reference Sequences" under the "Browse Data" menu.

Preparation of proteome sequences to be searched
To identify genes involved in small RNA processing, 1,440 proteomes were scanned with the protein domain profiles (Additional file 1). The target proteomes included 221 fungal/Oomycete genomes, as well as 1,060 bacterial, 53 archaeal, 32 plant, and 41 metazoan proteomes to investigate evolutionary traces in other kingdoms (Additional file 1). All the proteome sequences were obtained from the standardized genome data warehouse in CFGP 2.0 http:// cfgp.snu.ac.kr/ [44].

Evaluation of the pipeline
To evaluate the gene identification pipeline, sequences annotated as Argonaute, Dicer, and RdRP were obtained from UniProtKB/TrEMBL [18]. They did not include sequences used in domain profile determination. The test set consisted of 425 Argonaute, 209 Dicer, and 227 RdRP protein sequences. The sequences were scanned using InterPro scan [52], and searched with the funRNA domain profiles to assess the accuracy. Considering the average length of the sequences used in defining the domain profiles (942 aa for Argonaute, 1,575 aa for Dicer, and 1,081 aa for RdRP), sequences shorter than 500 aa were discarded from the test set.

Assessment of gene duplication and loss
A species phylogeny was created by using CVTree (version 4.2.1; source code distribution) [55]. Whole proteome sequences of the target species were used as the input, and K-tuple length was set to seven, which is known to be optimal for fungal phylogeny construction [56,57]. The output distance matrix was converted into a neighborjoining tree by using neighbor in the PHYLIP package (version 3.69) [58]. To build gene trees, multiple sequence alignment was performed by using MUSCLE in MEGA6 [59]. Subsequently, a phylogenetic tree was constructed with the Minimum Evolution algorithm by using MEGA6 [59]. To investigate gene duplication and loss events, reconciliation analysis was conducted by using Notung software (version 2.6) [60] with the species and gene trees. A total of 34 genomes were subjected to the reconciliation analysis, comprising 25 fungi, one Oomycete, one bacterium, two archaea, two plants, and three animals. Non-fungal species were also included to better understand the evolutionary history (Table 2).
Multiple sequence alignment, phylogenetic tree construction, and visualization of conserved sequences Full-length protein sequences of the 442 predicted Dicers were aligned using ClustalW [46]. A phylogenetic tree was constructed using MEGA6 [59] by the Minimum Evolution method with 10,000 bootstrap replicates. In order to detect conservation of amino acid residues, 84 RdRP sequences were aligned using M-Coffee [61]. One putative RdRP-encoding gene, FOXG_00217 in Fusarium oxysporum, was excluded from the analysis because of its very short domain region (68 aa). Sequence logos were created by using WebLogo [62]. multiple sequence alignment of 84 sequences including an RdRP from N. crassa (QDE-1). Amino acid residues with 70% or more conservation are highlighted in red.