RNA-seq highlights parallel and contrasting patterns in the evolution of the nuclear genome of holo-mycoheterotrophic plants

• While photosynthesis is the most notable trait of plants, several lineages of plants (so-called holo-heterotrophs) have adapted to obtain organic compounds from other sources. The switch to heterotrophy leads to profound changes at the morphological, physiological and genomic levels. • Here, we characterize the transcriptomes of three species representing two lineages of mycoheterotrophic plants: orchids (Epipogium aphyllum and Epipogium roseum) and Ericaceae (Hypopitys monotropa). Comparative analysis is used to highlight the parallelism between distantly related holo-heterotrophic plants. • In both lineages, we observed genome-wide elimination of nuclear genes that encode proteins related to photosynthesis, while systems associated with protein import to plastids as well as plastid transcription and translation remain active. Genes encoding components of plastid ribosomes that have been lost from the plastid genomes have not been transferred to the nuclear genomes; instead, some of the encoded proteins have been substituted by homologs. The nuclear genes of both Epipogium species accumulated mutations twice as rapidly as their photosynthetic relatives; in contrast, no increase in the substitution rate was observed in H.monotropa. • Holo-heterotrophy leads to profound changes in nuclear gene content. The observed increase in the rate of nucleotide substitutions is lineage specific, rather than a universal phenomenon among non-photosynthetic plants.


Introduction
The capability for photosynthesis is the iconic trait of plants and is of the highest importance to the biosphere.
However, some plants, including several thousands of flowering plant species, obtain organic substances from sources other than photosynthesis (Merckx et al., 2009;Westwood et al., 2010). These plants acquire organic compounds either from associated fungi (mycoheterotrophy) or by parasitizing other plants. Most of these species combine photosynthesis and heterotrophy, but several hundred species have totally lost photosynthetic ability and become completely heterotrophic (holoheterotrophs). The acquisition of heterotrophic ability has occurred in the evolutionary history of plants more than 50 times (Merckx et al., 2009;Westwood et al., 2010). The switch to heterotrophy leads to profound changes at the phenotypic level (reduction of leaves, loss of green colour, reduction of the vegetation period) that are highly parallel in different lineages. The genotypic alterations that underlie these changes are for the most part unclear. The Genetic and genomic studies of heterotrophic plants are currently focused on two aspects. The first is the interaction of parasitic plants with their hosts and their adaptations to parasitism (e.g., Yang et al., 2015).
Extensive exchange of transcripts occurs between hosts and parasites (Kim et al., 2014). On an evolutionary scale, a large number of horizontal gene transfer (HGT) events from hosts to parasites have been found in the organellar and nuclear genomes of parasitic plants (Bellot et al., 2016;. A recent large-scale survey of HGT in Orobanchaceae showed that the number of these events correlates positively with the degree of heterotrophy (Yang et al., 2016). The second aspect is the evolution of organellar (mainly plastid) genomes. As expected, the plastomes of heterotrophic plants are reduced in size due to the loss of genes related to photosynthesis, and may retain only ~ 7.5% of the length of a typical plastome (Bellot & Renner, 2015). Despite the high degree of reduction, the plastomes of non-photosynthetic plants retain genes whose products are involved in translation, specifically transfer RNAs, components of the plastid ribosome and two other genes, accD and clpP (e.g., Wicke et al., 2013;Barrett et al., 2014;Schelkunov et al., 2015;Bellot & Renner, 2015;Lam et al., 2016). Although there are several considerations regarding the retention of the plastome, complete loss of the plastome is also apparently possible. At present, there are two known cases of such loss, one in Polytomella, a genus of unicellular algae (Smith & Lee, 2014), and one in the parasitic angiosperm Rafflesia lagascae (Molina et al., 2014), although alternative explanations are possible in the latter case.
Much less information on the mitochondrial genomes of non-photosynthetic plants is available, although there are indications that these genomes are not as extensively reduced in size (Fan et al., 2016).
The changes in the nuclear genomes of nonphotosynthetic plants have not been well studied. To date, the only work to deeply analyse the nuclear genomes of holo-heterotrophic plants was performed in Orobanchaceae, where a holo-heterotrophic species, Orobanche aegyptiaca, was compared with two of its relatives, one hemi-autotroph with obligatory parasitism and one hemi-autotroph with facultative parasitism (Wickett et al., 2011). Surprisingly, the authors found evidence for conservation of the pathways responsible for chlorophyll synthesis. Additionally, in one study of the Hypopitys monotropa plastome, transcriptome analysis showed that many genes related to photosynthesis have been lost (Ravin et al., 2016).
To obtain a more detailed understanding of the evolution of holo-heterotrophic plants, in this work, we analyse the nuclear genomes of Epipogium aphyllum and E.roseum (Orchidaceae) and H.monotropa (Ericaceae) 3 using transcriptome sequencing. These plants are good models for studying the characteristics of holoheterotrophic plants since their plastomes are among the most reduced in size (Schelkunov et al., 2015;Logacheva et al., 2016)  Transcriptome assembly and postprocessing Reads were trimmed with Trimmomatic 0.32 (Bolger et al., 2014). Trimming involved the removal of sequencing adapters, bases with a Phred quality of less than 3 from both the 5' and 3' ends of reads, and bases from the 5' ends of reads starting from a region with 5 consecutive bases with an average score of less than 10 (trimming with a sliding window). Additionally, reads that had average quality of less than 20 after trimming and reads that after trimming became shorter than 30 bp were removed.
Assembly was performed using Trinity version r20140717 (Grabherr et al., 2011). Digital normalization of coverage to 50× was switched on. After assembly, we performed filtration, removing minor isoforms, low-coverage transcripts and contamination. We defined major isoforms as the isoforms to which the highest number of reads (estimated using RSEM (Li & Dewey, 2011)) were mapped relative to other isoforms. After removing minor isoforms, we filtered low-coverage transcripts by mapping reads to contigs using CLC Assembly Cell 4.2 and retaining only transcripts with an average coverage of at least 3. Potential CDSs in the transcripts were then determined using TransDecoder version r20140704 (Haas et al., 2013 Table S3 are the sums of all gene isoforms.

Transcriptome annotation and Gene Ontology analysis
Three types of annotations of the CDSs were performed.
The first was the computation of 1-1 orthologs from the CDSs of each species by reciprocal alignment with Arabidopsis thaliana proteins from TAIR10 database (Berardini et al., 2015) using BLASTP (parameters are the same as indicated above). Also we performed GO annotation using results of BLASTP alignment of all proteins against the NCBI NR (parameters as above) and annotated them with B2G4Pipe 2.5, a command-line wrapper for Blast2GO (Conesa et al., 2005). Then we assigned KEGG Orthology identifiers to the CDSs with the GhostKoala server (Kanehisa et al., 2016b). The orthogroups in the studied species were calculated separately for Orchidaceae and Ericales using OrthoMCL 2.09 (Li et al., 2003) (Yekutieli & Benjamini, 2001).
Since our goal was the analysis of nuclear genes, we excluded genes encoded in the plastid and mitochondrial genomes from the enrichment analysis by excluding all genes that were 1-1 orthologs of plastid and mitochondrial genes of Arabidopsis thaliana. Some plastid and mitochondrial transcripts may not have been present in the assemblies. To compensate for this, in Table S3, we indicated a plastid gene as present in the plastome of a studied species not only when it was found through the 1-1 ortholog method, but also when it was present in the annotation of the plastome of that species provided in GenBank (accessions are provided in

Analysis of protein targeting to organelles
To estimate the number of proteins targeted to various organelles, we predicted transit peptides using TargetP 1.1 (Emanuelsson et al., 2000) and DualPred (Saravanan & Velan Lakshmi, 2015). Unlike TargetP, DualPred is also capable of predicting proteins that are dually targeted to mitochondria and plastids. TargetP classifies predictions into five "reliability classes", where the most confident predictions belong to the first class, and the least confident predictions belong to the fifth class. We considered a protein to be potentially targeted to an organelle if its transit peptide exhibited a reliability class of four or less.
For DualPred, we considered a protein to be dually targeted to plastids and mitochondria if it presented a dualtargeting score of at least 0.5, as suggested by the author of DualPred in a personal communication. Prediction of transit peptides was performed only for proteins whose genes exhibited a completely assembled 5'-end according to TransDecoder and were assigned at least one GO term.
To predict the targeting of ribosomal proteins, we 6 utilized a more elaborate technique. It has been demonstrated that some proteins are dually targeted to plastids and mitochondria not because they have one transit peptide that allows them to enter both organelles, but because their mRNAs exhibit alternative translation start sites, resulting in proteins with different transit peptides (Mitschke et al., 2009 (Finn et al., 2016), using the HMMER server (Finn et al., 2015). To identify proteins that belonged to these families in E.aphyllum, E.roseum and H.monotropa, we conducted a search with hidden Markov models of these families using the hmmscan tool from HMMER package, version 3.1b1 (Eddy, 1998 (Rasmussen, 1995). Despite the fact that we sequenced RNA from the above-ground parts of the plants, we observed a large number of bacterial and fungal transcripts in the assemblies (Fig. S1) Estimation of the completeness of the assemblies based on the set of genes that are expected to be present in all eukaryotic genomes (Parra et al., 2007) showed that > 90% of these genes were at least partially assembled (Table S5).
Notably, the genomes were assembled less completely than the transcriptomes on average, with a median of 95.6% of the genes being assembled at least partially in the transcriptome assemblies, compared with 90.8% of the genes in the genome assemblies. These results show that, given sufficient coverage, RNA-seq is as good as complete genome sequencing in terms of the number of retrieved genes, while being less costly and using an assembly process that is computationally faster.

Gene retention and reduction
Plastid-targeted proteins carry specific amino acid sequences known as targeting signals or transit peptides that interact with the translocon system and enable the import of these proteins into plastids. interpret (e.g., "Single-organism metabolic process" and "Membrane"). The most underrepresented GO terms are listed in Table 2; for a complete list, see Table S3. GO terms that are similar to each other, such as "plastid thylakoid lumen" and "chloroplast thylakoid lumen", are combined here. *-The fraction of genes with a specific GO term is the number of genes with that GO term in a species divided by the total number of genes with GO terms in that species. The median ratio of fractions is a measure of the difference in the numbers of genes with a specific GO term between two groups of species, calculated as a median value among all pairwise comparisons in which the first member in a pair comes from the first group, and the second member of the pair comes from the second group.

GO enrichment analyses between H.monotropa and
three photosynthetic Ericales showed similar results; 17 GO terms were overrepresented, and 9 are underrepresented. The overrepresented terms were related to mobile elements, and the underrepresented terms were related to photosynthesis and plastids (Table 2; Table S3).
Notably, we did not observe changes in the list of GO terms other than related to photosynthesis and plastids, despite dramatic differences in the morphology and physiology of the studied species. This finding indicates that these differences are controlled not at the level of the presence or absence of specific genes, but rather by the regulation of gene expression. To gain insight into this regulation, more detailed transcriptome data are required.
Alternatively, these morphological and physiological changes may have originated from the disruption or loss of a small number of genes which do not produce statistically significant results in GO enrichment analysis.
The statistical analysis of GO terms is quite rough method and may not reflect differences at the level of individual genes. Thus, we also searched for orthologs of genes that are known to participate in processes occurring in the plastids in the model plant Arabidopsis thaliana.  (Shiina et al., 2005). The redox-sensitive components of the plastid translocon, which is the system responsible for the import of proteins from the cytoplasm to plastids, have also been lost, with the exception of TIC32 of Hypopitys.  (Table   3). In particular, components of clp-protease and acetyl-CoA carboxylase whose counterparts (clpP and accD) are encoded in the plastome have been retained. Consistent with this finding, NEP, which transcribes genes not related to photosynthesis, has been retained in both species, as have most components of the plastid ribosome (but see the discussion below). We also found transcripts of most proteins responsible for the replication and repair of the plastome. However, this situation is more complex because many of these proteins are targeted not only to plastids but also to mitochondria and the nucleus, and information on these proteins is sometimes inconsistent between different experiments (Tanz et al., 2013;Huang et al., 2013).
As shown above, the genomes of Epipogium and H.monotropa encode a number of proteins that must be imported into plastids. Accordingly, the components of the plastid translocons for both the outer-and innerenvelope membranes must be retained. Recent studies in A. thaliana have shown that the plastid-encoded protein ycf1 (TIC214) plays an essential role in plastid translocation and that it acts at the inner membrane (Kikuchi et al., 2013). However, the ycf1 gene is absent from both the Epipogium and H.monotropa plastomes. A transcript similar to ycf1 was only found in the transcriptome assembly of E.aphyllum, in which it carried a signal for targeting of the protein to mitochondria.
Homologs of TIC100 and TIC56, which encode proteins that interact with Ycf1 within the TIC complex, were also absent from all three species. It should be noted that Ycf1 and its interacting proteins are also absent from several photosynthetic species, including grasses and Vaccinium macrocarpon, which raises a question regarding the universality of the function of Ycf1 (de Vries et al., 2015).
The current model of TIC postulates the existence of two complexes. The first, referred to as the "photosynthetictype" complex, consists of Ycf1, TIC56, TIC100 and TIC20-I and is a major TIC complex that functions in most land plants to import proteins involved in photosynthesis.
The second complex is a "non-photosynthetic-type" complex, which imports proteins that are not related to photosynthesis (Nakai, 2015b). It has been hypothesized that the switch from the major to the alternative system of protein import occurred in grasses (Nakai, 2015b;Nakai, 2015a) and that the major system then degraded. The pattern of gene loss and retention observed in Epipogium and H.monotropa suggests that this could also be the case in these species. The existence of two complexes, one of which mainly imports photosynthetic proteins, while the other is non-photosynthetic, has also been postulated for the outer translocon TOC (Nakai, 2015b;Nakai, 2015a).

Epipogium and H.monotropa possess orthologs of TOC
proteins, but not a complete set ( Table 2), suggesting that only one of these complexes is retained in these plants. only 1 showed increased dN/dS with a significant p-value (< 0.05 by the likelihood ratio test with Bonferroni correction). The mean dN/dS for these 5 genes in Epipogium was only slightly higher than the mean in its photosynthetic relatives, at 0.11 versus 0.05, respectively (Table S7). However, all of the genes in this pathway were expressed at levels many times lower than in photosynthetic species (Table S7), suggesting that it could be not functional being the remnant of one that was active in a photosynthetic ancestor of Epipogium. Alternatively, chlorophyll a could indeed be synthesized in these species, where it may act in cellular processes that are unrelated to photosynthesis. Chlorophyll a has been found in many heterotrophic plants via chromatography (Cummings & Welschmeyer, 1998) (notably, among these species is Monotropa uniflora, a close relative of H.monotropa); transcriptome sequencing in parasitic Orobanche aegyptiaca has also demonstrated the presence of genes responsible for chlorophyll a synthesis, with no signs of relaxed selection (Wickett et al., 2011). It has been shown that in A. thaliana, pheophorbide a, a product of chlorophyll a catabolism, causes cell death in a lightindependent manner (Tanaka & Tanaka, 2006;Hirashima et al., 2009). Conservation of this mechanism in Arabidopsis and non-photosynthetic plants could explain the conservation of the chlorophyll a synthesis pathway. Despite the differences discussed above, Epipogium and H.monotropa show striking parallelism in the reduction of their nuclear genes, considering that these plants lost autotrophic ability independently and that they diverged approximately 150 million years ago (Hedges et al., 2015).

Mutation accumulation rates
Variation in mutation accumulation rates is a phenomenon that is widely observed in plants (e.g., Bromham et al., 2015). In particular, it has been found that in parasitic plants, the mutation accumulation rate is increased not only in plastome, as expected for heterotrophic plants, but also in nuclear and mitochondrial genomes (Bromham et al., 2013). Concerning mycoheterotrophic plants, information on mutation rates is mostly limited to plastomes, which show increased substitution rates with only rare exceptions (Logacheva et al., 2014;Barrett et al., 2014). Our previous studies of Epipogium and H.monotropa indicated that the mutation accumulation rate in plastid genes, at both synonymous and in nonsynonymous sites, was increased in comparison to the holo-autotrophic relatives of these species, but at different levels (approximately 20 and 2.5 times respectively) (Schelkunov et al., 2015;Logacheva et al., 2016).
Characterization of transcriptome sequences allowed us to test whether this increase was confined to plastid genes.
To calculate the mutation accumulation rate in the nuclear genomes, we used concatenated sequences of genes from orthogroups containing exactly one gene in each species.
There were 4479 and 2547 of these orthogroups in Orchidaceae and Ericales, respectively. The mutation accumulation rates in both Epipogium species were approximately two times higher than the mutation rates in their photosynthetic relatives. The rates of nonsynonymous and synonymous mutation accumulation in Epipogium were increased proportionally (Figs. S6-S8). In contrast, the mutation accumulation rate in H.monotropa was not increased (Fig. 2).  the products of such nuclear copies are targeted to plastids and function as a part of the plastid ribosome, while the corresponding gene having been lost from the plastome (Ueda et al., 2007;Jansen et al., 2011;Park et al., 2015).
Additionally, components of mitochondrial ribosomes can be dually targeted to both plastids and mitochondria (Kubo & Arimura, 2010;Ueda et al., 2008). In E.aphyllum (Schelkunov et al., 2015), 7 ribosomal protein genes (rpl20, rpl22, rpl23, rpl32, rpl33, rps15, rps16) have been lost from the plastome; in E.roseum 6 of these 7 genes, but not rpl20, have been lost. Regarding H.monotropa, we previously considered rps19 and rpl22 to be pseudogenes (Logacheva et al., 2016) due to the presence of a 111-bp insertion in the former and a nonsense mutation that shortens the length of the product by 17% in the latter.
However, because these genes are transcribed and exhibit dN/dS values close to those of the species' photosynthetic relatives (  Table S8).
To determine whether the genes whose products may serve to replace the lost ribosomal proteins are encoded in mitochondrial or nuclear genomes, we used TBLASTN to align the proteins against contigs produced during the assembly of the plastomes of E.aphyllum and H.monotropa in our earlier studies (Schelkunov et al., 2015, Logacheva et al., 2016. We did not observe the sequences of those genes in the mitochondrial contigs and therefore conclude that they are located in the nuclear genomes.

Conclusions
In this study, we analysed and compared the transcriptomes of the mycoheterotrophic plants Epipogium aphyllum, E.roseum and Hypopitys monotropa.
Despite the fact that Epipogium and H.monotropa are very distantly related, belonging to the Monocots and Dicots respectively, and that these species lost photosynthesis independently, we observed a remarkable level of parallelism involving the reduction and retention of similar functional groups of genes. Among the observed differences were a more profound reduction in the chlorophyll a synthesis pathway in H.monotropa and an increased rate of mutation accumulation in Epipogium.
Overall, since there are several hundred holoheterotrophic species of flowering plants (Merckx et al., 2009), with many cases of independent transitions to holo-heterotrophic lifestyle, it is necessary to sequence and analyse more holo-heterotrophic species in addition to Epipogium, H.monotropa and Orobanche aegyptiaca to predict whether this parallelism is universal. Significant help in this determination may be provided by the "1000 Plants Project" (Matasci et al., 2014), in which the sequencing of many holo-and hemi-heterotrophic species is being performed. The question that remains unanswered in this study is the mode of gene loss -did it occur through deletions of large regions carrying photosynthesis-related genes, or through the accumulation of mutations in the protein-coding and regulatory elements of these genes?
We expect that characterization of the nuclear genomes of non-photosynthetic plants will fill this gap.
Online supporting material Fig. S1 Statistics regarding contamination in the studied transcriptomes. A total of 10,000 random transcripts (prior to the removal of low-coverage transcripts and searching for ORFs, but after the removal of minor isoforms) were taken from each assembly, and BLASTX alignment to NCBI NR (maximum allowed e-value of 10 -5 , word size of 3 amino acids, low-complexity sequence filter switched off) was performed. The transcripts were classified according to their best matches. In the distribution plots, the black lines denote median values, and the boxes denote interquartile ranges.         Epipogium, Hypopitys monotropa and their photosynthetic relatives * -"u" indicates underrepresentation; "o" indicates overrepresentation. Genes with a GO term are considered underrepresented in holo-heterotrophic species if their proportions relative to the total numbers of 25 genes with GO terms in those species are significantly lower than in photosynthetic species.
Overrepresentation corresponds to the opposite situation. Table S4 Detailed statistics of the transcriptome assemblies. * -In a dataset deposited by the authors of the Vaccinium macrocarpon genome assembly, several isoforms are provided for some genes. In these cases, we referred to the longest isoform as the "major" isoform, since information on the relative expression of isoforms was not supplied. Table S5 Analysis of the completeness of the assemblies. Table S6 Proportions of genes whose products are targeted to various organelles relative to the total numbers of genes in the species according to TargetP analysis. Only genes with complete 5' ends and at least one assigned GO term are considered. The quality of the different assemblies differs; thus, the number of genes with completely assembled 5' ends also differs. Therefore, the numbers of proteins with transit peptides cannot be directly compared, and only proportions are provided in the table. Table S7 Presence of genes of interest in the studied species and selective pressures acting on them.
Table S8 Results of a search for possible substitutes for ribosomal proteins whose genes have been lost from the plastomes of Epipogium and Hypopitys monotropa.