Characterisation of novel endogenous geminiviral elements in macadamia

The presence of geminivirus sequences in a preliminary analysis of sRNA sequences from the leaves of macadamia trees with abnormal vertical growth (AVG) syndrome was investigated. A locus of endogenous geminiviral elements (EGE) in the macadamia genome was analysed, and the sequences revealed a high level of deletions and/or partial integrations, thus rendering the EGE transcriptionally inactive. The replication-defective EGE in the macadamia genome is unable to be the source of new viral infections and thus to cause AVG or any other disease in macadamia. The EGE sequences were detected in two edible Macadamia species that constitute commercial cultivars and in the wild germplasm of edible and inedible species of Macadamia. This strongly suggests that the integration preceded speciation of the genus Macadamia. A draft genome of a locus of EGE in Macadamia was developed. The findings of this study provide evidence to suggest the endogenization of the geminiviral sequences in the macadamia genome and the ancestral relationship of EGE with Macadamia in the Proteaceae family. Random mutations accumulating in the EGE indicate that the sequence is evolving. The EGE in Macadamia is inactive and thus not a direct cause of any diseases or syndromes including AVG in macadamia. The insertion of the EGE in the macadamia genome preceded speciation of the genus Macadamia.


Introduction
Viral DNA that inserts in the nuclear genome of its host organism germline cells and therefore is transmitted between host generations like a normal cellular gene is referred to as an endogenous viral element (EVE). EVEs are very common in the plant kingdom, and often derive from DNA viruses in either the Caulimoviridae or Geminiviridae families [1,2]. As EVEs are essentially molecular fossils of viruses that may have existed millions of years ago, they provide valuable insights into virus evolution [3,4]. EVEs also likely play a major part in plant evolution by contributing novel protein-coding genes or exons [5,6], and may also help shape the transcriptome by being sources of novel promoter regulatory elements or non-coding small RNAs [7]. In some cases, the EVEs represent entire viral genomes, retain replication-competency and can be the origin of new infections [8][9][10][11].
Macadamia, an important nut-bearing tree that originates from Australia but is now commercially cultivated in subtropical regions throughout the world, is typical of many plants in having EVEs. At least three types of EVE are present in Macadamia integrifolia, two of which are derivatives of ancestral petu- and florendoviruses from the family Caulimoviridae [1], and a third that is representative of an unassigned genus in the family Geminiviridae [2]. All EVEs were identified using software pipelines to interrogate plant genome databases for the presence of highly conserved viral genes such as the replication-associated (Rep) protein of the Geminiviridae. The Rep gene in the macadamia genome is flanked by sequences encoding other conserved viral domains, which suggests that a complete or near complete viral genome had been integrated in the macadamia genome. In a phylogenetic analysis, Sharma et al. [2] placed the macadamia Rep gene sequence in a clade of endogenous geminiviral elements (EGEs) from Citrus, Coffea and Camellia spp., which together are basal to the genus Begomovirus. As there are currently no publicly available transcriptome databases for macadamia, Sharma et al. [2] were not able to judge whether the macadamia EGE was transcriptionally active using a bioinformatics approach alone.
In Australia, macadamia is afflicted by a disorder known as abnormal vertical growth (AVG) syndrome. The symptoms of AVG are vigorous upright growth, a reduction in the production of lateral shoots and far fewer flowers, leading to serious yield loss [12]. The cause of AVG has remained elusive despite at least two decades of research on the problem. Analyses of the spatiotemporal dynamics of the AVG epidemic over a 10-year period suggest slow spread over very short distances, consistent with a biotic cause for the disease [13]. It has been suggested that AVG is caused by a vascular-limited organism that has the capacity to modulate plant hormone metabolism [13].
In this study, we hypothesized that the EGE in macadamia is the cause of AVG. To test this hypothesis, a locus containing a putative full-length viral genome sequence was characterized and the replication-competency of this sequence examined by searching for mutations that could disrupt the open reading frames (ORFs). Other tests of replication competency were done by testing for RNA transcripts and examining whether single-stranded, circular DNA molecules that corresponded to the encapsidated form of the virus genome were present in the plant.
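The search for ORF-disrupting mutations described above can be illustrated with a minimal Python sketch. It is not the procedure used in this study: it assumes the standard genetic code, an ORF that is already in frame, and hypothetical function names (`translate`, `internal_stop_codons`), and it reports only premature stop codons, not frameshifts.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA sequence; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def internal_stop_codons(dna):
    """Return codon indices (0-based) of stop codons before the final codon."""
    protein = translate(dna)
    # A terminal stop codon is expected, so it is not counted as a disruption.
    body = protein[:-1] if protein.endswith("*") else protein
    return [index for index, aa in enumerate(body) if aa == "*"]


if __name__ == "__main__":
    # Toy sequences for illustration only, not taken from the macadamia EGE.
    print(internal_stop_codons("ATGAAAGGCTAA"))     # intact toy ORF
    print(internal_stop_codons("ATGAAATAGGGCTAA"))  # toy ORF with a premature stop
```

An ORF for which the returned list is non-empty cannot encode a full-length protein in that reading frame.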

Characterisation of EGEs in the genome of M. integrifolia
A tBLASTN search of the most recent chromosome scale assembly of the macadamia genome (NCBI BioProject: PRJNA593881), using the tomato leaf curl virus (TLCV) Rep protein sequence as the query, revealed highly significant matches (E values < 5 × 10⁻⁴⁰) to loci on four pseudochromosomes (4, 5, 9 and 13) and on two other orphan scaffolds (134 and 2940). Only three matches appeared to have full-length Rep gene sequences based on query coverage of greater than 90%. All coding sequences were interrupted by mutations that led to the insertion of at least one internal stop codon within the predicted ORF. Pseudochromosome 9 was of particular interest as it had a 3751 bp stretch of sequence that contained one complete and two partial Rep genes, suggesting a multimeric insertion of geminiviral sequence.
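The two thresholds mentioned above (E value below 5 × 10⁻⁴⁰ and query coverage above 90%) can be applied to tabular BLAST output with a short Python sketch. This is an illustration under stated assumptions, not the pipeline used here: it assumes the default 12-column tabular format of BLAST+ (`-outfmt 6`), computes coverage per alignment from the query start and end columns, and uses a hypothetical query length and made-up example rows.

```python
def full_length_hits(tabular_lines, query_length, max_evalue=5e-40, min_coverage=0.90):
    """Return (subject, coverage) for hits that pass both thresholds.

    Each line is assumed to follow the default BLAST+ tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    hits = []
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comment or malformed lines
        subject = fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        # Per-alignment query coverage (a simplification of BLAST's own value).
        coverage = (qend - qstart + 1) / query_length
        if evalue < max_evalue and coverage > min_coverage:
            hits.append((subject, round(coverage, 3)))
    return hits


if __name__ == "__main__":
    # Hypothetical rows and query length, for illustration only.
    example = [
        "Rep\tchr9\t55.0\t350\t150\t4\t5\t354\t1000\t2050\t1e-90\t300",
        "Rep\tchr4\t60.0\t120\t40\t2\t10\t129\t500\t860\t1e-45\t150",
        "Rep\tscaffold134\t40.0\t340\t190\t6\t1\t340\t10\t1030\t1e-20\t90",
    ]
    print(full_length_hits(example, query_length=360))
```

In this toy input only the first row passes: the second fails the coverage threshold and the third fails the E value threshold.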
PCR primer walking and Sanger sequencing were done to validate the EGE sequence extracted from the genome assembly of M. integrifolia, and both sequences were identical. The sequence that was obtained by Sanger sequencing has been deposited in GenBank under accession MZ474517. Closer inspection of this locus revealed a monomeric unit of 3440 bp, which was partially duplicated immediately downstream (Fig. 1). By BLASTX comparison with the various begomovirus proteins, Rep, AC2, AC3 and AC4 genes and two AV1 genes in opposite directions were identified, all of which contained mutations that interrupted the ORFs (Fig. 1). At the cut-off level of mutations tolerated to find gene sequences, AC3 and AC4 genes were not found internally in the second and first copies of AV1 sequences, respectively (Fig. 1). A conserved geminivirus origin of replication (ori) with the sequence TAA TAT TAC was found. In addition, an ori sequence of TGA GAT TCC, which had the second base mutated compared to a Becurtovirus ori, was identified 262 nt upstream of the previous ori sequence. The final assemblage of the macadamia EGE (UZVR_3,440 bp), produced using overlapping sequences obtained from Macadamia integrifolia (Fig. 2a; GenBank accession MZ474517), predicted three putative partial conserved domains, including the geminivirus Rep catalytic domain, the geminivirus Rep protein central domain and the geminivirus C4 protein (Fig. 3). A geminivirus Rep protein central domain (AL1_M, 252-440 bp), Rep catalytic domain (AL1, 563-895 bp) and an N-terminus region of the C4 or AC4 protein (586-756 bp) were predicted in reading frames 1, 2 and 3 of the final EGE assemblage, respectively (Fig. 3a). The genome organization of the macadamia EGE was similar to that of monopartite begomoviruses. Although the sequences of the above domains were highly conserved in the macadamia EGE, most of the other begomovirus domains were not detected or showed a substantial level of mutations.
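Locating the conserved ori nonanucleotide reported above (TAATATTAC once the spaces are removed) amounts to an exact-match search on both strands. The Python sketch below shows one way to do this; the function names and the toy sequence are hypothetical, and the EGE sequence itself (GenBank MZ474517) is not reproduced here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return sorted (position, strand) pairs for exact matches of motif.

    Positions are 0-based and refer to the forward strand; a '-' hit means
    the motif occurs on the reverse-complement strand at that location.
    """
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)  # allow overlapping matches
    return sorted(hits)


if __name__ == "__main__":
    toy = "CCCTAATATTACGGG"  # toy sequence, for illustration only
    print(find_motif(toy, "TAATATTAC"))
```

The distance between two ori-like motifs, such as the 262 nt reported above, would then be the difference between their positions.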
Many begomoviruses carry betasatellite DNA components, which modulate symptoms and enhance viral DNA accumulation but are nonessential for viral replication. A protein encoded by a complementary sense ORF (βC1) on the betasatellite molecule is the major determinant of pathogenicity [14]. To examine whether the ancestral geminivirus in M. integrifolia had a betasatellite, a tBLASTN search of the aforementioned macadamia genome assembly was done using the C1 protein sequence of a tomato begomovirus betasatellite DNA (GenBank Accession NP_859031) as the query but no significant hits were obtained.

Transcriptional activity and replication competency of EGE in macadamia
While sequence decay was observed at the EGE locus that was examined, it may still be possible that an infective geminivirus genome could be released through several DNA recombination events. Alternatively, the plant may be infected with a descendant of the virus that became endogenized. Two indicators of geminivirus infection are the presence of RNA transcripts and of circular DNA molecules, and experiments were done to address each of these points. The search for transcription of the virus genome, with RT-PCR assays targeting the conserved Rep gene using RNA extracts from eight samples including both AVG and healthy trees, revealed no amplification. For the NADH internal control, the primer-binding sites were located in exons that were separated by an intron, allowing amplicons from carryover DNA to be separated by size from those obtained with the spliced RNA transcript. Accordingly, two amplicons of approximately 850 bp (DNA) and 188 bp (mRNA) were obtained with the NADH primer pair using a total nucleic acid extract as template (Fig. 4). Finally, we were unable to detect circular forms of the geminiviral DNA using a sequence non-specific amplification assay (TempliPhi), which also suggested that there were no active infections (Supplementary Fig. 2).
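The logic of the NADH control can be made explicit with a small Python sketch: a band near 850 bp indicates template DNA that retains the intron, a band near 188 bp indicates spliced transcript, and the difference gives an approximate intron length of 662 bp. The size tolerance and the function name are assumptions for illustration; gel band sizes are estimates only.

```python
GENOMIC_BP = 850   # approximate NADH amplicon from DNA (intron retained)
SPLICED_BP = 188   # approximate NADH amplicon from spliced mRNA
INTRON_BP = GENOMIC_BP - SPLICED_BP  # approximate intron length implied by the two sizes


def classify_amplicon(size_bp, tolerance_bp=30):
    """Assign an observed band size to a template type.

    tolerance_bp is a hypothetical allowance for gel sizing error.
    """
    if abs(size_bp - GENOMIC_BP) <= tolerance_bp:
        return "DNA (intron retained)"
    if abs(size_bp - SPLICED_BP) <= tolerance_bp:
        return "spliced mRNA"
    return "unexpected"


if __name__ == "__main__":
    for band in (850, 190, 500):
        print(band, classify_amplicon(band))
```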

Presence of EGE orthologues in the four Macadamia species in Proteaceae
To search for orthologues of the 3.4 kb EGE locus in the four Macadamia species, PCR assays were designed to amplify the junction between plant and viral sequence at either end of the locus. The expected amplicons for both junction sequences were present in all the macadamia species (Fig. 5a, b; Supplementary Fig. 1). Furthermore, two additional amplicons representing the upstream junction sequence were obtained for M. jansenii (M51) (Fig. 5a; Supplementary Fig. 1), indicating a more complicated integration pattern in this plant species. When the analysis was extended to a more distantly related member of the Proteaceae, Buckinghamia celsissima (Ivory curl), evidence was obtained for the absence of the locus using the PCR for the upstream junction sequence, as well as using a PCR for the Rep gene sequence (Fig. 5c; Supplementary Fig. 1). Compared to the EGE sequences of M. integrifolia and M. tetraphylla, partial EGE sequences (~600 bp) developed in M. jansenii and M. ternifolia showed 60% and 10% mutations, respectively (Fig. 2b, c). The number of mutations is likely to reflect the genetic relatedness of the Macadamia species.

Fig. 4 PCR amplicons of DNA and cDNA to evaluate the transcriptional activity of Rep protein of EGE in macadamia. a PCR products of total nucleic acid with NADH primer pair (Nad2.1a and Nad2.2b) in 1% agarose gel. b PCR amplicons, which were obtained from cDNA with Rep and NADH primers, in a 1% agarose gel
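The text does not state how the percentages of mutations between species were calculated. One common approach, shown in the Python sketch below as an assumption and not as the method used here, is to count differing columns in a pairwise alignment. The function name is hypothetical, gaps in one sequence are counted as differences, and columns where both sequences have a gap are ignored.

```python
def percent_difference(aligned_a, aligned_b):
    """Percentage of alignment columns at which two aligned sequences differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" and base_b == "-":
            continue  # column carries no information for this pair
        compared += 1
        if base_a != base_b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * differences / compared


if __name__ == "__main__":
    # Toy alignment for illustration only: one difference in ten columns.
    print(percent_difference("ACGTACGTAC", "ACGTACGTAA"))
```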

Discussion
In this study we have characterized the EGE in M. integrifolia that was first identified by Sharma et al. [2], and provided compelling evidence that it is replication-defective and therefore very unlikely to be a direct cause of AVG. The EGE in Macadamia appears to be accumulating random mutations, suggesting that there is not strong selection pressure for maintenance of the viral ORFs. Our study using RT-PCR confirmed the transcriptional inactivity of Rep sequence of EGE in macadamia. Further, a draft genome of the EGE in macadamia was developed.
Our results suggest that the insertion of the EGE in the macadamia genome preceded speciation of the genus Macadamia. Complex evolutionary associations between viruses and their hosts have been revealed by endogenous viral elements [15]. Most endogenous viral elements are functionally defective in hosts, but some endogenous viral elements have retained their role in some hosts or may acquire certain functions that are advantageous to the host [16,17]. The fact that the EGE has persisted through several host radiation events without being expunged suggests that it is serving some useful function to the plant. Australia, the place of origin of Macadamia, has ancient soils that are deficient in both nitrogen and phosphorus, two building blocks of DNA. Hence, there presumably has been strong selection pressure against the development of genome obesity in Macadamia through the accumulation of unwanted repetitive DNA elements because of the extra nutritional demands placed on the plant [18]. The insertion of either an endogenous viral element or a retrotransposon into or near a gene can alter the activity and function of that gene by a variety of mechanisms, such as by contributing novel transcriptional regulatory motifs or by inducing transcriptional silencing [19][20][21]. Insertion of the EGE could therefore have enabled rapid change in the metabolism and therefore phenotype of the plant in the face of a new environmental stress. The EGE could also confer resistance to the cognate exogenous virus if it should still exist [17].
Mutations and recombination events in the EGE in macadamia may have led to the reduction in the number of conserved domains found in the Macadamia EGE. Transcriptionally inactive sequences of EGE in macadamia also indicate that the sequences are methylated or decayed. Mutational decay is the major cause for endogenous viral elements to be defective and for reduction in the number of conserved domains [4,22].
The absence of circular forms of geminiviral elements indicates that the EGE is not activated to be the origin of new viral infections. Endogenized viral sequences may be activated under stress conditions, resulting in disease in the host. For example, stress factors such as micropropagation by tissue culture were found to trigger the activation of endogenous banana streak viruses in Musa cultivars [23]. The heavily mutated ORF of the Rep sequence of the EGE confirmed that the sequence is replication-defective and thus not activated to cause any diseases such as AVG. A functional Rep protein is required for the replication of geminiviral DNA [24]. In particular, motifs 1-3 and helices 1 and 2 in the N-terminal domain of the Rep protein of geminivirus are crucial for the replication of Tomato golden mosaic virus DNA [25]. The heavily mutated motifs and helices of the Rep sequence of the EGE are another indication of its inactivity.
The Macadamia EGE falls within 'clade 2' of unassigned EGEs, a clade that also includes EGEs from Camellia, Coffea, Tectona, Diospyros and Argania [2]. Collectively these genera are palaeotropical in distribution, making it difficult to determine the precise origin of this group of viruses. However, at least two of the genera of host plants, Macadamia and Coffea, which form a separate sub-clade [2], have clear Gondwanan links. Coffea has centres of origin in central Africa and Madagascar, whereas Macadamia is uniquely Australian and belongs to the Proteaceae, which almost exclusively has a Southern Hemisphere distribution. The genome organization of the Macadamia EGE appears to be unique, as virion-sense and complementary-sense genes overlap, which is not observed in any extant geminiviruses. However, the phylogenetic analysis of the Rep sequence of the EGE with other geminiviruses has shown that the 'clade 2' EGEs are basal to the genus Begomovirus [2], which also has centres of diversity in South America and India, and therefore it would seem likely that this whole group of viruses evolved in Gondwana.

Conclusions
The EGE in Macadamia is replication-defective and thus not the origin of new viral infection. Therefore, it is very unlikely that EGE in Macadamia is a direct cause of AVG. The insertion of the EGE in the macadamia genome preceded speciation of the genus Macadamia and it is likely that the EGE is playing some useful role in the plant.

Plant materials and nucleic acid extractions
Leaf samples were collected from commercial macadamia trees with the permission of the owners at each private farm in Bundaberg and from the wild accessions in ex-situ germplasm conservation site in Queensland with the permission of the lead program manager (Professor Bruce Topp, The University of Queensland, National Macadamia Breeding and Conservation). We also collected leaf samples from the common ornamental tree B. celsissima, which is a close relative of macadamia and commercially available for ornamental purpose in Australia. All relevant institutional, national, and international guidelines, legislation and protocols were followed for the collection of the samples [26]. The identity of all plant materials was confirmed by O. A. Akinsanmi, The University of Queensland. Sampling methods used in this study are described in Zakeel et al. [27]. Sources and specimens for all plant materials used in this study are accessible in public collection in Australia. Total nucleic acids and pure RNA were extracted from the lamina and midrib of leaf samples using the CTAB method of Rogers and Bendich [28], and the TRIzol reagent as per the manufacturer's instructions, respectively. RNA was quantified using a μDrop Plate (Thermo Scientific).

Characterisation of the EGE in the macadamia genome
EGEs were identified in the M. integrifolia genome (GenBank Accession GCA_900631585.1) [29] by doing a tBLASTN search using the Rep protein of TLCV (NCBI protein database accession P36279) as the query sequence. The previous GenBank accession (UZVR00000000; NCBI BioProject: PRJEB13765) of the M. integrifolia genome has the same sequence for this locus. The endogenous geminiviral sequences were then extended by comparing different loci containing identical or near-identical sequences by doing pairwise sequence alignments using BLASTN. To derive a consensus sequence (Supplementary Data 1), fragments of sequence from different loci were assembled using the contig assembly algorithm in Geneious v. 10.2.5 (Biomatters Ltd) operated using default settings.
Using the Mac_Gem F2 and Mac_Gem R2 primer pair (Table 1), a PCR was carried out with total nucleic acids extracted from AVG-symptomatic and healthy macadamia trees and the amplicons were sequenced to verify the presence of the viral locus in the NGS assembly. Each PCR was done in a 25 μl reaction containing 2 μl of total nucleic acid, 1 μl of 10× MangoTaq reaction buffer (Bioline), 4 mM MgCl2, 200 μM of each dNTP, 500 nM of each primer, 2% DMSO, 0.04 μg/μl BSA, 1 unit of MangoTaq DNA polymerase (Bioline) and nuclease-free water to the final volume. Thermocycling conditions were an initial denaturation at 95 °C for 5 min, followed by 35 cycles of denaturation at 95 °C for 30 s, annealing at 46 °C for 30 s and extension at 72 °C for 30 s, with a final extension step at 72 °C for 5 min. Amplicons were separated in a 1% agarose gel in 0.5× TBE, stained in ethidium bromide for 10 min and visualized on a UV transilluminator. PCR products with an expected size of 870 bp were sequenced at Macrogen Inc. (Seoul, South Korea).
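The reaction set-up above involves two routine calculations: the stock volume needed to reach a final concentration (C1V1 = C2V2) and the total programmed cycling time. The Python sketch below illustrates both. The stock concentrations (10 μM primer, 10 mM dNTP) are assumptions for illustration, as they are not stated for this reaction, and ramping time between steps is ignored.

```python
def stock_volume_ul(final_conc, stock_conc, reaction_volume_ul):
    """Volume of stock to add, from C1 * V1 = C2 * V2 (same units for both concentrations)."""
    return final_conc * reaction_volume_ul / stock_conc


def program_time_min(initial_s, cycles, step_seconds, final_s):
    """Total programmed time in minutes, excluding ramping between temperatures."""
    return (initial_s + cycles * sum(step_seconds) + final_s) / 60


if __name__ == "__main__":
    # 500 nM (0.5 uM) primer in 25 ul from an assumed 10 uM stock.
    print(stock_volume_ul(0.5, 10, 25))
    # 200 uM of each dNTP in 25 ul from an assumed 10 mM (10000 uM) stock.
    print(stock_volume_ul(200, 10000, 25))
    # 95 C for 5 min; 35 cycles of 30 s + 30 s + 30 s; 72 C for 5 min.
    print(program_time_min(300, 35, (30, 30, 30), 300))
```

Under these assumptions, each primer would require 1.25 μl and the dNTP mix 0.5 μl per reaction, and the programme runs for 62.5 min before ramping is taken into account.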
To validate the whole genome shotgun sequence assemblies of loci containing EGEs, PCR products that overlapped each other by 150-200 bp were generated using the primers listed in Table 2. Each 50 μl PCR mix contained 1× MangoTaq reaction buffer (Bioline), 4 mM MgCl2, 200 μM of dNTPs, 200 nM of each primer, 2% DMSO, 0.04 μg/μl BSA, 2 units of MangoTaq DNA polymerase (Bioline), 2 μl total nucleic acid template (≤ 10 ng/μl) and nuclease-free water to the final volume. PCR was done using the same thermocycling conditions described above. PCR products were purified using QIAGEN PCR purification or gel purification kits according to the manufacturer's instructions and sequenced at Macrogen Inc., Seoul, South Korea, using the Sanger sequencing method. Sequences were trimmed, processed and assembled in Geneious.

Table 1 (fragment) Mac_Gem R2: CTT AAT GCA TCA TTT ACT GAAC
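The primer-walking design described in this section, with neighbouring PCR products overlapping by 150-200 bp, can be checked programmatically. The Python sketch below is an illustration only: the amplicon coordinates are hypothetical (the real primer positions are in Table 2, which is not reproduced here), coordinates are 1-based and inclusive, and the 3751 bp locus length is the pseudochromosome 9 stretch mentioned earlier.

```python
def tiling_overlaps(amplicons):
    """Overlap in bp between consecutive amplicons given as (start, end) pairs.

    A negative value means there is a gap between two neighbouring amplicons.
    """
    ordered = sorted(amplicons)
    return [
        prev_end - next_start + 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def tiling_is_valid(amplicons, locus_length, min_overlap=150, max_overlap=200):
    """True if amplicons span the locus and every overlap is within range."""
    ordered = sorted(amplicons)
    spans_locus = ordered[0][0] <= 1 and ordered[-1][1] >= locus_length
    overlaps_ok = all(
        min_overlap <= overlap <= max_overlap for overlap in tiling_overlaps(ordered)
    )
    return spans_locus and overlaps_ok


if __name__ == "__main__":
    # Hypothetical amplicon coordinates across a 3751 bp locus.
    design = [(1, 900), (726, 1600), (1431, 2350), (2176, 3100), (2926, 3751)]
    print(tiling_overlaps(design))
    print(tiling_is_valid(design, locus_length=3751))
```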

RT-PCR
To investigate whether the EGE in macadamia is transcribed, reverse transcription PCR was done using the primers listed in Table 1. As templates, total RNA extracts from leaves of eight trees, including those from AVG and non-AVG trees, were used (Table 3). As an internal control, RT-PCRs were done to detect the single copy mitochondrial gene NADH using the primers of Thompson et al. [30], except that the forward primer was modified so that it no longer spanned exon junctions and amplified both the DNA copy of the gene and its spliced RNA transcript (Murray Sharman, Pers. Comm.; Table 1). Amplicons arising from contaminating DNA in the RNA extract were able to be distinguished by size, as the DNA contains an intron. To synthesize cDNA, 2 μl of 10 μM reverse primer (either Mac_Gem_RepR1 or Nad2.2b), 2.5 μl nuclease-free water and 3 μl freshly prepared RNA were incubated at 80 °C for 10 min, followed by snap-chilling on ice. An aliquot of 4 μl of master mix containing 2.5× first strand buffer, 25 mM DTT, 1.25 mM dNTPs, 50 units SuperScript III (Invitrogen) reverse transcriptase and 10 units RNaseOUT was then added to the tube. Reverse transcription was done at 55 °C for 45 min, then the enzyme was denatured at 70 °C for 10 min. PCR was done in a 25 μl reaction volume containing 1 μl of

Rolling circle amplification
To amplify circular DNA molecules, rolling circle amplification (RCA) was done using a TempliPhi 100 DNA amplification kit (GE Healthcare, Buckinghamshire, United Kingdom) as per the manufacturer's protocol but with the reaction mixture spiked with 1 μl each of the primers Mac_Gem F2 and Mac_Gem R2 at 10 μM concentration to improve the sensitivity of detection. The plasmid pUC19 was used as a positive control. Three μl of RCA product was then digested with 10 units of either BamHI, EcoRI or HindIII (New England BioLabs Inc.) at 37 °C for 24 h in a 10 μl reaction mixture containing 1× CutSmart® buffer (New England BioLabs Inc.). The restriction enzymes were selected based on the presence of recognition sites in the EGE (EcoRI and HindIII) or in TLCV (BamHI), which was initially used to identify the EGE in the macadamia genome. Digested products were separated in a 1% agarose gel, stained in ethidium bromide and visualized under UV illumination.
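Choosing enzymes by the presence of recognition sites, as described above, can be illustrated with an in silico digest of a circular molecule. The Python sketch below uses the recognition sequences of EcoRI (G^AATTC), HindIII (A^AGCTT) and BamHI (G^GATCC); the function name and the toy circular sequence are hypothetical. An enzyme that cuts a circular unit once would be expected to release unit-length fragments from the concatemeric RCA product.

```python
# Recognition site and cut offset (bases from the start of the site, top strand).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "BamHI": ("GGATCC", 1),
}


def circular_digest(seq, enzyme):
    """Return sorted fragment sizes after digesting a circular sequence.

    An empty list means the enzyme has no site and the circle stays uncut.
    """
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    length = len(seq)
    # Append the first bases so that sites spanning the origin are found.
    extended = seq + seq[:len(site) - 1]
    cuts = []
    start = extended.find(site)
    while start != -1:
        cuts.append((start + offset) % length)
        start = extended.find(site, start + 1)
    cuts.sort()
    if not cuts:
        return []
    # Distance from each cut to the next one around the circle;
    # a single cut linearises the molecule into one full-length fragment.
    return sorted(
        ((cuts[(i + 1) % len(cuts)] - cut) % length) or length
        for i, cut in enumerate(cuts)
    )


if __name__ == "__main__":
    toy_circle = "GAATTC" + "A" * 10 + "GAATTC" + "C" * 8  # toy 30 bp circle
    print(circular_digest(toy_circle, "EcoRI"))
    print(circular_digest(toy_circle, "BamHI"))
```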

Search for EGE orthologues in the four Macadamia species
To search for orthologues of the EGE in the different Macadamia spp., PCR primers were designed to anneal on either side of the junction between plant and geminiviral sequence (