Genome and transcriptome sequencing characterises the gene space of Macadamia integrifolia (Proteaceae)
- Catherine J. Nock†1Email author,
- Abdul Baten†1,
- Bronwyn J. Barkla1,
- Agnelo Furtado2,
- Robert J. Henry2 and
- Graham J. King1
© The Author(s). 2016
Received: 14 July 2016
Accepted: 5 November 2016
Published: 17 November 2016
The large Gondwanan plant family Proteaceae is an early-diverging eudicot lineage renowned for its morphological, taxonomic and ecological diversity. Macadamia is the most economically important Proteaceae crop and represents an ancient rainforest-restricted lineage. The family is a focus for studies of adaptive radiation due to remarkable species diversification in Mediterranean-climate biodiversity hotspots, and numerous evolutionary transitions between biomes. Despite a long history of research, comparative analyses in the Proteaceae and macadamia breeding programs are restricted by a paucity of genetic information. To address this, we sequenced the genome and transcriptome of the widely grown Macadamia integrifolia cultivar 741.
Over 95 gigabases of DNA and RNA-seq sequence data were de novo assembled and annotated. The draft assembly has a total length of 518 Mb and spans approximately 79% of the estimated genome size. Following annotation, 35,337 protein-coding genes were predicted of which over 90% were expressed in at least one of the leaf, shoot or flower tissues examined. Gene family comparisons with five other eudicot species revealed 13,689 clusters containing macadamia genes and 1005 macadamia-specific clusters, and provides evidence for linage-specific expansion of gene families involved in pathogen recognition, plant defense and monoterpene synthesis. Cyanogenesis is an important defense strategy in the Proteaceae, and a detailed analysis of macadamia gene homologues potentially involved in cyanogenic glycoside biosynthesis revealed several highly expressed candidate genes.
The gene space of macadamia provides a foundation for comparative genomics, gene discovery and the acceleration of molecular-assisted breeding. This study presents the first available genomic resources for the large basal eudicot family Proteaceae, access to most macadamia genes and opportunities to uncover the genetic basis of traits of importance for adaptation and crop improvement.
KeywordsMacadamia Proteaceae Rainforest Gene space Genome Transcriptome Crop
Early-diverging lineages can provide important insight into genomic evolution [1, 2]. The Proteaceae is a large Gondwanan plant family belonging to the ‘basal’ eudicots, a paraphyletic group comprising several lineages that diverged prior to the origin and spectacular radiation of largest clade of flowering plants, the ‘core’ eudicots [3, 4]. Extensive morphological and ecological diversity in the Proteaceae make it a focus for studies of adaptive radiation and biome evolution (e.g. [5–8]). The long-held view of rainforest ancestry for the Proteaceae is challenged by recent fossil evidence for a great diversity and abundance of major lineages in open, fire-prone habitats in central Australia during the late Cretaceous . Although species diversity is highest in regions with Mediterranean climates including biodiversity hotspots in Southwest Australia and South Africa, generic diversity is highest in rainforests [10, 11].
Macadamia is the most economically important Proteaceae crop. The industry is based on cultivars developed from the Australian subtropical trees Macadamia integrifolia, M. tetraphylla and hybrids [12, 13]. Commercially-grown cultivars are diploid (2n = 28), highly heterozygous and closely-related to their wild progenitors [14–16]. All four Macadamia species are rare and threatened, and the lowland rainforest ecosystems to which they contribute are listed as critically endangered [17, 18]. The subtropical rainforests of eastern Australia are centres of plant endemism, with high rainfall and low fire frequency that acted as stable refugia through Quaternary glaciation and interglacial periods . This habitat is in contrast to the open, fire-prone habitats that support the majority of extant Proteaceae species.
Rainforests are biodiverse and tree survival depends on long-term defense strategies to respond to the biotic stresses imposed by a broad range of insect herbivores and pathogens . Genome sequencing of the rainforest fruit tree Theobroma cacao revealed an expansion of plant resistance (R) genes, and in particular a group of LRR-RLK receptor protein kinase genes involved in pathogen recognition . In comparison to other eudicots, including the model tree Populus trichocarpa, there was also evidence for expansion of flavonoid and monoterpene-related genes involved in plant defense, insect resistance and floral scent. While little is known of the defense arsenal of Macadamia, cyanogenic glycosides have been identified and cyanide has been detected in seedlings [22, 23]. Cyanogenesis is the production of hydrogen cyanide in response to wounding or attack by herbivores. Although this defense strategy is rare among plants including rainforest trees, it is more common in food plants and in the Proteacaeae subfamily Grevilleoideae to which Macadamia belongs [23–25]. Insect herbivores and fungal pathogens are a major cause of yield reduction in macadamia production and the identification of genes that may confer natural resistance would be of great benefit for crop improvement.
Whole genome sequences have been developed for many crop species accelerating the discovery of genes underlying agriculturally important traits [26, 27]. For perennial tree crops such as macadamia with long generation times, selective breeding is a protracted and expensive process. Genomic information can improve the efficiency and precision of plant breeding through marker-assisted selection . Sequence data for macadamia is very limited and the composition of the Proteaceae genome is unknown. Given its position as a large early-diverging eudicot family, its role as a model for adaptive radiation, and the economic importance of macadamia we aimed to characterise the gene space of Macadamia integrifolia through genome and transcriptome sequencing, assembly and annotation.
Genome sequencing and assembly
Macadamia integrifolia genome and transcriptome sequencing, assembly and annotation statistics
Reads post QC millions
Nucleotides post QC gigabases
Illumina GAIIx 480 bp Insert (2x150 bp PE)
Illumina GAIIx 700 bp Insert (2x150 bp PE)
Illumina HiSeq 8000 bp Insert (2x100 bp MP)
Illumina HiSeq Flower (2x100 bp PE)
Illumina HiSeq Shoot (2x100 bp PE)
Illumina HiSeq Leaf (2x100 bp PE)
Minimum size (bp)
Maximum size (bp)
Total assembly length (Mb)
Number of transcripts
Maximum transcripts length (bp)
Minimum transcript length (bp)
Mean transcript length (bp)
Standard deviation (bp)
Total length (bp)
Number of gene models
Average gene length (bp)
Average coding sequence length (bp)
Gene models similar to Arabidopsis thaliana TAIR10a
Gene models similar to Nelumbo nucifera a
Eukaryotic 458 CORE genes availablea
Transcriptome assembly, gene prediction and functional annotation of proteins
Transcriptome assembly using the quality controlled reads from three cDNA libraries (flower bud, young leaf and shoot) in Trinity de novo generated 298,030 contigs (transcripts) including different isoforms per contig. The transcripts had an N50 size of 1339 bp, mean transcript length of 823 bp, maximum transcript length of 17,814 bp and minimum transcript length of 224 bp (Table 1). Initial transcripts were clustered to generate a final set of 244,925 transcripts, which were used as one source of evidence in the evidence-based gene model prediction pipeline. Final annotation using MAKER pipeline and assembled transcripts produced 35,337 high-confidence gene models. Of these, 90.3% (31,908) were supported by expression values, FPKM (fragments per kilobase of transcript per million mapped reads) of 1 or more, and 87.6% (30,940) were supported by at least two RNA-seq reads. Although 3430 gene models lacked RNA-seq read support it is important to note that RNA-seq data was collected from flower, leaf and shoot tissue only. Over 78 and 74% of predicted proteins had at least one significant BLASTP hit (1E-05) against Nelumbo nucifera or Arabidopsis thaliana proteins respectively.
Core eukaryotic genes (CEGs) are 248 highly conserved genes understood to be present in virtually all eukaryotes in a reduced number of paralogs . Among flowering plants, 959 single copy genes have been identified that are shared between Arabidopsis, Oryza, Populus and Vitis . More than 84% of these single copy genes (809 genes) and 96% of CEGs (237 genes) had a significant BLASTP hit (1E-05) against the predicted macadamia genes. Assessment of annotation completeness with BUSCO (benchmarking universal single-copy orthologs)  indicates that the macadamia gene space contains 77.4% of the expected content. Using a 429 single-copy eukaryote gene set, 192 complete single-copy, 90 complete duplicated, 140 fragmented, and 97 missing universal single-copy genes were identified. This compares to 94.6% (23 missing) and 89.7% (44 missing) of the expected content in the high-quality genome assemblies of Eucalyptus grandis and Nelumbo nucifera respectively. In total, 19,794 macadamia genes were assigned to 33,291 InterProScan (IPR) domains and 39,925 GO terms. Predicted macadamia genes with a significant BLAST hit in KASS (KEGG Automatic Annotation Server) were assigned to 349 known metabolic or signalling pathways. The metabolic pathway (ko01100) contained the largest number of genes (826), followed by biosynthesis of secondary metabolites (ko01110, 386 genes), biosynthesis of antibiotics (ko01130, 188 genes) and microbial metabolism in diverse environment (ko01120, 147 genes).
Gene family analysis
Hypergeometric test for significantly enriched biological process gene ontology (GO) terms of macadamia-specific gene clusters compared to those identified among six eudicot species
Six species, total
immune response-regulatory signaling
detection of bacterium
regulation of anion channel activity
defense response signaling pathway
terpenoid biosynthetic process
geranyl diphosphate metabolic process
monoterpene biosynthetic process
obsolete ATP catabolic process
alkaloid metabolic process
(1- > 3)-beta-D-glucan biosynthetic process
There was also evidence for an expansion of genes involved in terpenoid biosynthesis. In total, 78 macadamia gene models were functionally classified using Interpro as belonging to the terpene synthase gene family. Of these, 30 had high protein sequence similarity (1E-025) in BLASTP comparisons to Arabidopsis thaliana TPS-b monoterpene synthases. Among macadamia-specific clusters, significantly enriched GO terms included 25 predicted genes in six clusters involved in terpenoid biosynthetic process (GO:0016114, P = 0.036), and in particular biosynthesis of monoterpenes through geranyl diphosphate metabolic process (GO:0033383, P = 0.004) and monoterpene biosynthetic process (GO:0043693, P = 0.013). Monoterpenes, or C10 isoprenoids are components of essential oils and fragrance in aromatic plants with roles in pollinator attraction, plant-plant interaction and defense with potential as pesticides and antimicrobial agents. While the functionality of these putative genes is yet to be tested, these results suggest that there may have been a lineage-specific expansion in macadamia of gene families involved in monoterpene synthesis.
Identification of candidate genes potentially involved in cyanogenic glycoside biosynthesis
Candidate genes for cyanogenesis in macadamia
Macadamia gene model
CYP79D15, AC genec
CYP71B16 Cytochrome P450a
CYP71B20 Cytochrome P450a
CYP71B34 Cytochrome P450a
CYP71A1 Cytochrome P450d
UGT85B1 Cyanohydrin glucosyltransferaseb
Cyanogenic beta-glucosidase, LI genec
(R)-mandelonitrile lyase, MDL1 genee
Illumina shotgun sequencing was used to develop a draft assembly of M. integrifolia, the first for the large basal eudicot family Proteaceae. A de novo assembly was constructed with 51.57 Gb of quality-filtered DNA sequence data. Transcriptome assembly from 44.6 Gb of RNA-Seq data from leaf, shoot and flower tissue generated 244,925 transcripts. These were used as reference ESTs, and with the proteins of Nelumbo nucifera and Arabidopsis thaliana provided sources of evidence in the gene model prediction pipeline . Using MAKER, 35,337 protein-coding genes were predicted of which over 90% were expressed in at least one of the green tissues examined. Subsequent evaluation of these gene models showed significant similarity to 96% of core eukaryotic genes  and 84% of single copy genes shared by the angiosperm taxa Arabidopsis, Oryza, Populus and Vitis  indicating that our assembly covers much of the functional gene space of macadamia. In comparison to the eudicots Arabidopsis, Vitis, Populus, Eucalyptus and Nelumbo, 1005 gene families were specific to macadamia. The closest available complete genome sequence, that of the aquatic sacred lotus Nelumbo nucifera , is over 110 million years divergent based on fossil evidence and dated molecular phylogenies [5, 36]. Macadamia and Nelumbo shared 587 gene clusters, the highest between any two taxa compared here and consistent with their relatively close taxonomic positions of Proteaceae and Nelumbonaceae among basal eudicot families.
Quality assessment of the draft genome assembly as determined by technical measurements including the number of scaffolds (193,493) and N50 (4745) indicate that it is fragmented in comparison to completed plant genomes, and further work is required to develop a more contiguous genome with scaffolds anchored to chromosomes. However, quality assessment based on expectations of gene content using BUSCO sets  indicate that 77.4% of the expected gene content is represented in our assembly. This compares to 94.6 and 89.7% in the comprehensively assembled and annotated genomes of Eucalyptus grandis  and Nelumbo nucifera  respectively. Ongoing efforts to improve coverage and reduce fragmentation include deeper short read genome sequencing, incorporation of longer PacBio reads, transcriptome sequencing of additional tissues and the development of a high-density genetic linkage map. Macadamia integrifolia is a diploid species with a haploid number of 14 chromosomes . There are no previously published estimates of genome size. In a recent extensive assessment of Proteaceae genome size from flow cytometry-based estimates, a 60-fold range was reported. Most Grevilleiodeae species, however, had relatively small genome sizes with 1C values from 0.64 to 2.87 pg (~625 to 2800 Mb) genome . The kmer-based estimate of 652 Mb from this study is relatively small compared to closely-related species, and suggests that the draft assembly spans approximately 79% of the genome.
Evidence for expansion of plant defense-related gene families
Rainforests are among the oldest and most diverse ecosystems . Australian subtropical rainforests in particular, are ancient refugia with high levels of plant species richness, endemism and rainfall . Recent evidence suggests that insects and pathogens are instrumental in the maintenance of plant species diversity in rainforests . Likewise, elevated predator-pathogen pressure is hypothesised to increase and diversify plant chemical defense systems. Plants have developed a wide range of defense systems to respond to the biotic stresses exerted by the predators and pathogens with which they have co-evolved [41, 42]. Expansion of the receptor-like kinase genes in particular is purportedly in response to fast-evolving pathogens . Comparative genomic analyses suggests that there has been a lineage specific expansion in macadamia of gene families with similarity to Arabidopsis LRR receptor-like serine threonine-protein kinases EFR and FLS2. These encode proteins that play a key role in pathogen recognition and the activation of plant defense response [44, 45] and it has been demonstrated that Arabidopsis EFR enhances bacterium resistance in dicot and monocot transgenic plants including rice . Further research is needed to identify the complete suite of macadamia plant resistance and defense genes and to determine whether polymorphism at sites on candidate genes is associated with resistance to co-evolving pathogens in macadamia as has been previously reported in Arabidopsis [47, 48]. Future growth in macadamia global production is expected following rapid expansion of cultivation and demand particularly in Asia. Germplasm collections, including clones of wild and domesticated trees have been established. These resources, along with wild populations undoubtedly contain genetic variants of interest for breeding, including improved yield, nutritional benefits, pest resistance and capacity to grow under variable climatic conditions. Insect herbivores and microbial pathogens are a major cause of yield reduction in macadamia production and identification of natural resistance would be of benefit for crop improvement.
Genes involved in cyanogenesis
Cyanogenesis is a plant chemical defense response to generalist herbivores involving the release of hydrogen cyanide following tissue disruption and hydrolysis of cyanogenic glycosides (CGs) [49, 50]. Endogenous recycling without cyanide release suggests that CGs serve additional biological roles including nitrogen and carbon supply at specific plant developmental stages  and there is evidence that intermediate compounds produced during biosynthesis of CGs have anti-microbial activity [52–54]. While relatively few plant species are cyanogenic, they are over-represented among food plants  and are common in the Proteacaeae, particularly in the subfamily Grevilleiodeae to which macadamia belongs [23, 24].
In macadamia, cyanide has been detected in seed, root, cotyledon and leaf tissue . While levels in mature kernels of the commercial species M. integrifolia and M. tetraphylla are extremely low, they are much higher in the bitter mature kernels of M. ternifolia and M. jansenii. In almond Prunus amygadalus, bitterness of the kernel is determined by the content of the cyanogenic diglucoside amygdalin . Intraspecific and temporal variation in cyanogenic capacity, and acyanogenic individuals have been reported in a number of cyanogenic plant taxa (e.g. [56, 57]). In white clover Trifolium repens, inheritance follows a Mendelian two-locus model. The Ac/ac (CYP79D) gene controls production of cyanogenic glycosides, and the Li/li (cyanogenic β-glucosidase) gene controls their hydrolysis . There is an apparent selective advantage for acyanogenic individuals in colder climates, and polymorphism is maintained within populations through recurrent gene deletions over time .
We identified macadamia homologues with high sequence similarity to genes encoding enzymes involved in CG biosynthesis in other cyanogenic plants including Sorghum bicolor and Trifolium repens. Based on the relatively high RNA-seq expression values in green tissue, six homologues to genes encoding the enzymes CYP79, CYP71, UGT85 and β-glucosidase are probable candidates in macadamia to target for further analysis. The discovery of candidate cyanogenesis genes in macadamia is likely to be an important step in facilitating the utilization of the smaller tree species M. ternifolia and M. jansenii into breeding programs to reduce tree size while retaining kernel edibility. In previous studies, 28% of the Proteaceae species tested were cyanogenic. This compares to 4.5% of 401 species from 87 families in Australian rainforests, and 4% of Eucalyptus species [24, 60]. The high proportion of cyanogenic plants in Proteaceae indicates that cyanogenesis is an important defense strategy in this family. Further work is planned to validate candidate genes, screen wild macadamia germplasm for natural variants and investigate the interaction between pest resistance, climatic variation and cyanogenesis in macadamia and more broadly across the Proteaceae.
This study presents the first available genomic resources for the large basal eudicot family Proteaceae and provides a platform for comparative genomics. As a recently domesticated subtropical tree crop with a long generation time, macadamia presents unique challenges for crop improvement. Macadamia breeding and the utilisation of wild germplasm resources is presently restricted by a paucity of genomic information. We have assembled genome and transcriptome sequence data and here introduce the gene space of Macadamia integrifolia as a resource to access to most macadamia genes. This presents opportunities to uncover genes and markers associated with variation in traits of importance for conservation, domestication and crop improvement.
Fresh plant tissue was collected from a Macadamia integrifolia, cultivar 741 ‘Mauka’ individual from the Macadamia Varietal Trial plantation M2 at Clunes, New South Wales, Australia and stored at -80 °C. A voucher specimen is deposited in the Southern Cross University herbarium [accession PHARM-13-0813]. Prior to DNA and RNA extraction, leaf tissue was frozen in liquid nitrogen and ground using a tissue lyser (MM200, Retsch, Haan, Germany).
Genomic DNA isolation and sequencing
Total genomic DNA was extracted using a DNeasy Plant Maxi kit (Qiagen Inc., Valencia, USA) for all DNA sequencing with the exception of mate pair (MP) library sequencing where DNA was extracted using a CTAB-based method developed for next-generation sequencing . DNA was quantified using a Qubit dsDNA BR assay (Life Technologies, Carlsbad, USA). Genomic DNA was sheared using a Covaris S220 focused-ultrasonication device (Covaris Inc., Woburn USA). Paired-end libraries (PE) with average insert sizes of 480 and 700 bp and an 8 kb MP library were prepared using Illumina TruSeq DNA Sample Preparation kit v2 following manufacturer’s instructions (Illumina, San Diego, USA). Fragment size distribution and concentration were determined using a DNA 1000 chip on a Bioanalyser 2100 instrument (Agilent Technologies, Santa Clara, USA). PE and MP libraries were sequenced with Illumina GA IIx (150 x 2 cycles) and HiSeq 2000 (100 x 2 cycles) instruments respectively.
Genome assembly and scaffolding
Paired-end sequence reads were trimmed to remove low quality bases and adapter sequences and de novo assembled using CLC Genomics Workbench (CLC) version 6.5 (CLC Bio, Aarhus, Denmark) that has been used in the assembly of plant genomes including Norway spruce Picea abies  and barley Hordeum vulgare . CLC de novo assembler, which utilizes de Bruijn graphs, was used for assembly of Illumina PE reads with the option to map reads back to contigs following previously described parameters . MP reads were also trimmed to remove low quality bases and adapter sequences. We observed very high proportion (>90%) of duplicated MP reads, presumably PCR duplicates, which were filtered using CLC. Genome assembly was performed in the following two steps: preliminary contig assembly using PE reads in CLC, followed by assembly of sequence contigs and filtered high quality MP reads using the scaffolding program SSPACE to obtain a final set of scaffolds . Genome size was estimated based on k-mer analysis and depth of sequencing .
Repetitive sequence analysis
RepeatModeler and RepeatMasker programs were used to identify repeats . Putative repetitive sequences were identified using the RepeatModeler program with default parameters. In parallel, known repetitive sequences were identified using the RepeatMasker program with the latest release of RepBase curated repeat libraries . Searches for simple sequence repeats (SSRs) were conducted using SciRoko  software with default parameters and ‘MISA’ mode.
RNA extraction and transcriptome sequencing
To enable assembly of the transcriptome of macadamia, three tissues (leaf, shoot and flower) of cultivar 741 ‘Mauka’ were selected for deep RNA sequencing (RNA-seq). Total RNA was isolated from frozen tissue using Ambion Plant RNA Isolation Aid prior to extraction using an Ambion RNAqueous Kit following manufacturer’s recommendations (ThermoFisher Scientific, Waltham, USA). Libraries were prepared with Illumina TruSeq Stranded mRNA Library Preparation Kit and PE sequenced with an Illumina HiSeq 2500 (100 x 2 cycles).
Quality control of tissue specific transcriptomic reads involved removal of low quality sequences, adapter sequences and empty reads using BBMap tools (sourceforge.net/projects/bbmap/). Retained high quality clean reads were assembled using the Trinity de novo transcriptome assembly program (version 2.0.2) with default parameters . The Trinity de novo assembly pipeline consists of three different modules Inchworm, Chrysalis and Butterfly. Inchworm assembles short reads into unique sequences of transcripts. Chrysalis clusters the Inchworm transcripts and constructs de Bruijn graphs for each cluster where each cluster represents the full transcriptional complexity for a given gene. Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and resolving transcripts that corresponds to paralogous genes. The initial transcripts were clustered using the CD-hit-est  to generate final set of transcripts, which were used as one source of evidence in the evidence-based gene model prediction pipeline.
Gene prediction and annotation
Annotation of gene models was conducted using MAKER (version 2.31.8) which is an evidence-based gene model prediction pipeline . MAKER combines the power of protein and Expressed Sequence Tag (EST) based homology with ab initio gene predictions to produce polished gene annotations. Trinity assembled transcripts were used as reference ESTs, and proteins of Nelumbo nucifera and Arabidopsis thaliana were used as reference proteins . Macadamia scaffolds were first repeat masked using RepeatMasker . To obtain the homology based genes MAKER aligned reference ESTs and proteins using Blastx  and exonerate  against the macadamia scaffolds. Ab initio gene predictions were made by Augustus  and SNAP  gene prediction programs. MAKER created the final gene set by combining the evidence based and ab initio predictions.
Functional annotation of proteins
Predicted protein coding genes were functionally annotated based on protein signatures and orthology relationships. Similarity search was performed against release (03-2015) of UniProt Swiss-Prot proteins. Functional domains, gene ontology (GO) terms, GO accessions were searched against InterPro using InterProScan software . Functional and gene ontology (GO) domains were assigned using InterProScan as described in  with default parameters. InterProScan integrates a collection of protein signature databases including BlastProDom, HMMPfam, HMMSmart, HMMTigr, ProfileScan, HAMAP, PatternScan, SuperFamily, TMHMM, HMMPanther, Gene3D and Phobius. To inform biological interpretation of macadamia gene function, KEGG (Kyoto Encyclopedia of Genes and Genomes) reference pathway database was used to map macadamia genes to defined pathways . The KASS (KEGG Automatic Annotation Server) was used to assign genes to metabolic pathways using BLASTX with an E-value cutoff of 1E-05 . Tests for annotation completeness were conducted using BUSCO  with the eukaryote 429 gene set and compared results to those of the Eucalyptus grandis and Nelumbo nucifera genomes.
Comparative genomic analysis and gene family identification
Protein sets of five plant species including core eudicots Arabidopsis thaliana, Eucalyptus grandis, Populus trichocarpa, Vitis vinifera and basal eudicot Nelumbo nucifera, were downloaded from respective public repositories. Along with the predicted macadamia proteins they were uploaded into the OrthoVenn web server for identification and comparison of orthologous clusters . To identify orthologous groups OrthoVenn employs the OrthoMCL Markov clustering algorithm, although unlike OrthoMCL it employs UBLAST for the all-against-all similarity search, which is ~350 times faster than conventional BLAST . Following clustering, orthAgogue  is used for the identification of putative orthology and inparalogy relations. To deduce the putative function of each ortholog, the first protein sequence from each cluster is searched against the non-redundant protein database UniProt  using BLASTP. Pairwise sequence similarities were determined among protein sequences of all species with a BLASTP E-value cut-off of 1E-05 and an inflation value of 1.5 for MCL. To test the quality and completeness of the gene space assembly of macadamia we identified orthologous clusters from analyses in OrthoVenn with Swiss-Prot hits to proteins reportedly involved in CG biosynthesis and activation, and conducted BLASTP searches of the macadamia candidates. In addition, reciprocal searches of protein sequences for five enzymes (CYP79, CYP71, UGT85, β-glucosidase and HNL) involved in CG biosynthesis from known cyanogenic plants were conducted with a DeCypher Tera-BLASTP search against all macadamia gene models.
Basic local alignment search tool
Benchmarking universal single-copy orthologs
Core eukaryotic genes
Bacterial elongation factor Tu receptor
Environment protection and biodiversity conservation
Expressed sequence tag.
Bacterial flagellin-sensing 2
Fragments per kilobase of transcript per million mapped reads
KEGG Automatic annotation server
Kyoto encyclopedia of genes and genomes
Long terminal repeat
Short interspersed repeat
Simple sequence repeat
We thank the Australian Macadamia Society, Macadamia Conservation Committee, and Dr Bruce Topp, Dr Craig Hardner, Ian McConachie and Jolyon Burnett in particular for valuable advice and discussions on macadamia biology, conservation and cultivation. Thank you to Gray Plantations and Kim Wilson for access to the varietal plot where the individual 741 tree sequenced during this study is located. We greatly appreciate the technical support of the Plant Science group at Southern Cross University, particularly Asuka Kawamata and Alicia Hidden. Thanks also to Dr Adam Vivian-Smith for fruitful discussion on cyanogenesis.
This work was completed as part of the Macadamia genome project, with support from Southern Cross University, the University of Queensland and HIA project MC15008 ‘Establishing an open-source platform for unravelling the genetics of Macadamia: integration of linkage and genome maps’ funded by Horticulture Innovation Australia Limited with voluntary contribution from the Knappick Foundation Pty. Ltd., Macadamia Conservation Trust, Australian Macadamia Society, Southern Cross University and funds from the Australian Government.
Availability of data and materials
The datasets supporting the conclusions of this article are available in the European Nucleotide Archive (EMBL-ENA) repository, [Study primary PRJEB13765, Assembly accession: ERP015338 http://www.ebi.ac.uk/ena/data/view/PRJEB13765, release date 21 August 2016].
CN, AB and GK conceived this project. CN and AB undertook the experimental design, data analyses and prepared the figures. AB performed the majority of the bioinformatics analyses. RH, AF, BB, GK contributed to experimental design, analyses and conceptual development of the work. CN and AB drafted the manuscript, and all authors edited and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Albert VA, Barbazuk WB, Der JP, Leebens-Mack J, Ma H, Palmer JD, Rounsley S, Sankoff D, Schuster SC, Soltis DE. The amborella genome and the evolution of flowering plants. Science. 2013;342(6165):1241089.View ArticleGoogle Scholar
- Ming R, VanBuren R, Liu Y, Yang M, Han Y, Li L-T, Zhang Q, Kim M-J, Schatz MC, Campbell M. Genome of the long-living sacred lotus (Nelumbo nucifera Gaertn.). Genome Biol. 2013;14(5):1–11.View ArticleGoogle Scholar
- Soltis DE, Smith SA, Cellinese N, Wurdack KJ, Tank DC, Brockington SF, Refulio-Rodriguez NF, Walker JB, Moore MJ, Carlsward BS. Angiosperm phylogeny: 17 genes, 640 taxa. Am J Bot. 2011;98(4):704–30.View ArticlePubMedGoogle Scholar
- Nock CJ, Baten A, King GJ. Complete chloroplast genome of Macadamia integrifolia confirms the position of the Gondwanan early-diverging eudicot family Proteaceae. BMC Genomics. 2014;15 Suppl 9:1.View ArticleGoogle Scholar
- Sauquet H, Weston PH, Anderson CL, Barker NP, Cantrill DJ, Mast AR, Savolainen V. Contrasted patterns of hyperdiversification in Mediterranean hotspots. Proc Natl Acad Sci. 2009;106(1):221–5.View ArticlePubMedGoogle Scholar
- Prunier R, Holsinger KE. Was it an explosion? Using population genetics to explore the dynamics of a recent radiation within Protea (Proteaceae L.). Mol Ecol. 2010;19(18):3968–80.View ArticlePubMedGoogle Scholar
- Byrne M, Steane DA, Joseph L, Yeates DK, Jordan GJ, Crayn D, Aplin K, Cantrill DJ, Cook LG, Crisp MD. Decline of a biome: evolution, contraction, fragmentation, extinction and invasion of the Australian mesic zone biota. J Biogeogr. 2011;38(9):1635–56.View ArticleGoogle Scholar
- Carlson JE, Holsinger KE, Prunier R. Plant responses to climate in the cape Floristic Region of South Africa. evidence for adaptive differentiation in the Proteaceae. Evolution. 2011;65(1):108–24.View ArticlePubMedGoogle Scholar
- Carpenter RJ, Macphail MK, Jordan GJ, Hill RS. Fossil evidence for open, Proteaceae-dominated heathlands and fire in the late cretaceous of Australia. Am J Bot. 2015;102(12):2092–107.View ArticlePubMedGoogle Scholar
- Carpenter RJ. Proteaceae leaf fossils: phylogeny, diversity, ecology and austral distributions. Bot Rev. 2012;78(3):261–87.View ArticleGoogle Scholar
- Lamont BB, He T. Fire-adapted Gondwanan angiosperm floras evolved in the Cretaceous. BMC Evol Biol. 2012;12(1):1.View ArticleGoogle Scholar
- Hardner CM, Peace C, Lowe AJ, Neal J, Pisanu P, Powell M, Schmidt A, Spain C, Williams K. Genetic resources and domestication of macadamia. Hortic Rev. 2009;35:1.Google Scholar
- Hardner C. Macadamia domestication in Hawai ‘i. Genet Resour Crop Evol. 2016;63(8):1411–30.
- Stace HM, Douglas AW, Sampson JF. Did ‘paleo-polyploidy’really occur in proteaceae? Aust Syst Bot. 1998;11(4):613–29.View ArticleGoogle Scholar
- Aradhya MK, Yee LK, Zee FT, Manshardt RM. Genetic variability in macadamia. Genet Resour Crop Evol. 1998;45(1):19–32.View ArticleGoogle Scholar
- Nock CJ, Elphinstone MS, Ablett G, Kawamata A, Hancock W, Hardner CM, King GJ. Whole genome shotgun sequences for microsatellite discovery and application in cultivated and wild Macadamia (Proteaceae). Appl Plant Sci. 2014;2(4):1300089.View ArticleGoogle Scholar
- Mast AR, Willis CL, Jones EH, Downs KM, Weston PH. A smaller Macadamia from a more vagile tribe: inference of phylogenetic relationships, divergence times, and diaspore evolution in Macadamia and relatives (tribe Macadamieae; Proteaceae). Am J Bot. 2008;95(7):843–70.View ArticlePubMedGoogle Scholar
- Powell M, Accad A, Shapcott A. Where they are, why they are there, and where they are going: using niche models to assess impacts of disturbance on the distribution of three endemic rare subtropical rainforest trees of Macadamia (Proteaceae) species. Aust J Bot. 2014;62(4):322–34.Google Scholar
- Weber LC, VanDerWal J, Schmidt S, McDonald WJ, Shoo LP. Patterns of rain forest plant endemism in subtropical Australia relate to stable mesic refugia and species dispersal limitations. J Biogeogr. 2014;41(2):222–38.View ArticleGoogle Scholar
- Coley PD, Barone J. Herbivory and plant defenses in tropical forests. Ann Rev Ecol Syst. 1996;27:305-35.
- Argout X, Salse J, Aury J-M, Guiltinan MJ, Droc G, Gouzy J, Allegre M, Chaparro C, Legavre T, Maximova SN. The genome of Theobroma cacao. Nat Genet. 2011;43(2):101–8.View ArticlePubMedGoogle Scholar
- Dahler J, Mcconchie C, Turnbull C. Quantification of cyanogenic glycosides in seedlings of three Macadamia (Proteaceae) species. Aust J Bot. 1995;43(6):619–28.View ArticleGoogle Scholar
- Swenson WK, Dunn JE, Conn EE. Cyanogenesis in the proteaceae. Phytochemistry. 1989;28(3):821–3.View ArticleGoogle Scholar
- Miller RE, Jensen R, Woodrow IE. Frequency of cyanogenesis in tropical rainforests of far north Queensland. Australia Ann Bot London. 2006;97(6):1017–44.View ArticleGoogle Scholar
- Jones DA. Why are so many food plants cyanogenic? Phytochemistry. 1998;47(2):155–62.View ArticlePubMedGoogle Scholar
- Edwards D, Batley J. Plant genome sequencing: applications for crop improvement. Plant Biotech J. 2010;8(1):2–9.View ArticleGoogle Scholar
- Varshney RK, Terauchi R, McCouch SR. Harvesting the promising fruits of genomics: applying genome sequencing technologies to crop breeding. PLoS Biol. 2014;12(6):e1001883.View ArticlePubMedPubMed CentralGoogle Scholar
- van Nocker S, Gardiner SE. Breeding better cultivars, faster: applications of new technologies for the rapid deployment of superior horticultural tree crops. Hortic Res. 2014;1:14022.View ArticlePubMedPubMed CentralGoogle Scholar
- Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y. The sequence and de novo assembly of the giant panda genome. Nature. 2010;463(7279):311–7.View ArticlePubMedGoogle Scholar
- Morgante M, Hanafey M, Powell W. Microsatellites are preferentially associated with non-repetitive DNA in plant genomes. Nat Genet. 2002;30(2):194–200.View ArticlePubMedGoogle Scholar
- Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007;23(9):1061–7.View ArticlePubMedGoogle Scholar
- Duarte JM, Wall PK, Edger PP, Landherr LL, Ma H, Pires JC, Leebens-Mack J. Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility across various taxonomic levels. BMC Evol Biol. 2010;10(1):1.View ArticleGoogle Scholar
- Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2.View ArticlePubMedGoogle Scholar
- Wang Y, Coleman-Derr D, Chen G, Gu YQ. OrthoVenn: a web server for genome wide comparison and annotation of orthologous clusters across multiple species. Nucleic Acids Res. 2015;43(W1):W78–84.View ArticlePubMedPubMed CentralGoogle Scholar
- Anderson CL, Bremer K, Friis EM. Dating phylogenetically basal eudicots using rbcL sequences and multiple fossil reference points. Am J Bot. 2005;92(10):1737–48.View ArticlePubMedGoogle Scholar
- Myburg AA, Grattapaglia D, Tuskan GA, Hellsten U, Hayes RD, Grimwood J, et al. The genome of Eucalyptus grandis. Nature. 2014;510(7505):356–62.PubMedGoogle Scholar
- Jordan GJ, Carpenter RJ, Koutoulis A, Price A, Brodribb TJ. Environmental adaptation in stomatal size independent of the effects of genome size. New Phytol. 2015;205(2):608–17.View ArticlePubMedGoogle Scholar
- Wilson E0. 1988. The current state of biological diversity. Biodiversity. 1988;521(1):3-18
- Bagchi R, Gallery RE, Gripenberg S, Gurr SJ, Narayan L, Addis CE, Freckleton RP, Lewis OT. Pathogens and insect herbivores drive rainforest plant diversity and composition. Nature. 2014;506(7486):85–8.View ArticlePubMedGoogle Scholar
- Jones JD, Dangl JL. The plant immune system. Nature. 2006;444(7117):323–9.View ArticlePubMedGoogle Scholar
- Edger PP, Heidel-Fischer HM, Bekaert M, Rota J, Glöckner G, Platts AE, Heckel DG, Der JP, Wafula EK, Tang M. The butterfly plant arms-race escalated by gene and genome duplications. Proc Natl Acad Sci U S A. 2015;112(27):8362–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Lehti-Shiu MD, Zou C, Hanada K, Shiu S-H. Evolutionary history and stress regulation of plant receptor-like kinase/pelle genes. Plant Physiol. 2009;150(1):12–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Afzal AJ, Wood AJ, Lightfoot DA. Plant receptor-like serine threonine kinases: roles in signaling and plant defense. Mol Plant Microbe Interact. 2008;21(5):507–17.View ArticlePubMedGoogle Scholar
- Bigeard J, Colcombet J, Hirt H. Signaling mechanisms in pattern-triggered immunity (PTI). Mol Plant. 2015;8(4):521–39.View ArticlePubMedGoogle Scholar
- Lu F, Wang H, Wang S, Jiang W, Shan C, Li B, Yang J, Zhang S, Sun W. Enhancement of innate immune system in monocot rice by transferring the dicotyledonous elongation factor Tu receptor EFR. J Integr Plant Biol. 2015;57(7):641–52.View ArticlePubMedGoogle Scholar
- Stahl EA, Dwyer G, Mauricio R, Kreitman M, Bergelson J. Dynamics of disease resistance polymorphism at the Rpm1 locus of Arabidopsis. Nature. 1999;400(6745):667–71.View ArticlePubMedGoogle Scholar
- Shen J, Araki H, Chen L, Chen J-Q, Tian D. Unique evolutionary mechanism in R-genes under the presence/absence polymorphism in Arabidopsis thaliana. Genetics. 2006;172(2):1243–50.View ArticlePubMedPubMed CentralGoogle Scholar
- Tattersall DB, Bak S, Jones PR, Olsen CE, Nielsen JK, Hansen ML, Høj PB, Møller BL. Resistance to an herbivore through engineered cyanogenic glucoside synthesis. Science. 2001;293(5536):1826–8.View ArticlePubMedGoogle Scholar
- Gleadow RM, Møller BL. Cyanogenic glycosides: synthesis, physiology, and phenotypic plasticity. Annu Rev Plant Biol. 2014;65:155–85.View ArticlePubMedGoogle Scholar
- Pičmanová M, Neilson EH, Motawia MS, Olsen CE, Agerbirk N, Gray CJ, Flitsch S, Meier S, Silvestro D, Jørgensen K. A recycling pathway for cyanogenic glycosides evidenced by the comparative metabolic profiling in three cyanogenic plant species. Biochem J. 2015;469(3):375–89.View ArticlePubMedGoogle Scholar
- Bednarek P, Osbourn A. Plant-microbe interactions: chemical diversity in plant defense. Science. 2009;324(5928):746–8.View ArticlePubMedGoogle Scholar
- Møller BL. Dynamic metabolons. Science. 2010;330(6009):1328–9.View ArticlePubMedGoogle Scholar
- Laursen T, Møller BL, Bassard J-E. Plasticity of specialized metabolism as mediated by dynamic metabolons. Trends Plant Sci. 2015;20(1):20–32.View ArticlePubMedGoogle Scholar
- Sánchez-Pérez R, Jørgensen K, Olsen CE, Dicenta F, Møller BL. Bitterness in almonds. Plant Physiol. 2008;146(3):1040–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Gleadow RM, Woodrow IE. Temporal and spatial variation in cyanogenic glycosides in Eucalyptus cladocalyx. Tree Physiol. 2000;20(9):591–8.View ArticlePubMedGoogle Scholar
- Buhrmester RA, Ebinger JE, Seigler DS. Sambunigrin and cyanogenic variability in populations of Sambucus canadensis L. (Caprifoliaceae). Biochem Syst Ecol. 2000;28(7):689–95.View ArticlePubMedGoogle Scholar
- Olsen KM, Hsu S-C, Small LL. Evidence on the molecular basis of the Ac/ac adaptive cyanogenesis polymorphism in white clover (Trifolium repens L.). Genetics. 2008;179(1):517–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Olsen KM, Kooyers NJ, Small LL. Recurrent gene deletions and the evolution of adaptive cyanogenesis polymorphisms in white clover (Trifolium repens L.). Mol Ecol. 2013;22(3):724–38.View ArticlePubMedGoogle Scholar
- Gleadow RM, Haburjak J, Dunn J, Conn M, Conn EE. Frequency and distribution of cyanogenic glycosides in Eucalyptus L’Hérit. Phytochemistry. 2008;69(9):1870–4.View ArticlePubMedGoogle Scholar
- Furtado A. DNA Extraction from vegetative tissue for next-generation sequencing. Cereal Genomics. 2014;1099:1–5.View ArticleGoogle Scholar
- Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin Y-C, Scofield DG, Vezzi F, Delhomme N, Giacomello S, Alexeyenko A. The Norway spruce genome sequence and conifer genome evolution. Nature. 2013;497(7451):579–84.View ArticlePubMedGoogle Scholar
- International Barley Genome Sequencing Consortium. A physical, genetic and functional sequence assembly of the barley genome. Nature. 2012;491(7426):711–6.Google Scholar
- Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011;27(4):578–9.View ArticlePubMedGoogle Scholar
- Smit A, Hubley R, Green P. RepeatMasker Open-3.0. 1996. http://www.repeatermasker.org.Google Scholar
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase update, a database of eukaryotic repetitive elements. Cytogen Genome Res. 2005;110(1-4):462–7.View ArticleGoogle Scholar
- Kofler R, Schlötterer C, Lelley T. SciRoKo. a new tool for whole genome microsatellite search and investigation. Bioinformatics. 2007;23(13):1683–5.View ArticlePubMedGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.View ArticlePubMedGoogle Scholar
- Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Alvarado AS, Yandell M. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18(1):188–96.View ArticlePubMedPubMed CentralGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.View ArticlePubMedGoogle Scholar
- Slater GS, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinf. 2005;6(1):31.View ArticleGoogle Scholar
- Stanke M, Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nuc Acids Res. 2005;33 Suppl 2:W465–7.View ArticleGoogle Scholar
- Korf I. Gene finding in novel genomes. BMC Bioinf. 2004;5(1):59.View ArticleGoogle Scholar
- Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R. InterProScan: protein domains identifier. Nucleic Acids Res. 2005;33 Suppl 2:W116–20.View ArticlePubMedPubMed CentralGoogle Scholar
- Schatz MC, Maron LG, Stein JC, Wences AH, Gurtowski J, Biggers E, et al. Whole genome de novo assemblies of three divergent strains of rice, Oryza sativa, document novel gene space of aus and indica. Genome Biol. 2014;15(11):506.PubMedPubMed CentralGoogle Scholar
- Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 2014;42(D1):D199–205.View ArticlePubMedGoogle Scholar
- Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 2007;35 Suppl 2:W182–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13(9):2178–89.View ArticlePubMedPubMed CentralGoogle Scholar
- Ekseth OK, Kuiper M, Mironov V. orthAgogue: an agile tool for the rapid prediction of orthology relations. Bioinf. 2014;30(5):734–6.
- UniProt Consortium. The universal protein resource (UniProt) 2009. Nucleic Acids Res. 2009;37(Database issue):D169–174.View ArticleGoogle Scholar