De novo transcriptome assembly of drought tolerant CAM plants, Agave deserti and Agave tequilana
© Gross et al.; licensee BioMed Central Ltd. 2013
Received: 25 April 2013
Accepted: 13 August 2013
Published: 19 August 2013
Skip to main content
We’re sorry, something doesn't seem to be working properly.
Please try refreshing the page. If that doesn't work, please contact us so we can address the problem.
© Gross et al.; licensee BioMed Central Ltd. 2013
Received: 25 April 2013
Accepted: 13 August 2013
Published: 19 August 2013
Agaves are succulent monocotyledonous plants native to xeric environments of North America. Because of their adaptations to their environment, including crassulacean acid metabolism (CAM, a water-efficient form of photosynthesis), and existing technologies for ethanol production, agaves have gained attention both as potential lignocellulosic bioenergy feedstocks and models for exploring plant responses to abiotic stress. However, the lack of comprehensive Agave sequence datasets limits the scope of investigations into the molecular-genetic basis of Agave traits.
Here, we present comprehensive, high quality de novo transcriptome assemblies of two Agave species, A. tequilana and A. deserti, built from short-read RNA-seq data. Our analyses support completeness and accuracy of the de novo transcriptome assemblies, with each species having a minimum of approximately 35,000 protein-coding genes. Comparison of agave proteomes to those of additional plant species identifies biological functions of gene families displaying sequence divergence in agave species. Additionally, a focus on the transcriptomics of the A. deserti juvenile leaf confirms evolutionary conservation of monocotyledonous leaf physiology and development along the proximal-distal axis.
Our work presents a comprehensive transcriptome resource for two Agave species and provides insight into their biology and physiology. These resources are a foundation for further investigation of agave biology and their improvement for bioenergy development.
The lack of genomic and transcriptomic sequence information for agaves, succulent plants native to the arid regions of North America, limits molecular investigation of their adaptations to the abiotic stresses of xeric environments. Agaves are remarkably resistant to heat and drought stress as they employ crassulacean acid metabolism (CAM)—a water-efficient form of photosynthesis in which the uptake of CO2 into plant tissues through stomata and the fixation of CO2 into organic molecules is temporally separated . CAM plants have high water use efficiency, 4–2X more efficient in water use efficiency than plants employing C3 and C4 photosynthesis . Moreover, an increased CO2 concentration within CAM plant cells increases the efficiency of carbon fixation by Rubisco . Agaves exhibit equally important morphological adaptations to xeric environments that further increase their drought and heat resistance . Specialized leaves [4, 5], cuticles [6–8], and roots [9, 10] further protect agaves from thermal damage and prolonged drought. Agaves thus offer an opportunity to study broad-spectrum heat and drought resistance not necessarily present in all CAM plants, and provide an important model for creating applied solutions to agricultural challenges associated with climate change [1, 11]. Because of adaptations to arid environments [5, 12], agaves have also recently been proposed as a lignocellulosic bioenergy feedstock suitable for marginal land [13, 14].
Agaves have large genomes, estimated to be around 4 Gbp  with a significant amount of gene duplication due to paleopolyploidy  and a high number of repetitive elements , presenting significant challenges for genome assembly. To provide a comprehensive and accurate foundation for molecular studies of agaves, herein we present reference transcriptome datasets of A. tequilana and A. deserti, assembled from deep RNA-seq data. Cross-species comparisons demonstrate high depth and accuracy of the Agave de novo assemblies. Comparative transcriptome profiling provides insights into the molecular and physiological functions along the proximal-distal axis of the A. deserti leaf, and demonstrates broad conservation of leaf development and function across monocotyledonous plants. These reference transcriptomes provide resources for further molecular investigations of the Agave genus to enable their use as models for plant adaptations to abiotic stress, and improve agaves for applied bioenergy technologies.
Both A. tequilana and A. deserti spend the majority of their 5–10 year lifespan as vegetative rosettes (Figure 1A, 1B) before a single flowering event followed by rapid senescence . mRNA was harvested from various Agave tissues (Additional file 1: Table S1, Figure S1), and strand-specific cDNA sequencing libraries of specific insert sizes were prepared for Illumina sequencing (Methods). In total, we sequenced 978 million A. tequilana and 615 million A. deserti RNA fragments using 150 nucleotide paired-end reads (Additional file 1: Tables S2 and S3). To assess coverage of the agave transcriptomes, we plotted the frequency of observing a new unique 25-mer sequence over an increasing number of randomly sampled reads. In both data sets, the 25-mer discovery frequency decreases as sequencing depth increases, and asymptotically levels off at approximately 0.08 (8%) (Figure 1C). While complimentary datasets, such as completed genomes, will be required to conclusively determine transcriptome coverage, these observations suggest the sequencing depth was sufficient to sample the majority of sequence diversity in agave tissues. Reads from the two Agave datasets were separately assembled into contigs by the de novo transcriptome assembly pipeline Rnnotator . Resulting contigs were grouped by sequence similarity into genetic loci to account for alternative splicing and reduce redundancy in downstream analyses (Methods, Additional file 1: Table S4).
Summary of the A. tequilana and A. deserti transcriptome assemblies
No. of Agave loci
No. Agave contigs
Sum length of Agave contigs
No. of protein-coding loci
The transcriptomes of A. deserti and A. tequilana were compared using reciprocal BLAT analyses  using a minimum sequence identity threshold of 90% (Figure 2B). A significant portion of each agave transcriptome aligns to its counterpart, with 94.3% of A. deserti transcripts aligning to A. tequilana, and 88.44% of A. tequilana transcripts aligning to A. deserti. Transcripts aligning between the two Agave species also show a significant similarity in length and long regions of sequence alignment (Figure 2C).
To further substantiate the de novo Agave transcriptomes and perform comparative analyses, Agave proteomes were also clustered by OrthoMCL with the proteomes of 11 additional plant kingdom species obtained from Phytozome  (hereafter, Phytozome Tester Set, or PTS, see Methods for details). The PTS includes both monocotyledonous and dicotyledonous plants and plants exhibiting C3 or C4 photosynthesis, but no other CAM plants as no high-quality datasets are currently available in Phytozome. Between A. deserti, A. tequilana, and the 11 species within the PTS, we obtained 48,133 unique plant protein orthologous groups (hereafter, Plant OGs) from a total of 381,050 proteins .
Using OrthoMCL data, we first compared protein lengths between the inferred proteomes of the PTS and our de novo Agave assemblies to address transcript contig completeness. There are 12,346 Plant OGs shared between either A. deserti or A. tequilana and at least one member of the PTS. These 12,346 Plant OGs encompass 55,676 Agave proteins and 173,611 proteins from the 11 species in the PTS (data available online ). The median lengths of Agave and PTS proteins within each of the 12,346 Plant OGs correlate highly (Pearson r = 0.85) and overall demonstrate 1:1 correspondence in protein lengths (best-fit slope = 0.9942) (Additional file 1: Figure S3A). The median length of Agave proteins within the set is ~11% shorter than that of the PTS (Agave, 356 amino acids; PTS, 389 amino acids; Student’s t-test p-value < 0.05) (Additional file 1: Figure S3B).
To estimate Agave proteome completeness, we compared the inferred A. tequilana and A. deserti proteomes to those of 4 monocotyledonous grass species in the PTS: Brachypodium distachyon, Oryza sativa, Sorghum bicolor, and Zea mays. An Edwards-Venn diagram of Plant OGs (Figure 3C) demonstrates that 8144 of 13,203 (61.7%) of protein families common to the 4 grass species are shared with agaves despite approximately 120 million years of evolution separating these grasses (order Poales ) and agaves (order Asparagales ) [38, 39].
The 4325 Plant OGs common to both agaves but absent in four grass species (Figure 3C) represent either Agave protein families not present in grasses or protein families with enough sequence diversity to escape orthology detection by OrthoMCL. Gene ontology (GO) enrichment indicates regulatory diversity separates agaves from other monocots (Additional file 1: Table S7). Abundant transcription factor families within this set include MYB (InterPro IPR014778; 84 Plant OGs), ethylene response factor-domain (AP2/ERF-domain, IPR001471; 48 Plant OGs), C3HC4 Zinc finger (IPR018957; 44 Plant OGs), and WRKY (IPR003657; 41 Plant OGs). This agave-specific set also includes Hsp20-type heat shock proteins (IPR002068; 18 Plant OGs), suggestive of sequence divergence in these agave proteins regulating responses to heat.
Agaves may have adapted to hot, arid environments through expansion of protein families involved in abiotic stress resistance. A comparison of 41,425 OrthoMCL-defined protein families common to any member of the PTS species and either Agave species failed to discover significantly smaller or larger orthologous protein families in agaves (Wilcoxon rank sum test Benjamini-Hochberg corrected p-values > 0.05). Furthermore, no significant expansion of obvious candidate protein families, such as heat shock proteins (HSPs) , heat-shock transcription factors (HSFs) , and dehydrins  was observed. Thus, using our clustering methodologies, we found no significant expansion of gene families within Agave species suggestive of adaptation to xeric environments. However, the lack of significantly underrepresented PlantOGs supports the completeness of our de novo transcriptome assemblies.
Both wild-growing A. deserti and traditionally cultivated A. tequilana are expected to harbor significant amounts of heterozygosity. Furthermore, though both A. tequilana and A. deserti are cytological diploids , recent work indicates agaves are paleopolyploids resulting from two distinct tetraploidization events , potentially leading to the presence of highly similar paralagous loci in their genomes. While these issues can potentially complicate the de novo assembly of transcriptomes due to the consolidation of transcripts originating from distinct alleles or paralagous loci into single transcript contigs, these expected polymorphisms provide opportunities to demonstrate the utility of de novo transcriptomes to develop strategies for marker assisted breeding. We attempted to identify loci displaying evidence of combined assembly of polymorphic alleles and/or paralagous genes by mapping reads back to the reference consensus assembly and identifying single-nucleotide polymorphisms (SNPs) or insertions/deletions (indels).
Analysis identified 30,035 (33.9%) A. deserti loci and 66,701 (47.8%) A. tequilana loci as having 1 or more high-confidence polymorphism when compared to the reference Illumina de novo assembly (Additional file 1: Figure S4A). The median number of polymorphisms (SNPs or indels) per kilobase (hereafter, PPK) is significantly different between the two species, with 2.066 PPK in A. deserti and 4.39 PPK in A. tequilana (Wilcoxon rank sum test p-value < 0.05). Of loci exhibiting polymorphisms, 16,838 (56.1%) A. deserti and 34732 (52.1%) A. tequilana loci are protein-coding. In A. deserti, non-coding loci exhibit a higher median PPK than coding loci (2.9 vs. 1.6 PPK, respectively, Wilcoxon Rank Sum test p- value < 0.05) (Additional file 1: Figure S4), however this was not observed in A. tequilana (4.46 PPK coding, 4.29 PPK non-coding, Wilcoxon Rank Sum test p-value > 0.05) (Additional file 1: Figure S4B). Full datasets are available online .
In ex vivo experiments, Agave leaf cells can survive temperatures up to 64.7°C , suggesting molecular and cellular adaptations contribute to heat tolerance in a manner independent of Agave physiological and morphological adaptations. Though computational prediction of protein thermostability from primary structure alone is not completely accurate , we tested for protein adaptations to thermal stress using a streamlined version of Thermorank  (Additional file 1: Figure S5A, Methods). However, we found no signatures of global, proteome-wide thermotolerance adaptation in agaves (Additional file 1: Figure S5B). Independent tests of Agave proteins within OrthoMCL-defined PlantOGs failed to find agave proteins with significantly higher thermostability than others within Phytozome Tester Set (Wilcoxon rank sum test Benjamini-Hochberg adjusted p-values > 0.05).
Most plant transposable elements (TEs) are transcriptionally silent , they constitute a significant proportion of Agave genomes  and contribute to the creation of genetic diversity in many plants  including Agave[47, 48]. We identified TE-like sequences in agaves (Methods) (Additional file 1: Table S8), the majority of which are derived from retrotransposons (Additional file 1: Table S8). Very few TE annotations encompass entire contigs (Additional file 1: Figure S6), with only 332 contigs in A. tequilana and 171 in A. deserti entirely covered by a TE annotation (±10 nt from each end). Nearly half of all TE annotations (46.6%) encompass only the 5′ or 3′ end of transcript contigs (Additional file 1: Figure S7), suggesting that transcription initiation or termination can occur within TEs.
To examine tissue-specific differences in transcriptome profiles, we analyzed the data sets used for de novo transcriptome assembly based on their tissue of origin (Additional file 1: Table S1). As expected, transcriptome profiles differ between Agave tissues in proportion to their respective physiological functions (Additional file 1: Tables S9 and S10, Figures S8 and S9). For example, very small transcriptome differences are observed between adjacent sections of the A. deserti leaf (r = 0.98), while the largest differences are observed between roots and leaves in A. tequilana (r = 0.42) (Additional file 1: Table S9) and between the distal tip of A. deserti leaves and roots (r = 0.39) (Additional file 1: Table S10).
In A. deserti, we observed consistent higher expression of 13,961 transcripts in samples derived from folded leaves and meristematic tissues (Methods) (Additional file 1: Figure S9). GO-terms enriched within these transcripts include functions related to DNA synthesis, lipid and membrane synthesis, and targeting of proteins to cellular membranes (Additional file 1: Table S11), all activities typically enriched in actively growing cells within developing leaves and meristems. Two enriched GO terms (DNA integration (GO:0015074) and RNA-dependent DNA replication (GO:0006278)) relate to TE biological functions, and 1524 of the 13,961 transcripts (11%) are TE-like sequences. Further analysis confirms TE-like sequences generally exhibit highest levels of expression in A. deserti folded leaf and meristem tissues (Additional file 1: Figure S10), consistent with developmental relaxation of transposable element silencing (DRTS), observed in meristematic tissues of other monocotyledonous plants .
We also examined expression of several classes of developmentally-important plant transcription factors and hormones (Figure 4D). Most transcription factor families are expressed highest at the leaf base. Notable exceptions to this pattern are the Class II KNOX genes, which tend to have broad patterns of expression , and MADS-box transcription factors, which regulate diverse developmental processes . Hormone synthesis genes are also expressed in gradients along the leaf PD axis (Figure 4D) consistent with their roles in leaves [54, 55]. We observed general PD patterns for auxin, abscisic acid (ABA), brassinosteroids (BR), cytokinin (CK), gibberellin (GA), and ethylene (ETH) hormone biosynthesis (Figure 4D). Taken together, the general patterns observed along the PD axis of the A. deserti leaf mirror those seen in the monocotyledonous grass Zea mays, with transcription factors regulating developmental processes expressed mostly at the leaf base, and functions of the mature leaf, such as photosynthesis, occur more toward the distal end.
The transcriptomes of two agaves adapted to semi-arid (A. tequilana) and xeric (A. deserti) environments offer new resources in which to study CAM photosynthesis and other physiological adaptations to prolonged drought and heat. Comparisons of the Agave de novo transcriptome assemblies to other agave sequences and cross-species proteomic comparisons suggest the de novo assemblies are largely complete and accurate. However, the transcriptomes alone provide limited insight into how agaves survive in their environments. For example, though we have identified known genes central to CAM biochemistry in Agave (Additional file 1: Table S18), a full understanding of CAM biology in Agave requires studying the regulation of photosynthetic genes in response to physiological and environmental conditions [25, 56, 57]. This highlights the need to further functional understanding of Agave transcriptomes through experimentation. Our reference transcriptomes enable molecular investigations of agaves under environmentally controlled conditions to further elaborate the coordinated gene expression underlying CAM, drought resistance, and heat tolerance. As agaves are distinguished from other monocotyledonous plants by regulatory diversity (Additional file 1: Table S7), agave responses to stress may differ from other plants in novel ways.
A simple hypothesis is that agaves adapted to their environments by the expansion of gene families, and the Agave transcriptomes allow preliminary analyses of gene duplication. However, our analysis of inferred Agave proteomes and those of 11 other plant species in the PTS found no solid evidence of gene family expansion in agaves. We cannot, however, rule out the possibility of undetected gene family expansion for two reasons. Firstly, OrthoMCL, our clustering algorithm of choice, is relatively strict compared to alternative clustering algorithms , potentially leading to false negative results. Secondly, as agaves are paleopolyploids  and gene duplication events cannot be resolved cleanly without a reference genome, expansion of gene families with highly similar sequences will go undetected. More detailed studies of the extent and nature of Agave gene duplications will need to be addressed with a completed genome sequence.
Analyses of the inferred Agave proteomes by Thermorank also failed to find solid evidence of large-scale protein adaptations to thermal stress. In fact, Agave proteomes appear to be no more or less thermostable than those of other land plants (Additional file 1: Figure S5B). Interestingly, the proteome of the green algae Chlamydomonas reinhardtii showed the lowest overall thermostability of the 11 species within Phytozome Tester Set (Additional file 1: Figure S5B)—an expected result given its aquatic habitat—suggesting Thermorank can detect broad differences in protein thermostability. However, to detect more subtle differences between land plant proteomes, more robust methods using protein structure and molecular dynamic simulations  may be needed to resolve more subtle protein adaptations to thermal stress.
The de novo transcriptome assemblies are useful to develop molecular markers for further efforts in agave breeding or molecular studies, and we used our data to generate tables of polymorphic sites in the standardized VCF format . Given the species of Agave studied here are primarily outcrossing [61, 62] and more often reproduce clonally , we expected to find a large number of polymorphisms within the transcriptome. Consistent with this hypothesis, large percentages of both the A. deserti and A. tequilana loci display SNPs and indels. We also observed a significantly higher frequency of polymorphisms in A. tequilana than A. deserti, consistent with the source of the materials, as the A. deserti sequence data was generated from two sibling plants, while A. tequilana sequence data was generated from a population of individuals (Additional file 1: Table S1). The true number of heterozygous loci may be much higher as alleles not exhibiting equal expression may escape detection with our RNA-seq analyses.
Our analysis of gene expression along the proximal-distal axis of the juvenile A. deserti leaf demonstrates core classes of genes and biological processes are similar to those observed in Zea mays, supporting evolutionary conservation of monocotyledonous leaf development models . A contrast of A. deserti to Zea mays is the expression pattern of MADS-box transcription factors. MADS-box genes can vary widely in expression and function in Agave floral structures and meristems , but their role in leaf developmental or metabolic processes remains unknown. The location of auxin biosynthesis occurs at a comparatively more distal portion of the blade in A. deserti than Z. mays. These distinctions between Agave and Zea mays could be related to morphological differences between the two species: unlike maize leaves, A. deserti leaves are lanceolate-shaped with marginal spines  and have distinct parenchyma (water storing) and chlorenchyma (photosynthetic) tissues characteristic of succulent plants . Such differences in key developmental transcription factors and hormones are perhaps not unexpected as these may be major determinants of Agave morphological adaptations to xeric environments.
Our Agave transcriptomes exemplify the power of de novo transcriptome assembly from short-read RNA-seq data , which provides both a high-quality sequence resource and insights through transcriptome profiling. Leveraging annotation tools and the scientific work from model plant species facilitated insights into the biology Agave. This rapid production of comprehensive sequence resources for additional species of industrial and biotechnological interest is needed to meet challenges of climate change and bioenergy development . Our de novo Agave transcriptome assemblies provide a guide for such future de novo transcriptome assembly projects. Additional improvements in sequencing length, accuracy, cost, and throughput will make de novo transcriptome assembly an increasingly attractive option for rapid transcriptome exploration.
A. tequilana plants were collected from an A. tequilana plantation in Guanajuato, Mexico. Leaf, root, and stem tissue was collected from 2 different adult plants, each approximately 4 years of age. Juvenile plants from the same field, each approximately 1 year old, were dissected. Equal weights of juvenile roots, leaves, and stems were pooled prior to RNA preparation. A. deserti juveniles were obtained from a local commercial provider (Berkeley, CA) and verified using morphological keys . Plants and tissues were dissected as described (Additional file 1: Table S1, Figure S1). A. deserti tissues were collected from well-watered plants near mid-day.
Agave tequilana RNA was extracted from tissues as described previously . A. deserti RNA was prepared with modifications as follows. Tissues were finely sliced and immediately frozen in liquid nitrogen. Approximately 3 g of plant tissue were finely ground using liquid nitrogen in a mortar and pestle. 7.5 ml Trizol (Invitrogen, Carlsbad, CA) and 1.5 ml chloroform was added and tissue was homogenized. Homegenate was incubated for 10 min at room temperature and centrifuged 4000 g. Aqueous phase was removed, mixed with 1 volume of a 1:1 phenol:chloroform, and centrifuged at 4000 g. Resulting aqueous phase was mixed with an equal volume of isopropanol and 1/10 volume 5 M NaCl. RNA was precipitated at −20°C overnight, then centrifuged at 10,000 g. RNA pellet was suspended in 500 μl RNAse-free H2O. Phenol:Chloroform extraction was repeated as above. Aqueous phase was mixed with 0.6 volumes of 7 M LiCl and incubated at −20°C for 40 minutes prior to centrifugation at full speed in a table top microcentrifuge. RNA pellet was rinsed in 70% ethanol, air-dried briefly, and suspended in 250 μl H2O.
Each library construction was initiated with 25 μg total RNA. Polyadenylated RNA was selected using the μMACS mRNA isolation kit (Miltenyi Biotec, Auburn, CA) and repeated as necessary until rRNA constituted less than 5% of the remaining purified mRNA before hydrolysis into 250 and 500 nt fragments using RNA Fragmentation Reagents (Ambion, Austin TX). First strand cDNA synthesis was performed using SuperScript II Reverse Transcriptase (Invitrogen, Carlsbad, CA) primed with 3 μg random hexamers and 2.5 mM dNTPs. 2nd strand cDNA was prepared in a 100 μl reaction with 2U RNAseH, 40U DNA Pol I, and 10U DNA ligase with 0.3 mM of each dNTP (with dUTP in place of dTTP). Samples were incubated for 2 hours at 16°C. Ten units T4 DNA polymerase was added and incubation was continued for an additional 5 minutes. 2nd strand cDNAs were size selected to either 250 or 500 bp using the Pippin Prep system (Sage Science, Beverly, MA). Indexed TruSeq libraries were constructed using manufacturer directions (Illumina, San Diego, CA). dUTP-labeled strands were destroyed using AmpErase uracil N-glycosylase (Applied Biosystems, Foster City, CA). Libraries were amplified through 10 cycles of PCR using Illumina guidelines. Sequencing was performed at the DOE Joint Genome Institute on an Illumina HiSeq 2000 with TruSeq SBS-v3 reagents (Illumina).
Libraries for Pacific Biosciences single molecule real time (SMRT) sequencing were prepared from A. tequilana 2nd strand cDNAs (see above). Library were constructed according to manufacturers’ guidelines (Pacific Biosciences, Menlo Park, CA) and sequenced on 5 SMRT cells for a total of 751,460 reads. Reads were filtered to remove library artifacts, resulting in 9913 read sequences composed of 27,787 subreads. Filtered Pacific Biosciences subread sequences ≥ 300 nt were corrected with 114,901,038 A. tequilana Illumina reads using methods described previously . From this, 4767 successfully corrected, high quality PacBio subreads were returned. To compare the Pacific Biosciences sequencing to the Illumina de novo assembly, corrected PacBio subreads were aligned to the Illumina A. tequilana assembly using BLAT  with a minimum threshold of 90% sequence identity.
de novo transcriptome assembly of Illumina sequence was performed by Rnnotator . Transcript contigs were binned into loci based on a minimum of 200 bp sequence overlap as determined by an all-vs-all comparison using Vmatch . Following assembly, transcripts were assigned an RPKM  value based on the number of uniquely mapping reads aligning to each transcript using BWA . Each transcript version per locus was numbered according to its relative abundance for that locus (with version 1 being the most abundant). Transcripts present at less than 10% of the version 1 transcript were noted as potential precursor transcripts. Transcripts are named by their respective locus, version (isoform) number, raw RPKM and precursor flag; e.g. Locus1v2rpkm3.45_PRE is the 2nd most abundant isoform of Locus 1 with an RPKM of 3.45, marked as a precursor transcript. Additional details about the design and operation of Rnnotator can be found online .
MEGAN v4.621 build 27  was used to identify de novo assembled transcript contigs with homology to plant sequences and filtered RepeatMasker v. open-3.2.9  and DeconSeq v 0.4.1  with a contaminating match equivalent to > = 94% identity over 90% of the contig length. Sequences unidentified by MEGAN, Deconseq, or Repeatmasker were either retained or removed from the Agave datasets using their abundance (measured in RPKM) assuming most RNA in the sample originates from agave. Thresholds were defined by the lower quartile RPKM of high-confidence plant contigs. Contigs meeting or exceeding this RPKM were retained within the agave datasets (RPKM > = 0.42 for A. deserti, RPKM > = 1.2 for A. tequilana).
Open reading frames were annotated using EMBOSS getorf  with a maximum length of 1 × 106 and a minimum length of 30 amino acids. Working-set proteomes include only proteins encoded on the + strand of v1 (most abundant) transcript isoforms, where each protein must be at least 76 aa in length with a CDS encompassing > = 50% of transcript length. Minimum protein lengths of 76 aa represents the 5th percentile of protein lengths within the Phytozome Tester Set (below). Pfam, Interpro, and GO annotation was performed using InterProScan . KEGG annotation  was retrieved using KAAS . TEs were identified using RepeatMasker (version open-3.2.9)  with RepBase Update 2009-06-04 .
The Phytozome Tester Set (PTS) includes select proteomes from Phytozome v8 : Arabidopsis thaliana (TAIR release 10), Brachypodium distachyon (JGI v1.0 8X assembly of Bd21 and MIPS/JGI v1.2 annotation), Chlamydomonas reinhardtii (Augustus u10.2 annotation of JGI assembly v4), Glycine max (JGI Glyma1.0 annotation of Glyma1 assembly), Medicago truncatula (Medicago Genome Sequence Consortium release Mt 3.0), Oryza sativa japonica (MSU Release 7.0), Populus trichocarpa (JGI release v2.0, annotation v2.2), Ricinus communis (TIGR release 0.1), Setaria italica (JGI 8.3X chromosome-scale assembly release 2.0, annotation version 2.1), Sorghum bicolor (Sbi1.4 models from MIPS/PASA on v1.0 assembly), and Zea mays (Maize Genome Project 5b.60 B73). Proteins were binned into orthologous groups (Plant OGs) by OrthoMCL v2.0.3  using default settings.
All paired-end reads from A. deserti libraries (1,231,372,300 reads) and randomly selected A. tequilana reads (993,931,796 reads) were used to detect SNPs and indels (polymorphisms). A. deserti reads were aligned to the A. deserti v1 transcript contigs and A. tequilana reads were aligned to the A. tequilana v1 transcript contigs using BWA . Polymorphisms within the v1 transcripts, serving as a proxy for a genomic locus, were called using SAMtools mpileup . Based on the quality value distribution of SNPs and indels (Additional file 1: Figure S4A), only those with a quality score of 999 were considered for further analysis, minimizing low-confidence polymorphism calls from poor sequence quality or low-coverage.
Where K is the molar fraction of lysine, E is the molar fraction of glutamic acid, Pos is the molar fraction of positively charged amino acids (R, H and K), Chg is the molar fraction of charged amino acids (D, E, H, K, and R), Sml is the molar fraction of ‘small’ amino acids (A, C, D, G, N, P, S, T, and V), Tiny is the molar fraction of ‘tiny’ amino acids (A, C, G, S, and T), T is the molar fraction of threonine. ASA is calculated as follows: The residue surface accessible area for each amino acid residue (R) in a hypothetical Gly-R-Gly tripeptide was indexed to data obtained by Chothia . The sum surface area for the peptide is divided by the number of amino acids to obtain an average residue surface area. The average surface area (possible minimum of 75 square angstroms (Å2) and maximum of 255 Å2) is divided by 180 Å2 (the range between 75 Å2 and 255 Å2) to create a dimensionless value between 0 and 1. A test of 5000 artificial peptide sequences of random length and amino acid composition found high correlation between our thermostability score generator and Thermorank (Pearson r = 0.873, Additional file 1: Figure S5A).
Detection of Agave OrthMCL-defined PlantOG family memberships, and Thermorank scores were determined by performing a Wilcoxon rank sum text against data obtained from the Phytozome Tester Set. Prior to analysis of PlantOG membership, protein identifiers from agaves and the Phytozome Tester Set were parsed to select non-redundant representative proteomes with a single version 1 (or similarly-labeled) representative protein model per locus. Final p-values were corrected for multiple comparisons by the Benjamini-Hochberg procedure .
Contigs containing rRNA-like sequences as determined by BLASTN  (E-value ≤ 10) against the SILVA v108 database  were removed from reference transcriptomes prior to expression analyses. Reads were trimmed to 36 nt and mapped to reference transcriptome using BWA . The number of reads uniquely aligning to each transcript was normalized by the total number of uniquely-aligning reads in the sample, divided by the length of the uniquely mappable portion of each transcript to obtain an RPKM value . Q-values were obtained as described . Z-scaled locus RPKM values were grouped by K-means clustering, 6 clusters were chosen based on the ‘least within group sum of squares’ method . All enrichment analysis was performed using BiNGO  with default settings (hypergeometric test with Benjamini-Hochberg p-value correction ).
Reads are available through the NCBI Sequence Read Archive (SRA), study accessions [GenBank:SRP019885] (A. tequilana) and [GenBank:SRP019506] (A. deserti). Agave transcriptome assembly contigs meeting NCBI requirements are deposited at the Transcriptome Shotgun Assembly (TSA) accessions A. tequilana: [GenBank:GAHU00000000]; A. deserti: [GenBank:GAHT00000000]. Full sequence assemblies, annotations, OrthoMCL clustering, and expression data for both agave datasets as described are available at the Dryad Digital Repository .
Crassulacean acid metabolism
Developmental relaxation of transposable element silencing
Reciprocal best hit
Reads per kilobase of exon model per million mapped reads
The authors would like to thank Gerald Tuskan, Xiaohan Yang, Joel Martin, Wendy Schackwitz, Erika Lindquist, Feng Chen, Chia-Lin Wei, Cindy Choi, Natasha Zvenigorodsky, Dongwan Kang, Crystal Wright, Devin Coleman-Derr, Stephen Fairclough, Nicole Johnson, and Gretchen North for technical assistance and comments; and Michael McKain for data access . Funding was provided by Lawrence Berkeley National Laboratory Directed Research and Development Program (LB11036). The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.