The transcriptome of the developing grain: a resource for understanding seed development and the molecular control of the functional and nutritional properties of wheat

Background Wheat is one of the three major cereals that have been domesticated to feed human populations. The composition of the wheat grain determines the functional properties of wheat including milling efficiency, bread making, and nutritional value. Transcriptome analysis of the developing wheat grain provides key insights into the molecular basis for grain development and quality. Results The transcriptome of 35 genotypes was analysed by RNA-Seq at two development stages (14 and 30 days-post-anthesis, dpa) corresponding to the mid stage of development (stage Z75) and the almost mature seed (stage Z85). At 14dpa, most of the transcripts were associated with the synthesis of the major seed components including storage proteins and starch. At 30dpa, a diverse range of genes were expressed at low levels with a predominance of genes associated with seed defence and stress tolerance. RNA-Seq analysis of changes in expression between 14dpa and 30dpa stages revealed 26,477 transcripts that were significantly differentially expressed at a FDR corrected p-value cut-off at ≤0.01. Functional annotation and gene ontology mapping was performed and KEGG pathway mapping allowed grouping based upon biochemical linkages. This analysis demonstrated that photosynthesis associated with the pericarp was very active at 14dpa but had ceased by 30dpa. Recently reported genes for flour yield in milling and bread quality were found to influence wheat quality largely due to expression patterns at the earlier seed development stage. Conclusions This study serves as a resource providing an overview of gene expression during wheat grain development at the early (14dpa) and late (30dpa) grain filling stages for use in studies of grain quality and nutritional value and in understanding seed biology. Electronic supplementary material The online version of this article doi: (10.1186/s12864-017-4154-z) contains supplementary material, which is available to authorized users.


Background
Human survival has relied heavily upon the food contained in the seeds of the three major cereals, rice, wheat and maize since the beginnings of agriculture [1]. Wheat (Triticum aestivum L.) feeds nearly 30% of the world population [2]. Bread wheat has a unique genetic makeup, being an allohexaploid (2n = 6× = 42) with three sub-genomes, A, B, D, and a novel seed composition due especially to the presence of the gluten proteins with characteristic features that enable the production of diverse food products [3]. The wheat caryopsis (dry seed) is comprised of three key tissues, the pericarp, embryo, and endosperm [4]. Anatomically, the largest part of the mature seed, the endosperm, consist mainly of starchy endosperm cells with an outer layer specialized layer known as the aleurone [5]. As the wheat grain matures, the viability of the starchy endosperm cells is lost while that of the aleurone cells is maintained [6]. The starchy endosperm cells contain 60-70% starch and 10-15% storage proteins and form the main food source in the seed [7].
Although the anatomy and composition of the wheat caryopsis has been well studied there are few reports on analysis of the role of gene expression in controlling grain formation. There are studies of the transcriptome using microarray [23][24][25] and NGS based methods [26][27][28]; but each study investigated only a single cultivar. In a continuation of our earlier work identifying differentially expressed genes that underlie aleurone and starchy endosperm differentiation [29]; we now report a study of genes expressed during seed development across wheat cultivars.
We report here, transcriptome profiling of the wheat caryopsis studied at two stages, at the boundary of grain enlargement and grain filing (14dpa, representing early grain filling) and at physiological maturity (30dpa, representing late grain filling) using RNA-Seq for 35 diverse cultivars ( Fig. 1) [30]. The differences in expression and functional roles of these genes were characterized [31,32] and their metabolic roles analysed [33]. In addition to explaining the role of gene expression in seed development, this study provides a platform for analysis of genetics of grain traits associated with grain yield, functional performance in food products and nutritional value in human and animal diets.

Results
Sequence reads from 35 genotypes were mapped to the wheat cDNA database of Triticum aestivum gene indices (TaGI) containing 221,925 transcripts (TaGI, release 12.0). In total, 84.9% of the reads from 14dpa were mapped (Table 1 and Additional file 1); leading to an average 70× depth of sequencing. Reads from 30dpa provided a depth of 73× with 84.3% of reads mapped (Table 1 and Additional file 1).

RNA-Seq and differential gene expression analysis
The reads from 14dpa and 30dpa stages mapped against 178,767 and 174,695 transcripts respectively with 167,239 transcripts in common leaving 11,528 and 7456 mapped transcripts unique to the 14dpa and 30dpa stages. Of the 178,767 transcripts mapped at 14dpa stage, 116,148 sequences (~65%) were functionally Fig. 1 Schematic representation of developmental stages studied during wheat grain ontogeny. 14dpa: wheat grains sampled at 14 days-post-anthesis stage for transcriptome experiment; 30dpa: wheat grains sampled at 30 days-post-anthesis stage for transcriptome experiment; D14: differentially expressed transcripts at 14dpa; D30: differentially expressed transcripts at 30dpa; U14: uniquely expressed transcripts at 14dpa; U30: uniquely expressed transcripts at 30dpa annotated. Similarly, 112,608 transcripts were functionally annotated from 174,695 mapped transcripts at 30dpa stage. Transcripts from the 14dpa and 30dpa stages were functionally annotated and classified ontologically into three major aspects, biological process, molecular function, and cellular component using functional annotation software BLAST2GOv3.2.

Functional annotation for mapped transcripts at 14dpa
The top five ontologies attributed to biological process at the 14dpa stage were; cellular component organization, cellular protein modification process, response to stress, catabolic process, and carbohydrate metabolic process. Similarly, for molecular function the top five categories were; nucleotide binding, DNA binding, kinase activity, transporter activity, and structural molecule activity. Cellular component was attributed to plastid, mitochondria, cytosol, ribosome, and plasma membrane. The complete set of biological process, molecular function, and cellular component classes were depicted in a cloud format with font size proportionately reflecting abundance at the 14dpa stage. Functional annotation analyses for the transcripts that are unique to the 14dpa stage (11,528 transcripts) revealed the top five biological process to be carbohydrate metabolic process, photosynthesis, cellular component organization, response to stress, and cellular protein modification process . In the case of molecular function, they were similar except for structural molecular activity being replaced with protein binding . Thylakoid cellular component replaces "ribosome" in the subset of transcripts that are unique to the 14dpa stage. The various biological process, molecular function and cellular component attributes that are unique to the 14dpa stage were organized in a cloud format with the font size reflecting their proportion .

Functional annotation for mapped transcripts at 30dpa
Transcripts involved in biological processes like gluconeogenesis, response to misfolded protein, water transport, Golgi organization, and photosystem II assembly dominate during the 30dpa stage. Major molecular function ontologies mapped at this stage were ATP binding, zinc ion binding, structural constituent of ribosome, GTP binding, and heme binding. Key cellular components in which these biological process and molecular function ontologies makeup the biological system in a functional manner are plasmodesma, cytosolic large ribosomal subunit, apoplast, cytosolic small ribosomal subunit, and plastoglobule. The complete list of biological process ontologies at the 30dpa stage were depicted in a cloud format. Other minor molecular function and cellular component ontologies mapped during the 30dpa stage were also presented in cloud format. Mapped ontologies during the 30dpa stage for biological process, molecular function, and cellular component that constitute less than 1 % of the total expressed transcripts during the 30dpa stage were grouped in "others" category. Ontology mapping for the transcripts that are unique to the 30dpa stage (7456 transcripts) showed the domination of an entirely different set of biological processes, cellular oxidant detoxification, glyoxylate metabolic process, hydrogen peroxide catabolic process, sucrose metabolic process, and negative regulation of endopeptidase activity. Molecular function ontologies pertaining to the unique transcript sets at the 30dpa stage exhibited a similar pattern as at the 30dpa stage except "iron ion binding" which replaces "GTP binding". Cytoplasmic, membrane-bound vesicle and transcription factor complex ontologies replace cytosolic small ribosomal subunit and plastoglobule ontologies among the top five cellular component ontologies when compared between the general set and the unique set of transcripts at 30dpa stage. A word cloud comprising all biological process, molecular function, and cellular component ontologies mapped with unique transcripts expressed at 30dpa stage was also generated.

Differentially expressed transcripts between 14 and 30dpa
Differentially expressed transcripts between the 14dpa and 30dpa stages across all the genotypes were compared. False discovery rate (FDR) corrected p-values with 0.01 as cut-off were used to distinguish the differentially expressed transcripts that are statistically significant from the rest using statistical analyses (mean and count based) to harness the advantages from both the methods. In total, 34,737 and 34,378 transcripts (Table 2) were differentially expressed statistically for mean and count based methods respectively with a FDR cut-off at 0.01. A set of 26,477 transcripts (Table 2) were common to both statistical tests and are in four groups ( Fig. 1) based on fold changes, unique transcripts at 14dpa (U14, 319); higher expression at 14dpa (D14, 16,237conversely downregulated at 30dpa); higher expression at 30dpa (D30, 9740conversely downregulated at 14dpa);  (Table 2) were annotated with one or more gene ontology terms and functions (Additional files 10, 11, 12 and 13). Pathway mapping using KEGG [33,34] from BLAST2GO [31,32] and filtering to discard nonplant related pathways yielded 52 (U14); 2791 (D14); 1121 (D30); and 11 (U30) transcript IDs ( Table 2) that were mapped to various metabolic pathways (Table 3).
Additional file 14 provides a list of pathways, differentially expressed transcript IDs linked with annotated enzymes involved in metabolic pathways. There were 70 and 71 metabolic pathways (Additional files 15, 16, 17 and 18) mapped from the transcriptome at 14 and 30dpa respectively through KEGG pathway mapping of BLAST2GO of which 68 were common. Key genes involved in different areas of metabolism that are significantly differentially expressed between 14dpa and 30dpa stages were tabulated (Table 4) and the most prominent ones are discussed below. Further details on the significantly differentially expressed genes between early and late grain filling for each of the nine areas of metabolism were documented as additional information (Additional file 19).

Discussion
Early analysis of differentially expressed genes in this data set facilitated the discovery of a gene associated with bread making quality (wbm) of wheat [35] that led us to study the global gene expression pattern across all genotypes of this study between the 14dpa and 30dpa stages. More recently, pathway mapping of this data through KEGG helped us discover a C 4 pathway of photosynthesis without Kranz anatomy in wheat grains [36]. A compilation of the differentially regulated genes between 14dpa and 30dpa stages is now presented in this article (Additional file 14; Table 4). Based on the metabolism in which the pathways were involved, the 73 pathways were grouped into nine broad categories, nucleotide metabolism; amino acid metabolism; carbohydrate metabolism; respiratory pathways; photosynthesis; lipid metabolism; hormone biosynthesis; vitamin metabolism; and specialized metabolism (Table 3; Additional file 14).
Although some metabolic pathways are highly conserved across taxa [37] like carbohydrate metabolism, respiratory pathways and nucleotide metabolism, others are highly diversified and specialized in some taxa [38]. Carbohydrate metabolic processes and photosynthesis are the dominant biological process ontologies identified at the 14dpa stage suggesting the likely importance of ear photosynthesis in determining grain yield [39,40]. Phosphohexokinase and hexokinase type IV glucokinase are the key regulatory genes involved in carbohydrate metabolism differing between early and late grain filling stages. Transaminase that converts serine to hydroxy-pyruvate leading to glycerate formation (glyoxylate and dicarboxylate metabolism) is the key differentially expressed gene during early-grain filling that is linked with the photorespiration process (Table 4; Additional files 14 and 19). This data from carbohydrate metabolism also correlates well with photosynthesis and photorespiration being tightly linked and differentially regulated during the early-  Photosynthesis is one of the crucial metabolic process that directly impacts on yield of the wheat grain. Genes involving porphyrin and chlorophyll metabolism are differentially expressed during early grain filling stage (Table 4). Genes involving C 4 photosynthesis are differentially expressed between early and late grain filling stages leading us to discover the C 4 photosynthetic pathway in the pericarp tissues of wheat grain [36]. The anatomy with which non-Kranz C 4 photosynthetic pathway being accomplished were termed as Bose anatomy [41] in honour of his discovery on the role of malic acid (4C) in photosynthesis in Hydrilla sp. that was much later known to be single-cell C4 type. The RuBisCO small subunit was also differentially expressed during different developmental stages ( Table 4; Additional file 14). Amino-acyl tRNA biosynthesis with its ligases and synthase (glutamine-hydrolysing) are the key differentially expressed genes between early and late grain filling in nucleotide metabolism (Table 4 and Additional file 19). Transaminase is the key gene being differentially regulated for various types of amino acid metabolism (Table 4 and Additional file 14). It is also the best example of a transcripts that is being differentially expressed in different metabolic pathways of amino acid metabolism (Additional files 14 and 19). Lactoperoxidase of phenylalanine metabolism under amino acid metabolism is the best example for the transcript IDs that are differentially expressed across the early and late grain filling stages (Additional files 14 and 19). Tetrahydrofolate (THF) dependent aminomethyltransferase (EC2.1.2.10) of glycine, serine and threonine metabolism is a key differentially expressed gene in D14 and D30 stages (Table 4) and involved in photorespiration process of glycine decarboxylase system (GDS) by being the Tsubunit component [42]. Similarly, P-subunit of GDS, dehydrogenase (aminomethyl-transferring; EC1.4.4.2) from glycine, serine and threonine metabolism is the key differentially expressed gene only during the D14 stage and is involved in the photorespiration process (Table 4).
Phospholipase A2 is the key differentially expressed gene that is involved in multiple pathways of lipid metabolism, glycerophospholipid metabolism, ether lipid metabolism, arachidonic acid metabolism, linoleic acid metabolism, and alpha-linolenic acid metabolism (Table 4). Except for ether lipid metabolism which is differentially expressed only during 14dpa stage, the others are differentially expressed both at 14dpa and 30dpa stages with different transcript IDs during different developmental stages (Additional file 14). Different hydrolase genes that are pathway specific and involved in fatty acid biosynthesis and fatty acid elongation pathways are differentially expressed at 14dpa stage; while ligase genes are differentially expressed at 14dpa stage and involved in fatty acid biosynthesis and degradation pathways ( Table 4).
The monooxygenase gene is the key differentially expressed transcript at 14dpa stage involved in the steroid biosynthetic pathway and categorized under hormone biosynthesis (Table 4; Additional file 14). There are other monooxygenase genes that are differentially expressed and involved in amino acid metabolism, lipid metabolism and specialized metabolism and hence caution is required while manipulating this gene. Reductase is the key differentially expressed gene and rate limiting step during late-grain filling for one carbon pool by folate pathway being categorized under Vitamin metabolism (Table 4; Additional file 14). It is involved in conversion of folate to dihydrofolate and tetrahydrofolate (THF)involved in photorespirationinvolved in glycine, serine and threonine metabolism. This suggests that silencing of reductase (EC1.5.1.3) in plants might downregulate THF formation and in turn possibly minimize photorespiration. However, THF is known to be involved as a co-factor in various other one carbon metabolic processes like purine biosynthesis, pantothenate biosynthesis, organellar protein biosynthesis, and methionine biosynthesis [43]. So, how far the downregulation of reductase could be used in selectively restricting photorespiration during the late-grain filling stages without affecting the normal physiology of the plant is unclear. Ammonia-lyase and geranyldiphosphate synthase respectively for phenylpropanoid and terpenoid backbone metabolic pathways involved in specialized metabolism are key genes that are differentially expressed at the 14dpa and 30dpa stages (

Conclusion
This study provides an atlas of transcripts expressed during early (14dpa) and late (30dpa) grain filling in wheat. Among those, the key transcripts involving various metabolic pathways that are significantly differentially expressed at grain filling stages during wheat grain development were compiled (Additional file 14). Next generation sequencing (NGS) based transcriptome analysis (RNA-Seq) in combination with functional annotation has proved to be a very robust tool helpful in discovering novel genes such as the wheat bread making gene [35] and genes controlling flour yield and entire metabolic pathways (grain specific C 4 photosynthesis) [36], improving our understanding on the complex metabolic networks at different developmental stages. The present resource will facilitate breeding selection for key genes in a metabolic pathway or a key metabolic pathway in a network involving specific developmental stages. Selection of genes expressed during grain filling identified in this dataset might be used for improving yield and grain quality. For example, a focus on silencing the genes that are involved in photorespiration, H-subunit (protein component with no independent catalytic activity) or P-subunit (Aminomethyl-transferring dehydrogenase) or T-subunit (THF dependent aminomethyltransferase) might be helpful in improving grain yield through reduced photorespiration [44]. Reducing photorespiration might be a significant contribution to achieve the goal of 20:20 Wheat® [45] to achieve the productivity of 20 t per hectare in 20 years -through increased photosynthetic efficiency.

Sample collection and RNA extraction
Tagging, sample collection and pulverization using tissuelyzer (Qiagen, US) were done as described by Furtado et al. (2015) [35]. Wheat spikes were collected at 14 and 30 days-post-anthesis (dpa) stage (Zadok's stage at Z75 and Z85 respectively for 14-and 30-dpa and snap frozen under liquid nitrogen. Developing caryopses from the spikes were selected, pulverized and stored at −80°C in a deep freezer until further processing for RNA extraction. Total RNA from 35 cultivars at both 14 and 30dpa were isolated in duplicate as per the protocol reported previously [46]. Quantity and quality of isolated total RNA were determined using 2100 Bioanalyzer (Agilent Technologies, USA).

Library preparation and NGS sequencing
Libraries for 31 samples from 14dpa and 32 samples from 30dpa stage (with 28 cultivars being common) were prepared and sequenced (100 bp paired-end reads) on an Illumina platform. Due to the lack of sufficient starting material data is not available for four cultivars, NW-93A, NW-108A, Pelada, and Vega at 14dpa, and three cultivars, Greece-25, NW-25A, and NW-51A at the 30dpa stage.

Read assembly
Paired end-reads from raw sequence data were imported into CLC genomics workbench ver7.0.4 (CLC Bio, Denmark; presently Qiagen, USA) and all further computational analyses were performed with this tool unless otherwise stated. Imported reads were trimmed with default parameters. A quality check was performed both before and after trimming with default parameters. Trimmed reads were mapped against the DFCI Triticum aestivum Gene Index (TaGI) reference cDNA database consisting of 221,925 sequences (release 12.0, The Computational Biology and Functional Genomics Laboratory, Dana Farber Cancer Institute and Harvard School of Public Health).

Transcriptome database during early and late grain-filling
The transcripts that were mapped against the TaGI reference database at early (14dpa) and late (30dpa) grain filling stages during wheat grain ontogeny were selected. From the mapped set of transcripts from the TaGI database, three sub-sets of transcript ids were created, common set for grain filling, unique set at 14dpa, and unique set at 30dpa. Functional annotation, enrichment analysis and R-fam analysis were performed through BLAST2GOv3.2 [31,32] for the three subsets independently to identify the set of metabolic pathways that are unique to early and late grain filling stages and common ones that are involved in regulating various metabolic pathways across grain filling in wheat.

Differential gene expression analysis and statistics
Sets of genes that are differentially expressed between 14dpa and 30dpa were identified through RNA-Seq experimentation. A false discovery rate (FDR) cut-off value of 0.01 was used to select the list of differentially expressed genes that were significant statistically from both mean based (Gaussian) and count-based (EDGE -Empirical analysis of Differential Gene Expression) statistics. Read expression values were normalized as RPKM (reads per kilobase per million) values and quantified in fold changes between two groups viz., 14 and 30dpa. Based on fold changes, the set of genes were grouped into four viz., U14 (uniquely expressed at 14dpa); D14 (differentially expressed in higher folds at 14dpa); D30 (differentially expressed in higher folds at 30dpa); and U30 (uniquely expressed at 30dpa) for further analyses (Fig. 1).

Functional annotation and enrichment analysis for differentially expressed genes
A set of sequences from the four groups (U14 -319; D14 -16,237; D30 -9740; U30 -181) were retrieved from the TaGI database and blastx analyses were performed in parallel for the four groups against the non-redundant protein database in CLC genomics workbench ver7.0.4. Output files were converted into Blast2GO project files through a plug-in version and exported from CLC in ".dat" format. These files were then imported into Blast2GO Pro ver3.0.10 [31,32]