Identification of two putative reference genes from grapevine suitable for gene expression analysis in berry and related tissues derived from RNA-Seq data

Background: Data normalization is a key step in gene expression analysis by qPCR. Endogenous control genes are used to estimate variations and experimental errors occurring during sample preparation and expression measurements. However, the transcription level of the most commonly used reference genes can vary considerably in samples obtained from different individuals, tissues and developmental stages, and under variable physiological conditions, resulting in a misinterpretation of the performance of the target gene(s). This issue has scarcely been addressed in woody species such as grapevine.
Results: A statistical criterion was applied to select a sub-set of 19 candidate reference genes from a total of 242 non-differentially expressed (NDE) genes derived from an RNA-Seq experiment comprising ca. 500 million reads obtained from 14 table-grape genotypes sampled at four phenological stages. From the 19 candidate reference genes, VvAIG1 (AvrRpt2-induced gene) and VvTCPB (T-complex 1 beta-like protein) were found to be the most stable ones after comparing the complete set of genotypes and phenological stages studied. This result was further validated by qPCR and geNorm analyses.
Conclusions: Based on the evidence presented in this work, we propose to use the grapevine genes VvAIG1 or VvTCPB, or both, as a reference tool to normalize RNA expression in qPCR assays or other quantitative methods intended to measure gene expression in berries and other tissues of this fruit crop, sampled at different developmental stages and physiological conditions.


Background
Quantitative real-time PCR (qPCR) is generally used for measuring transcript abundance due to its high sensitivity, specificity and broad quantification range, which allow high-throughput and accurate expression profiling of selected genes [1]. qPCR analysis has also become the most common method for verifying microarray and RNA-Seq results [2][3][4]. Although it is a powerful technique, qPCR has certain disadvantages, such as the difficulties associated with inappropriate data normalization, one of the most important aspects to resolve [5] when adapting the technique to the study of a new organism, organ or tissue. Data normalization is a key stage for controlling artifacts and experimental error arising during sample preparation and the subsequent experimental steps, through to data analysis. It has been shown that qPCR results are highly dependent on the reference genes chosen [6], which explains the considerable effort devoted to validating the gene(s) selected for normalization prior to extensive experimentation [7]. These housekeeping genes should not vary in their expression level across the tissues or cells under investigation, nor in response to any experimental treatment [8].
Regardless of the experimental technique employed, appropriate normalization is essential for obtaining accurate and reliable quantifications of gene expression levels, especially when measuring small expression differences or when working with tissues of different histological origin [9]. The purpose of normalization is to correct variability associated with the various steps of the experimental procedure, such as differences in initial sample amount, RNA extraction recovery and integrity, efficiency of cDNA synthesis and differences in the overall transcriptional activity of the tissues or cells analyzed [10]. Among the numerous normalization approaches that have been proposed [11,12], the use of internal controls or reference genes has become the method of preference [13,14], because they potentially account for all of the sources of variability mentioned above. However, numerous studies have reported that the transcript quantity of the most commonly used reference genes can vary considerably under different developmental, physiological and experimental conditions [11,15-23]. Several reference genes are commonly used, such as elongation factor [24,25], actin [26,27], ubiquitin [28,29], and ribosomal units (18S or 28S rRNA) [30][31][32]. However, several reports have demonstrated that transcript levels of these genes also vary considerably under different experimental conditions, and consequently their suitability for gene expression studies must be evaluated case by case [22,33,34]. This implies that a reference gene with stable expression in one organism may not be suitable for normalization of gene expression in another organism [35,36], or even in different experiments on the same species.
Many works have been carried out on animal models and in relation to human health [37,38], fields in which multiple reference genes for normalization of qPCR data have been described. However, similar reports are less abundant in plants [10,35,39]. Czechowski et al. [22] employed a new strategy for the identification of reference genes in Arabidopsis thaliana, based on the microarray data of Affymetrix (ATH1), and several new reference genes were revealed [40]. This list of Arabidopsis reference genes was successfully employed to search for reference genes by sequence homology in unrelated species such as Vitis vinifera [7]. This approach resulted in a strategy that is based on the parallel use of a series of control genes and calculation of normalization factors using statistical algorithms [8,11,41]. It is necessary to validate the expression stability of a candidate control gene in each experimental system prior to its use for normalization. In this regard, several free software applications such as geNorm [8], NormFinder [42] or qBase [43] are used in order to identify the best internal controls from a group of candidate normalization genes in a given set of biological samples.
To our knowledge, no investigations have yet been carried out for the identification of reference genes in table grape, one of the most important temperate fruit crops. In this work we used a data set obtained from a large RNA-Seq experiment on phenotypically and genetically diverse table grape segregants from a 'Ruby Seedless' x 'Sultanina' crossing, sampled at three phenological stages: anthesis, fruit-setting and berries of 6-8 mm diameter (the last one from plants treated or not with gibberellic acid). We focused the search for control genes on evaluating the variability (or stability) in the expression of 19 genes selected from an initial set of 242 genes that met a threshold stability level across the four developmental and physiological conditions. Two new reference genes, VvAIG1 (AvrRpt2-induced gene) and VvTCPB (T-complex 1 beta-like protein), were validated by qPCR and geNorm analyses and are presented as new housekeeping genes for table grape.

Results and discussion
Identification of putative reference genes
Usually, the search for reference genes in a plant species is based on the identification of orthologs of genes stably expressed in model plants, mainly Arabidopsis thaliana [44,45]. In this case, we used our own data, obtained from a massive sequencing assay of 47 samples of the species of interest, i.e., table grape (Vitis vinifera L.). This set of samples corresponds to 14 genotypes from which RNA was collected, combining different flower and berry developmental stages and treatments (see Methods section). Although the main outcome of an RNA-Seq experiment is usually the identification of differentially expressed genes, in this case the same data set was used to search for putative reference genes, defined as any gene showing minimal variation in expression level across all samples analyzed. Based on this criterion, a total of 242 candidate housekeeping genes were identified using the bioinformatics workflow presented in Figure 1. These genes are involved in different biological processes (data not shown), such as synthesis, degradation, folding, defense, stress and catabolism of proteins and metabolites. As this number of genes is too large to evaluate each one for transcriptional stability, we selected a subset according to the statistical criteria described in the next section. To this end, we ranked the list of 242 genes according to their coefficient of variation (CV), even though no direct relation was observed between CV and total reads (Additional file 1: Table S1).

Selection of a sub-set of 19 candidate reference genes
Different approaches, such as the Poisson, quasi-Poisson and negative binomial distributions, have been used to represent the statistical distribution of sequence data [46][47][48]. Under these three kinds of distribution, the mean of the read counts is highly related to the variance [47,48]. A summary of the three statistical parameters used in this work to characterize the NDE genes is shown in Table 1. The mean and the variance were high and positively related, while the mean was not related to the CV (Table 2). Therefore, the CV, together with the mean, was used to select those NDE genes that behaved as housekeeping genes and had both a low coefficient of variation and a high abundance across the 47 samples analyzed. The data (Table 1) showed a high estimated coefficient of variation (CV > 40%, with a range of ~25%). This variation could probably be due to intrinsic variation within the biological samples (phenotype, phenological stage and gibberellic acid treatment) or to sampling error, because the sources of variation are considered during the selection of genes as differentially expressed by the edgeR package [46]. Only a few genes (~8% of total NDE genes) had mean and CV values large enough to rule out sampling errors. This among-samples variation could be explained by the fact that the genotypic effect was not taken into account for the selection of NDE genes. In this study, each of the 14 genotypes used could be interacting differentially with the other factors or conditions (phenotype, phenological stage and gibberellic acid treatment).
Because of the difficulty of finding genes that simultaneously possess a high expression level (number of reads) and a low CV, we used the coefficient of variation, which is not related to the mean and is easier to interpret (Table 2). This parameter has been used previously in other experiments [49,50]. The thresholds estimated by the simulation for CV and μ are listed in Table 3. As the threshold became more stringent, fewer genes satisfied both selection criteria: CV < percentile threshold and μ > percentile threshold (Table 3). Only 19 of the 242 NDE genes satisfied both criteria at the 97.5% and 2.5% percentiles for μ and CV, respectively (Table 4).

Primers design and analysis of the variability from threshold cycles value
Primer pairs for qPCR were designed and subsequently evaluated on table grape cDNA. For 17 out of the 19 primer pairs designed, a single amplicon was observed by electrophoretic separation; each amplicon was sequenced to confirm the primer specificity. The primers for VvADH7 and VvSLP had to be excluded from the study as they produced two amplicons under the tested PCR conditions. All primers were designed with the following criteria: 20-24 bp length, GC content between 50% and 65%, product size in the range of 91-268 bp and melting temperature between 60°C and 64°C (Table 5). Melting curve analyses of the 17 genes showed a single peak in each case, confirming that the primers amplified a single product (data not shown). Except for VvUNP3 (129%) and VvADF2 (114%), all PCRs displayed amplification efficiencies between 83% and 110% (Additional file 2: Table S2).
As a first approach, we compared the expression levels of the reference genes across all 47 samples using the absolute Ct value. Analysis of the raw expression levels across all samples detected some variation among reference genes. The results (Additional file 2: Table S2 and Figure 2) revealed that all genes presented median Ct values between 18.5 and 24.8 and that the CV was < 7% for all the reference genes (Additional file 2: Table S2), among which VvAIG1 and VvTCPB presented the lowest CVs, 3.6% and 3.9%, respectively.
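The Ct-based comparison described above can be illustrated with a minimal Python sketch. It assumes the CV is computed as the sample standard deviation divided by the mean of the Ct values (the source does not state which estimator was used); the gene names and Ct values below are hypothetical and are not data from this study.

```python
from statistics import mean, median, stdev


def ct_summary(ct_values):
    """Return (median Ct, CV in percent) for one gene across samples.

    Assumption: CV = sample standard deviation / mean, as a percentage.
    """
    cv_percent = 100.0 * stdev(ct_values) / mean(ct_values)
    return median(ct_values), cv_percent


# Hypothetical Ct values for illustration only (not data from the study).
example_ct = {
    "geneA": [20.1, 20.6, 19.8, 20.3, 20.9],
    "geneB": [22.0, 24.5, 21.2, 25.1, 23.3],
}

summary = {gene: ct_summary(values) for gene, values in example_ct.items()}

# Rank genes from the lowest to the highest CV of raw Ct values.
ranked = sorted(summary, key=lambda gene: summary[gene][1])
```

A low CV of raw Ct values is only a first screen: it does not correct for differences in template input, which is why a pairwise method such as geNorm is applied afterwards.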

Expression analysis of reference genes for qPCR
Using quantitative real-time PCR, we studied the expression of 12 of the 19 candidate reference genes in cDNA samples of table grape genotypes from different phenological stages. Most of the genes showed a similar expression pattern across the samples under study, i.e., lower expression at the anthesis and fruit-setting stages and slightly higher expression at the 6-8 mm berry size stage (Figure 3, C-L). Other genes, such as VvAIG1 and VvTCPB, did not show significant differences in their expression across the different phenological stages and samples (segregants) studied (Figure 3, A and B). As a control, we included three genes studied by Reid et al. [7], VvUBQ10, VvPIP2B and VvEF1-α, which presented an expression profile similar to that of the set of putative reference genes, with appreciable differences between phenological stages (Figure 3, M-O). Interestingly, these three control genes, commonly used in gene expression studies in grapevine, exhibited very "unstable", non-uniform or too low expression levels, so they were not included in the list of 242 genes initially selected; consequently, they are not recommended for use as reference genes in table grapes.

Validation of candidate reference genes
For the validation of VvAIG1 and VvTCPB as reference genes, we also studied their expression profile in more advanced phenological stages (pre-veraison and post-veraison), using cv. Sultanina as a model. Studies have reported that the expression of housekeeping genes can also vary considerably under particular experimental conditions, as observed in Figure 4. VvAIG1 and VvTCPB did not present significant differences in their expression at the growth stages evaluated (Figures 5A and 5B, respectively). Similar results for these two genes were observed in cvs. Red Globe, Crimson Seedless and Muscat of Alexandria, a set of genotypes representing to some extent the genetic diversity of table grapes [51]. In addition, we evaluated the performance of these two reference genes in leaves, with results similar to those in berries (data not shown).
To complement this, we used the geNorm algorithm to determine the most stable reference genes, on the assumption that two ideal reference genes should not vary relative to each other across the tested conditions. This algorithm calculates the average pairwise variation of a given candidate reference gene with all other genes under evaluation and assigns a measure of its expression stability (M), based on which a ranking of candidate reference genes is produced [8]. The geNorm software has been cited by many authors in relation to the identification or behavior of reference genes because of its ease of use, robustness, reliability and convenience; it is currently included in qRT-PCR analyses in animals, yeasts and bacteria, but rarely in plants [52]. Our geNorm results were consistent with these two genes being very stable in expression across the analyzed samples.
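As an illustration of the stability measure M described above, the following Python sketch computes, for each gene, the mean over all other genes of the standard deviation of the log2-transformed expression ratios across samples. It is a simplified single pass and not the geNorm software itself: the published tool additionally removes the least stable gene stepwise and recomputes M. The function name and the toy quantities are assumptions made for illustration.

```python
from math import log2
from statistics import mean, stdev


def genorm_m(quantities):
    """Compute a geNorm-style stability value M for each gene.

    quantities: dict mapping gene -> list of relative quantities,
    with all lists in the same sample order.
    For gene j, M_j is the mean over all other genes k of the standard
    deviation (across samples) of log2(q_j / q_k). Lower M = more stable.
    """
    genes = list(quantities)
    m_values = {}
    for gene in genes:
        pairwise_sd = []
        for other in genes:
            if other == gene:
                continue
            log_ratios = [
                log2(a / b)
                for a, b in zip(quantities[gene], quantities[other])
            ]
            pairwise_sd.append(stdev(log_ratios))
        m_values[gene] = mean(pairwise_sd)
    return m_values


# Hypothetical relative quantities (not data from the study).
# geneB is always twice geneA, so their ratio is constant across samples.
toy = {
    "geneA": [1.0, 1.1, 0.9, 1.0],
    "geneB": [2.0, 2.2, 1.8, 2.0],
    "geneC": [1.0, 3.0, 0.5, 2.0],
}

m = genorm_m(toy)
most_stable_first = sorted(m, key=m.get)
```

In this toy example geneA and geneB receive the same, lower M, because their ratio never changes, while geneC, which fluctuates independently, receives the highest M.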
For anthesis, the two most stable genes were VvAIG1 and VvCCRP (Figure 4A); for fruit-setting, these were VvUNP and VvCCRP (Figure 4B); and for 6-8 mm berries, the most stable genes were VvUNP and VvAIG1 (Figure 4C). Other genes considered in this work (EF, PP2A and UBQ10) have been studied in other plant species such as soybean [53] and Gossypium hirsutum [23], showing high variability in their expression profile depending on the physiological condition, tissue and genotype.
In summary, the most stable reference genes for all samples studied (different genotypes evaluated at different phenological stages) were VvTCPB and VvAIG1 (Figure 4D). These results demonstrate that our approach allowed us to obtain a set of genes that could be used as reference genes in qPCR experiments. This is similar to the result obtained by Coito et al. [40], who demonstrated the accuracy of choosing a combination of grapevine reference genes for qPCR, in that case through a microarray analysis.

Conclusions
This work is the first study to show that a data set derived from massive RNA sequencing of several individuals and phenotypic conditions can be used for the identification of housekeeping genes in a non-model plant species such as grapevine. The genes VvTCPB and VvAIG1, never cited before as possible reference genes in this or other woody species, were the most stable genes in all samples studied. Therefore, these genes are proposed as reference genes for qPCR assays in table grape berries at different developmental stages and physiological conditions.

Plant material
Twelve table grape segregants from a 'Ruby Seedless' x 'Sultanina' crossing, with contrasting and extreme phenotypes with respect to seed content and berry size, plus both parents, were used in the RNA-Seq experiments (Muñoz et al., manuscript in preparation). For RNA-Seq analyses, a number of whole berries from each condition (for a list of samples, phenological stages, etc., see Additional file 3: Table S3) were frozen in liquid nitrogen and homogenized, and their RNA was converted to cDNA and sequenced, yielding ca. 500 million reads from 47 sequenced samples.
For the qPCR validation of the 19 candidate reference genes, two genotypes from the same crossing collected at three phenological stages (anthesis, fruit-setting and 6-8 mm berries, treated or not with gibberellic acid) were used. We also included samples of 'Sultanina' collected at more advanced phenological stages (pre-veraison and post-veraison). The vines, established at La Platina Experimental Station of the 'Instituto de Investigaciones Agropecuarias', located in Santiago, Chile, were maintained under a standard management program for watering, fertilization, pest and disease control and pruning. After harvest, every sample was immediately frozen in liquid nitrogen and stored at −80°C until use.

Public data used
The reference grape genome (12X) and the gene annotation were downloaded from the GENOSCOPE database (http://www.genoscope.cns.fr/externe/GenomeBrowser/Vitis/). The reference genome contains a total of 26,346 annotated transcripts with an average size of 1,137 base pairs.

Identification of candidate reference genes
Figure 3 qPCR expression values for candidate reference genes in grapevine samples. Two segregants from the Ruby x Sultanina crossing (112 and 19) in three phenological stages (anthesis, fruit-setting and 6-8 mm berries) treated or not with gibberellic acid (GA3) were used. These segregants represent extreme phenotypes for berry size. For relative expression the genes were normalized with the lowest expression gene. A, AIG1 (VvAIG1); B, T-complex protein 1 subunit beta (VvTCPB); C, vacuolar sorting-associated protein 4 (VvSAP4); D, 26S proteasome non-ATPase regulatory subunit 13 (VvPRN26S); E, carbon catabolite repressor protein 4 homolog 2 (VvCCRP); F, unknown protein function (VvUNP2); G, unknown protein function (VvUNP); H, unknown protein function (VvUNP3); I, Rab GDP dissociation inhibitor alpha (VvRABI); J, proactivator polypeptide-like 1 (VvPP1); K, actin-depolymerizing factor 2 (VvADF2); L, 26S protease regulatory subunit 4 homolog (VvPR26S). Other putative housekeeping genes reported and used in many works are the following: M, polyubiquitin (VvUBQ10, GenBank acc CB977307); N, plasma membrane intrinsic protein 2B (VvPIP2B, GenBank acc EC969993); and O, elongation factor 1-alpha (VvEF1-α, GenBank acc CB977561). Bars in the graphs correspond to standard error (SE) from three biological samples, assayed in duplicate. Different letters represent significant differences at P < 0.05 by LSD test.
To build the RNA-Seq database, a total of 491 million reads were generated on a Genome Analyzer II from Illumina (IGA, Udine, Italy). After quality trimming, 477 million reads were kept, and 91% of them were mapped to the reference grape genome with the TOPHAT [54] program. The differential expression test on seventy comparisons was implemented in the edgeR [46] software.
Then, using in-house scripts, we searched for genes that were classified as non-differentially expressed, presented at least 100 reads in each sample/condition and showed a low variation index among conditions. All these steps were executed as a bash pipeline (Figure 1).
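The in-house scripts are not reproduced here; the filtering step described in this section could be sketched in Python along the following lines. The function name, the data layout and the `max_cv` cutoff are hypothetical: the source specifies the 100-read minimum but not the numerical value of the "low variation index", so the cutoff is left as a parameter.

```python
from statistics import mean, stdev


def candidate_reference_genes(counts, nde_genes, min_reads=100, max_cv=None):
    """Filter genes that could serve as candidate reference genes.

    counts: dict mapping gene id -> list of read counts, one per sample.
    nde_genes: set of gene ids classified as non-differentially expressed.
    min_reads: minimum read count required in every sample (100 in the text).
    max_cv: optional upper limit on the coefficient of variation
            (hypothetical parameter; the source gives no numerical value).
    Returns a list of (gene, CV) tuples sorted by increasing CV.
    """
    selected = []
    for gene, reads in counts.items():
        if gene not in nde_genes:
            continue  # only non-differentially expressed genes are kept
        if min(reads) < min_reads:
            continue  # too few reads in at least one sample
        cv = stdev(reads) / mean(reads)
        if max_cv is not None and cv > max_cv:
            continue  # expression too variable among conditions
        selected.append((gene, cv))
    return sorted(selected, key=lambda item: item[1])


# Hypothetical read counts (not data from the study).
toy_counts = {
    "g1": [500, 520, 480, 510],    # NDE, abundant, stable
    "g2": [90, 300, 250, 400],     # NDE, but below 100 reads in one sample
    "g3": [1000, 1100, 950, 1020], # stable, but not classified as NDE
    "g4": [150, 900, 300, 1200],   # NDE, abundant, but highly variable
}
toy_nde = {"g1", "g2", "g4"}

result = candidate_reference_genes(toy_counts, toy_nde, max_cv=0.2)
```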

Derivation of the statistical test for the selection of reference genes
As a first approximation to identifying the reference genes, the mean of the read counts and the coefficient of variation (CV = standard deviation/mean) among the 47 different conditions were used as criteria for each of the 242 non-differentially expressed (NDE) genes. The relationship between these two criteria was analyzed by Pearson's correlation coefficient (r) using R 2.15.0 [55]. The CV has been previously used for this purpose in cereal crops [49,50]. In order to find genes having both a high number of reads and a low coefficient of variation among samples from different phenological stages and conditions, pseudo data sets were simulated by resampling the original data. The purpose was to ensure that the stability (low CV) and level of expression (high mean read counts) were due to features of the gene and not to random or experimental error. The procedure was as follows: for each original gene we calculated the mean and the CV of the read counts among the different conditions. Then, a pseudo data set was simulated representing a pseudo NDE gene under the 47 conditions. To represent this gene, 47 read counts were sampled at random from the original data matrix (242 × 47 observations), and both the mean and the CV were calculated for this pseudo NDE gene. In this way, 10,000 pseudo NDE genes were simulated. The 10,000 pseudo-values of the mean and CV were then sorted from lowest to highest. The 9,750th value of the mean (percentile: 97.5%) and the 250th value of the CV (percentile: 2.5%) were used as selection thresholds. Finally, only those genes that had both a mean of read counts above and a CV below the corresponding thresholds were selected. This algorithm was programmed using R 2.15.0 [55].
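The resampling procedure above was programmed by the authors in R; that script is not shown, so the following Python sketch is only one possible reading of the description. It assumes that read counts are drawn with replacement from the pooled matrix and that the sample standard deviation is used for the CV; neither detail is stated in the text. The function names and the toy matrix are hypothetical.

```python
import random
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def resampling_thresholds(count_matrix, n_sim=10_000, seed=0):
    """Estimate selection thresholds from simulated pseudo NDE genes.

    count_matrix: list of rows (genes), each a list of read counts
    (conditions). Returns (mean_threshold, cv_threshold): the 97.5th
    percentile of the simulated means and the 2.5th percentile of the
    simulated CVs.
    """
    rng = random.Random(seed)
    pool = [count for row in count_matrix for count in row]
    n_conditions = len(count_matrix[0])
    sim_means, sim_cvs = [], []
    for _ in range(n_sim):
        # Assumption: sampling with replacement from the pooled counts.
        pseudo_gene = rng.choices(pool, k=n_conditions)
        sim_means.append(mean(pseudo_gene))
        sim_cvs.append(cv(pseudo_gene))
    sim_means.sort()
    sim_cvs.sort()
    upper_rank = max(1, n_sim * 975 // 1000)  # 9,750th of 10,000 values
    lower_rank = max(1, n_sim * 25 // 1000)   # 250th of 10,000 values
    return sim_means[upper_rank - 1], sim_cvs[lower_rank - 1]


def select_genes(count_matrix, gene_ids, mean_threshold, cv_threshold):
    """Keep genes with a mean above and a CV below the thresholds."""
    return [
        gene
        for gene, row in zip(gene_ids, count_matrix)
        if mean(row) > mean_threshold and cv(row) < cv_threshold
    ]


# Toy matrix (not data from the study): one abundant, stable gene
# plus 19 less abundant, variable genes, each with 8 conditions.
gene_ids = ["stable_high"] + [f"toy_{i}" for i in range(19)]
matrix = [[4990, 5010, 5000, 4995, 5005, 5000, 4990, 5010]]
matrix += [
    [100 + 37 * ((i * 8 + j * 5) % 23) for j in range(8)]
    for i in range(19)
]

mean_thr, cv_thr = resampling_thresholds(matrix)
selected = select_genes(matrix, gene_ids, mean_thr, cv_thr)
```

Because the pseudo genes mix counts from all genes, a real gene passes both thresholds only if it is more abundant and more stable than would be expected from a random draw of the pooled counts.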

RNA isolation and cDNA synthesis
Total RNA was isolated from 3-4 g of frozen tissue using the modified hot borate method [56]. The quantity and quality of the RNA were assessed by measuring the A260/280 ratio and by electrophoresis on a 1.2% formaldehyde-agarose gel. First strands of cDNA were obtained by reverse transcription reactions with 2 μg of total RNA as template, using MMLV-RT reverse transcriptase (Promega, Madison, WI) and oligo dT primers according to standard procedures. The concentration of cDNA was assessed by measuring the absorbance at 260 nm, and each cDNA was finally diluted to 50 ng/μl prior to use in qPCR. The quality and quantity of cDNA were also determined using a Bioanalyzer (Agilent Technologies, Santa Clara, CA), with equivalent results.

Primer design
Gene-specific primers were designed using Primer Premier 5.0 software (Premier Biosoft International, Palo Alto, CA) and synthesized by Alpha DNA (Montreal, Quebec, Canada). The nucleotide sequences were obtained from a private database maintained at http://vitisdb.cmm.uchile.cl/. In addition, three genes encoding a polyubiquitin (UBQ10), plasma membrane intrinsic protein 2B (PIP2B) and elongation factor 1-alpha (EF-1α), together with their respective primer pairs, were selected from previously published reports [28] and evaluated for comparison. Accession numbers, primer sequences, expected amplicon sizes and melting temperatures are provided in Table 5.

Quantitative real-time PCR assays (qPCR)
Figure 5 Validation by qPCR of two putative reference genes in cDNA from 'Sultanina' samples. We selected 20 samples from five different phenological stages: before (fruit-setting, 6-8 mm berry size and V-2) and after (V + 2) veraison (V). A, AIG1 (VvAIG1) and B, T-complex protein 1 subunit beta (VvTCPB). Bars in the graphs correspond to standard error (SE) from four biological samples. Different letters represent significant differences at P < 0.05 by LSD test.
The abundance of each transcript was analyzed by real-time PCR with the LightCycler Real-Time PCR System (Roche Diagnostics, Mannheim, Germany), using SYBR Green™ as a fluorescent dye to measure the amplified DNA products derived from RNA. For each sample, qPCR was performed on three biological replicates, each assayed in duplicate, as described in García-Rojas et al. [57]. Briefly, the amplification reaction was carried out in a total volume of 20 μl containing 1 pmol of each primer, 5 mM MgCl2, 1 μl LightCycler™ DNA Master SYBR® Green I (Roche Diagnostics) and 100 ng of each cDNA analyzed. The thermal cycling conditions were: denaturation at 95°C for 10 min, followed by 35 three-step cycles of template denaturation at 95°C with a 2 s hold, primer annealing at 60-65°C for 15 s and extension at 72°C for 25 s. Fluorescence data were collected after each extension step. Melting curve analyses were performed and checked for single peaks, and the amplification product sizes were confirmed on an agarose gel to ensure the absence of non-specific PCR products. Fluorescence was analyzed using LightCycler™ Analysis Software (Roche Diagnostics). The crossing point for each reaction was determined using the Second Derivative Maximum algorithm and manual baseline adjustment.

Determination of reference gene expression stability
Expression levels of each of the 19 candidate reference genes in all samples were determined by assessing the number of threshold cycles (Ct) needed for the amplification-related fluorescence to reach a specific threshold level of detection. Ct values were transformed to quantities using a standard curve, which is a requirement for using geNorm. To manage the large number of calculations generated, we used a Visual Basic Application (VBA) for Microsoft Excel that automatically calculates the gene-stability value M for every control gene in a given set of samples [8].
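The transformation of Ct values to quantities by means of a standard curve can be sketched as follows. This is a generic illustration, not the authors' spreadsheet: it assumes a linear fit of Ct against log10 of the template quantity in a dilution series, and uses the common relation E = 10^(-1/slope) - 1 for the amplification efficiency. The dilution series and Ct values are hypothetical.

```python
from math import log10


def fit_standard_curve(quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [log10(q) for q in quantities]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ct_values) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ct_values))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def ct_to_quantity(ct, slope, intercept):
    """Invert the standard curve to obtain a quantity from a Ct value."""
    return 10 ** ((ct - intercept) / slope)


def efficiency_percent(slope):
    """Amplification efficiency in percent, E = 10^(-1/slope) - 1."""
    return (10 ** (-1.0 / slope) - 1.0) * 100.0


# Hypothetical ten-fold dilution series (arbitrary units) and Ct values.
dilutions = [100.0, 10.0, 1.0, 0.1]
cts = [20.0, 23.3, 26.6, 29.9]

slope, intercept = fit_standard_curve(dilutions, cts)
quantity = ct_to_quantity(23.3, slope, intercept)
efficiency = efficiency_percent(slope)
```

A slope close to -3.32 corresponds to an efficiency of about 100%, i.e., a doubling of product per cycle; the quantities obtained in this way are the input expected by geNorm.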

Statistical analysis for qPCR
Data from qPCR was subjected to statistical analysis of variance, and means were separated by LSD test at 5% level of significance using Statgraphics Plus 5 (Manugistics Inc., Rockville, MD).
The RNA-Seq data used in this study are available at the NCBI's Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) with the SRA Study accession number SRX366617.