Genome-wide analysis of the Hsp20 gene family in soybean: comprehensive sequence, genomic organization and expression profile analysis under abiotic and biotic stresses

Background The Hsp20 genes are associated with stress caused by HS and other abiotic factors, but have recently been found to be associated with the response to biotic stresses. These genes represent the most abundant class among the HSPs in plants, but little is known about this gene family in soybean. Because of their apparent multifunctionality, these proteins are promising targets for developing crop varieties that are better adapted to biotic and abiotic stresses. Thus, in the present study an in silico identification of GmHsp20 gene family members was performed, and the genes were characterized and subjected to in vivo expression analysis under biotic and abiotic stresses. Results A search of the available soybean genome databases revealed 51 gene models as potential GmHsp20 candidates. The 51 GmHsp20 genes were distributed across a total of 15 subfamilies where a specific predicted secondary structure was identified. Based on in vivo analysis, only 47 soybean Hsp20 genes were responsive to heat shock stress. Among the GmHsp20 genes that were potentials HSR, five were also cold-induced, and another five, in addition to one GmAcd gene, were responsive to Meloidogyne javanica infection. Furthermore, one predicted GmHsp20 was shown to be responsive only to nematode infection; no expression change was detected under other stress conditions. Some of the biotic stress-responsive GmHsp20 genes exhibited a divergent expression pattern between resistant and susceptible soybean genotypes under M. javanica infection. The putative regulatory elements presenting some conservation level in the GmHsp20 promoters included HSE, W-box, CAAT box, and TA-rich elements. Some of these putative elements showed a unique occurrence pattern among genes responsive to nematode infection. Conclusions The evolution of Hsp20 family in soybean genome has most likely involved a total of 23 gene duplications. The obtained expression profiles revealed that the majority of the 51 GmHsp20 candidates are induced under HT, but other members of this family could also be involved in normal cellular functions, unrelated to HT. Some of the GmHsp20 genes might be specialized to respond to nematode stress, and the predicted promoter structure of these genes seems to have a particular conserved pattern related to their biological function.


Background
Plants inevitably interact with climatic factors and are often subjected to different types of biotic and abiotic stresses. Environmental stress conditions, such as those related to drought, flooding, salinity, cold, heat, chemical substances derived from human activities and pathogens, have adverse effects on plant growth and crop yields [1,2].
Temperature is one type of stress that greatly affects crop production around the world; however, additional stress factors may also act either separately or simultaneously and ultimately place the plant under combined stresses, causing cell damage and the production of secondary stresses, such as osmotic or oxidative stress [1,2]. As part of a biological system, plants are also attacked by different pests and pathogens. The diseases caused by root nematode parasites belonging to different genera, such as Meloidogyne spp., and the fungus Phakopsora pachyrhizi, which causes Asian Soybean Rust disease [3], have been contributing to decreases in soybean yields, especially in tropical and subtropical regions.
Plants are sessile organisms that are not able to avoid exposure to adverse effects. However, they can supplant such exposure through the evolution of different morphological, molecular and physiological mechanisms or adaptations [4]. Heat shock proteins are often associated with plant responses to cold stress, heavy metals and reactive oxygen species (ROS) [5]. Heat shock proteins (HSPs) have also recently been found to be associated with the plant response to infection by pathogens such as nematodes [6][7][8][9], bacteria [10,11] and fungi [12,13]. The signals or specific factors that trigger the expression of Hsps genes during biotic stress are currently unknown, but the metabolic changes resulting from pathogen attack can generate similar signals or stimuli as those observed under abiotic stress activation [14,15].
The HSPs were first identified in Drosophila melanogaster in the response to heat shock stress [16]. These proteins are grouped into high molecular weight protein families, comprising the HSP100, HSP90, HSP70/DnaK and HSP60/ GroE, and low molecular weight families, including Heat Shock Protein 20 (HSP20) or small heat shock proteins (sHSPs) of 15-42 kDa [12].
The HSP20 proteins are ATP-independent molecular chaperones that usually form oligomeric protein complexes ranging from 9 to 50 subunits (200-800 kDa) and act by avoiding protein denaturation in both eukaryotic and prokaryotic cells [16,17]. These chaperones can also assist other chaperones in helping to maintain the native conformation of nascent polypeptide chains and in reorganizing denatured proteins to their native conformation. The main characteristic of HSP20 proteins is a highly conserved 80-100 amino acid sequence referred to as the alpha crystallin domain (ACD) located in the protein's C-terminal region. This domain is divided into two regions, N-terminal consensus I (27 amino acids) and C-terminal consensus II (29 amino acids), which are separated by a hydrophobic region of variable length. Moreover, the region upstream of the Hsp20 coding sequence generally contains several repetitions of the 5′-nGAAnnTTCnnGAAn-3′ (heat shock element (HSE)) sequence, which is recognized and activated by specific transcription factors, designated heat shock factors (HSFs) [12].
Plants have approximately four times more Hsp20 genes than animals [18]. This genic and functional diversification could be a consequence of their sessile biology. These proteins are encoded by nuclear multigenic families and are located in different cellular compartments [18]. Arabidopsis has 19 genes encoding Hsp20, grouped into 12 subfamilies based on their subcellular localization and homology, while 36 Hsp20 genes have been described in Populus trichocarpa and 23 in Oryza sativa [12,[19][20][21]. Other subfamilies have previously been described in other species, totaling 16 subfamilies in plants [19][20][21][22].
Recently, genetic evidence has revealed that chaperones play a fundamental role in plant immunity [23]. The chaperone activity of heat shock proteins during biotic stress has been shown to be important for the stability and accumulation of resistance proteins (R proteins) and for the coordination of the entire defense signaling cascade [24]. Thus, HSP20 activity is especially important in crops such as soybean, which are cultivated in large areas around the world and constantly subjected to severe and variable stress conditions. Moreover, soybean is one of the most important crops for providing both animal feed protein and human cooking oil [25,26] and has an important impact on the Brazilian economy [27]. However, nothing is currently known about the Glycine max Heat Shock Protein 20 (GmHsp20) family, and only one Hsp20 gene that is responsive to biotic stress has been identified in soybean. This gene was mapped to a Quantitative trait locus (QTL) responsible for Meloidogyne javanica resistance and found to be differentially expressed between resistant and susceptible soybean genotypes [3,7].
Given the evidence regarding plant HSP20s and their functional diversification, these proteins are considered ideal targets for improving the development of new varieties of soybean that are tolerant to a wide range of stress conditions or combinations of these stresses. Thus, the main objectives of this study were to identify GmHsp20 gene family members and carry out their molecular characterization, focusing on the regulation of their expression under different biotic and abiotic conditions, genome distribution and putative promoter structure.

Results
Identification and classification of the soybean Hsp20 gene family Hidden Markov model (HMM) analysis and name search resulted in the identification of 73 and 74 gene models from the Superfamily 1.75 and Phytozome v8.0 Soybean databases, respectively. After removing overlapping hits, 76 putative GmHsp20 were retrieved. PROSITE and MEME scans of these sequences confirmed the presence of an ACD in 74 of the 76 gene models. However, these ACD were identified in the C-terminal region of only 62 of the putative GmHsp20 gene models (Figure 1 and Additional file 1: Figure S1).
Three of the putative GmHsp20 with an irregular ACD disposition (GmHsp15.2, GmHsp15.4 and GmHsp16.2B) were also considered potential GmHsp20 members because their induction under HS has been previously observed (Additional file 1: Figure S2) [28]. The other 9 candidates eliminated from the analysis due to the position of the ACD and their absence in previous expression studies available in the expression databases. Besides, the Blastp analyses to these first 11 excluded genes also showed that seven genes were similar to unknown or not characterized proteins. One gene showed similarity to a predicted tropinone reductase from soybean. The remaining genes showed a low identity to plant Hsp20 genes.
Using the set of 65 GmHsp20 candidates, we searched for those candidates that had been detected in previous general gene expression experiments (see Methods). This search resulted in a total of 51 likely candidates (Additional file 2: Table S1). All 51 GmHsp20 candidates showed at least one repetition of the putative HSE in the 500bp, or 1,500 bp its applied, promoter region ( Figure 2; Additional file 1: Figure S3 through S5, and Additional file 2: Table S2). Thus, all of these potential candidates were considered in silico-predicted GmHsp20 genes.
Complexity and organellar localization of the soybean Hsp20 genes A phylogenetic tree constructed via alignment of the ACD amino acid sequence of the 51 in silico-identified GmHsp20 candidates made it possible to divide them into 13 of the 16 described Hsp20 subfamilies ( Figure 3). Based on the phylogenetic tree and in silico subcellular localization analysis, we identified soybean Hsp20 members related to the previously defined CI, CII, CIII, CIV, MI, P, ER and Px subfamilies as well as to the recently identified CV, CVI, CIX, CXI and MII subfamilies [12,18,22]. In addition, we identified three orphan genes, two of which (GmHsp28.6 and GmHsp28.7) clustered with the Arabidopsis AtHsp14-7 CVII subfamily and were found to be heat responsive in our in vivo analysis; however, the cluster bootstrap value was low (305; threshold > 500). The third orphan gene was GmHsp17.7A (not heat responsive).
In addition to phylogenetic analysis and prediction of subcellular localization, the prediction of protein secondary structure models for GmHSP20 subfamilies is important for determining the subfamily distribution (Additional file 2: Table S3). Subfamilies CI and CII contain amino terminal α-helices and a variable number of β-sheet segments, seven segments in CI members and six in CII members ( Figure 4). Subfamily CIII is very similar to CII, except that all members of the CIII subfamily exhibit one intron in the ACD and another in the third β-sheet segment. In addition, the CIII GmHsp20 genes present an intron in the 5′UTR region. Cytoplasmic subfamily CIV contains seven β-sheet segments and two α-helices in the ACD. Subfamily CV exhibits a conserved pattern in relation to the secondary structure observed in rice and Arabidopsis, where the presence of an intronic region just after the second β-sheet in the ACD is a peculiar feature.
The most important characteristic of the secondary structure of the ER GmHsp20 subfamily is the presence of two large β-sheets and an α-helices in the N-terminal region, where a signal peptide was also predicted. The two mitochondrial subfamilies contain four to five α-helices and one β-sheet in a small segment of the N-terminal region. In MI subfamily members, seven small β-sheet segments were identified, while only five segments were identified in the MII subfamily members. The profiles of MI and MII regarding the position of the ACD in the Cterminal region and the occurrence of an intron in the center of the protein are unique among the GmHsp20 families. A diagrammatic representation of the Hsp20 subfamilies showing the ACD region, intron positions, transit peptides and secondary structures is presented in Figure 4.
An amino acid sequence alignment of the 51 GmHsp20 proteins showed that the identity among the sequences varies from 17.50% to 98.99%. The highest values were detected in the C-terminal region corresponding to the ACD, while the lowest values were observed between members of different subfamilies.
The motive analysis of the Hsp candidates reveled that among the 51 possible GmHsp20 genes, 33 were intronless, while 11 contain only one intron, and seven showed exhibit two introns or more. The intron occurrences were validated in two gene models using conventional PCR. DNA amplification of GmAcd33.0 and GmAcd23.1 confirmed the presence of intron fragments of 527 and 457 bp, respectively, because cDNA amplification produced the expected amplicon as predicted via genome annotation using the Phytozome database (Additional file 1: Figure S6).

Genome organization and gene duplication
The 51 putative Hsp20 gene candidates are distributed across 17 of the 20 chromosomes in the soybean genome. No GmHsp20 genes were detected on chromosome 3, 5 or 9. Interestingly, closely related sequences of the CI subfamily clustered together in the phylogenetic tree and are mainly located on chromosomes 7 and 13, suggesting that the expansion of this gene family may have occurred via localized or intra-chromosomal duplication.
Four Hsp20 paralog gene groups were identified on chromosomes 14, 2, 4 and 17. Furthermore, duplication with a high similarity (96%) was detected between GmHsp22.4 in the terminal region of chromosome 10 and GmHsp22.0 at the same region of chromosome 20. GmHsp22.0 also shared a similarity of 80% with GmHsp22.5 (Additional file 3: Table S6). GmHsp17.9E on chromosome 20 also showed a likely (97%) duplicated region shared with GmHsp17.8A on chromosome 6, both of which are located on the upper arm of chromosomes. Finally, chromosomes 4 and 6 contain three putative duplications, one presenting 82% similarity, involving two of the four genes classified in subfamily P. The duplication prediction analysis indicated that the evolution of the Hsp20 gene Figure 2 Hidden Markov model logos obtained using MEME/ TOMTOM based on predicted soybean Hsp20 promoter sequences. The motifs obtained via MEME analysis were plotted according to their position within the consensus sequences, and their locations are presented as graphs using HMM. The events are classified by p-value and aligned with motifs available at DB. The motif e-value is an estimate of the expected number of motifs with the same size and occurrence that would be present in a set of similarly sized random sequences. The heights of the symbols at each motif position indicate sequence conservation. The sequences were manually highlighted to indicate the perfect recognized HSE consensus (red box; nGAAnnTTCnnGAAn or nTTCnnGAAnnTTCn) and the imperfect HSE module (yellow box). Three motifs are found, with the elements being represented by the perfect HSE in alignments 1 (Matches to Query: 1) and 3 (Matches to Query: 3) in the figure. Only the upstream sequences of gene models with a predicted 5′UTR were analyzed. family in the soybean genome resulted from a total of 23 gene duplications, five of which were segmental among four chromosomes ( Figure 5).

GmHsp20 expression under heat shock and cold treatments
To validate the 51 putative GmHsp20 candidates under stress, we investigated their in vivo expression profiles in heat shock and cold experiments. Out of the 51 primers designed for these individual GmHsp20 genes, 43 produced only one amplicon and were used in this analysis. In addition, two gene models resulting from the database exploration analyses that showed a high homology with rice Acd genes were also analyzed.
Among the 43 GmHsp20 candidates analyzed in vivo, 40 showed strong induction under heat shock, which was easily visualized using conventional PCR (Additional file 1: Figure S7). The expression of all candidates under heat shock treatment was quantified via quantitative real-time polymerase chain reaction (qRT-PCR) ( Figure 6 and Additional file 2: Table S5). Based on the obtained results, four of the GmHsp20 candidates (GmHsp16.4D, GmHsp15.7B, GmHsp17.7A and GmHsp19.5) and two Acd genes (GmAcd33.0, GmAcd23.1) were not induced significantly in stressed soybean roots compared with the Figure 3 The unrooted phylogenetic tree for the 51 predicted GmHsp20 members. The tree was constructed through alignment of the ACD amino acid sequences from Hsp20s from the following species: Glycine max (soybean Hsp20 genes are highlighted in red); At, Arabidopsis thaliana; Os, Oryza sativa; Vv, Vitis vinífera; Ta, Triticuma estivum; Ps, Pisum sativu; Le, Solanum lycopersicum; Zm, Zea mays; Lp, Lycopersicon peruvianum; Cd, Cynodon dactylon; So, Saccharum officinarum; Sb, Sorghum bicolor; Hv, Hordeum vulgare. The predicted GmHsp20 genes were attributed to 12 of the 16 known Hsp20 subfamilies. ER (endoplasmic reticulum), P (plastid), Px (peroxisome), MI and MII (mitochondrion). control conditions. To check whether these gene models were induced in another part of the plant exposed to heat shock, we also performed a qRT-PCR analysis using foliar samples, but no induction was detected.
As expected, cold stress showed a weaker influence on GmHsp20 expression compared with the heat treatment. Under cold conditions, only five genes were responsive. GmHsp18.2A, GmHsp18.0B, GmHsp16.4C and GmHsp22.0 were induced, while GmHsp27.3 was down regulated; the first two genes belong to CIII and the others to the CI, ER and P subfamilies, respectively. All five genes were also induced under the heat stress treatment.

GmHsp20 expression under M. javanica infection and differences in resistant and susceptible soybean genotypes
The expression of the 51 putative GmHsp20 candidates was also monitored in soybean plants inoculated with M. javanica as a biotic stress model. The qRT-PCR analysis following the biotic stress treatments resulted in the identification of six responsive GmHsp20 genes and one GmAcd gene (GmAcd23.1). Five of the six GmHsp20 genes induced by heat shock stress, GmHsp22.4, GmHsp17.6B, GmHsp17.9B, GmHsp16.2B and GmHsp22.3B, were also significantly induced in at least one of the biotic stress treatments tested, while the other Hsp20 gene, GmHsp19.5, exhibited induction that was detectable only at 4 days post-inoculation (dpi) and was not responsive to heat shock stress. GmHsp22.4 was induced at 4 dpi and repressed at 8 dpi in BRS 133, while GmHsp17.6B was repressed at 4 dpi ( Figure 6). Four genes, GmHsp17.9B, GmHsp19.5, GmHsp22.3B and GmAcd23.1, were induced only at 4 dpi in BRS 133, while GmHsp16.2B was induced at both 4 dpi and 8 dpi.
When the nematode-induced GmHsp20 genes were compared between the two soybean genotypes, four gene models showed a differential expression profile: GmHsp16.2B, GmHsp22.3B, GmHsp17.6B and GmHsp22.4. GmHsp16.2B and GmHsp22.3B were induced in the susceptible genotype (BRS 133): the former at both 4 and 8 dpi, and the latter only at 4 dpi. Interestingly, both GmHsp22.4 and GmHsp17.6B were down-regulated in BRS 133 at 8 dpi and up-regulated in the resistant genotype (PI 595099) at 8 dpi ( Figure 7). The complete arrangement of the genes according to their expression profiles after the treatments can be seen in the Venn diagram provided as Additional file 1: Figure S8.
The analysis of disease resistance QTL locations in the soybean genome using data on corresponding molecular markers (http://soybase.org/) confirmed a strong bond between these sites and the Hsp20 genes. At least, one QTL appears to be related to 22 of the 51 GmHsp20 gene candidates (Additional file 4: Table S7). Among the QTLs reported in SoyBase that are related to nematode and fungal disease resistance, 45 QTLs were found to be physically located near to, in approximately 2 Mb flanking regions, at least one GmHsp20. The GmHsp17.9A gene, reported by Kandoth et al. [6] to be related to Heterodera glycines infection, was identified as being located near a QTL for resistance to such infection (SCN 18-3). The GmHsp17.4A, GmHsp22.4 and GmHsp17.6B genes are also situated near QTLs involved in biotic stress.

Characterization of putative cis-elements in GmHsp20 promoters induced by abiotic and biotic stresses
The 51 Hsp20 candidates were also evaluated regarding the occurrence and distribution of putative cis-elements in their promoter regions. For this analysis, 48 of the 51 in silico-predicted candidates, plus two GmAcd genes, were considered based on the availability of promoter regions in Phytozome (see Methodspredicted 5′UTR). The promoter regions exhibit a characteristic consensus TATA box, which was identified in 29 members, followed by putative HSEs that are present in all 50 genes, ranging from one to four repeats ( Table 1). The putative HSEs are mostly concentrated approximately 150 bp upstream of the transcription start site. Exceptions to this pattern were found (GmHsp16.2A, GmHsp15.2 and GmHsp15.4) in which a single putative HSE was identified, occurring approximately 500 bp or more upstream (up to 1,500 bp upstream).
Other elements known to function as cis-elements in Hsp20 genes, such as CAAT box, W-box and TA-rich regions, were also predicted at variable positions [7,14,29]. A putative CAAT box was detected in 47 genes and a putative W-box in 31. Among the GmHsp20 genes identified in this study as being responsive to nematode infection, four showed at least one putative W-box site: GmHsp22.4, GmHsp17.6B, GmHsp17.9B and GmHsp16.2B. All four of these genes were induced by high temperature (HT).
Overall, it was possible to identify an organization pattern of the putative cis-elements in the promoters of the six GmHsp20 candidates that were responsive to nematode infection. CAAT boxes presented a heterogeneous distribution ranging from two to five repetitions within the region containing putative HSEs or immediately upstream, while putative W-boxes were located more distantly, upstream of the putative HSEs. In all six GmHsp20 genes, the first putative HSE is located within the first −83 bp upstream of the start site, while the putative HSE was followed by at least one putative CAAT box. At least one putative W-box was identified in four of the six GmHsp20 genes (GmHsp22.4, GmHsp17.6B, GmHsp17.9B and GmHsp16.2B) induced by nematodes, but they were all at different positions (−406 bp, -4 bp, -71 bp and −476 bp, respectively) ( Figure 8).

GmHsp20 putative operational promoter models
The results of the in silico analysis and in vivo expression profiling of the GmHsp20 candidates under biotic and abiotic stress conditions were used to determine a putative transcription factor binding site (TFBS) combinatorial models for their promoters. Comparative analysis of the promoters of the five GmHsp20 genes responsive to cold stress (GmHsp16.4C, GmHsp18.2A, GmHsp18.0B, GmHsp27.3 and GmHsp22.0) resulted in five putative operational models containing the mandatory HEAT element. Other putative common elements were L1BX, AHBP, GTBX, VTBP, MYBS, MYCL, MYBL and ABRE ( Figure 9).

Discussion
Based on a genome-wide analysis, we propose that the soybean genome is enriched in small heat shock genes, presenting 51 Hsp20 genes, compared with the 31 and 39 Hsp20 genes observed in Arabidopsis and rice, respectively [12,21,30]. This greater expansion of the Hsp20 gene family in soybean was expected because the soybean genome has experienced successive duplications throughout its evolution [25]. According to our analysis, a total of 23 gene duplications of the Hsp20 family can be predicted to have occurred in the soybean genome. As reported above, the vast majority of the GmHsp20 genes (47 gene models) were strongly induced by heat treatment (Figure 6), which suggests that our in silico pipeline for predicting GmHsp20 candidates was efficient. Three of the GmHsp20 candidates were expressed only under non-stressed conditions, while GmHsp19.5 was exclusively induced after M. javanica inoculation. These results indicate that some GmHsp20 genes exhibit functions that are unrelated to heat shock under normal growth conditions, for example, specific housekeeping activities, in addition to more specialized activities, such as in the response to biotic stress. This functional diversification of the Hsp20 gene family has also been reported in sunflower and rice [12,31].
In earlier attempts to categorize the subfamilies of Arabidopsis Hsp20 genes, it was proposed that the majority of the AtHsp20 could be divided into seven subfamilies (CI, CII, CIII, M, P, ER and Px) and that five other genes do not fall into any subfamily [21,22]. In a more recent analysis, the AtHsp20 gene family was extended to include 12 subfamilies based on placing the five uncategorized Hsp20 genes into four new nucleocytoplasmic subfamilies (CIV, CV, CVI and CVII) and adding a new mitochondrial subfamily, MII [22]. A recent categorization of the rice Hsp20 gene family proposed a distribution of OsHsp20 genes into 16 subfamilies: four nucleocytoplasmic subfamilies (CVIII, CIX, CX and CXI) plus 12 subfamilies already identified in Arabidopsis [12]). GmHsp20 clustering with Arabidopsis subfamily CVII and rice subfamilies CVIII and CX was not observed in the present study when bootstrap values were considered. However, in the 15 remaining subfamilies, we were able to identify at least 51 members, two of which subfamilies have not yet been described in the literature. Our results suggest that there are 10 nucleocytoplasmic subfamilies in Figure 6 Heat map of the expression profiles of 43 GmHsp20 candidates and 2 Acd genes. The expression profiles were analyzed under biotic (nematode infection) and abiotic (heat and cold) stress conditions. The expression profiles under stress conditions, based on qRT-PCR data, are presented as heat maps generated using TreeView 1.60 software. The transcript levels following heat shock stress are depicted using a color scale indicating log10 values and are shown beside the transcript levels following the other stresses. The spots highlighted in yellow indicate the genes that showed a significant expression level change (at a 5% significance level) compared with the control under cold and biotic stresses treatments. The spots highlighted in blue indicate the genes that did not exhibit a significant expression level change (at a 5% significance level) compared with the control under heat shock treatment.
soybean, the largest of which is subfamily CI, with 19 members (Figure 3).
The large number of GmHSP20 proteins classified into nucleocytoplasmic subfamilies is a feature shared with other species, such as Arabidopsis and rice [12,22,30], and indicates that the cytoplasm may be the primary site of action for HSP20 proteins. In the cytoplasm, where protein assembly occurs, a higher concentration of Hsp20 proteins could prevent in appropriate folding or interactions that could lead to the formation of prejudicial aggregates.
Notably, in the phylogenetic analysis, the Hsp20 genes from different species that are classified in the same subfamily were observed to be more closely related than the members of the same species that belong to different subfamilies. This finding gave us an indication that synteny might exist among soybean, rice and Arabidopsis HSP20 proteins. The Hsp20 genes most likely had a common ancestor that gave rise to the different subfamilies before the diversification within these species [30].
Regarding gene organization, 64% (33 of the 51 gene candidates) of the soybean Hsp20 genes are intronless based on genome prediction and qRT-PCR data, which is similar to the percentage reported for rice Hsp20s (74%) [12]. Few of the GmHsp20 genes contain introns, and their lengths are highly variable. The relationship between the occurrence of introns and the expression level of a gene is controversial [32,33]. In some studies, the absence of an intron, or a short intron length, has been found to enhance the level of gene expression in plants [34,35]. In addition, there are indications that during evolution, genes must be rapidly activated in response to stress tend to show a decreased intron density [36]. This may be the mechanism that has led to more rapid induction of the expression of plant Hsp20 genes, which occurs within a few minutes after the initiation of heat shock [12].
Among the GmHsp20 genes containing introns, 10 (35.71%) contain only one intron, and two (GmHsp18.0B and GmHsp18.2A) contain an intron in the 5′UTR region; these two genes were induced by cold stress. According to Kamo et al. [37], the presence of an intron in the 5′UTR region can potentiate the translation process.
Furthermore, our results indicate that the GmHsp20s can be classified as unstable proteins, since 76.5% of aminoacid sequence showed an unstable profile (when instability index threshold were considered [38]) (see Additional file 2: Table S4). An unstable profile is believed to be a common feature among stress-induced proteins [39]. Considering that HSP20 proteins are synthesized at a specific time in the cell, their instability indicates a rapid turnover that should allow transcriptional regulation of these proteins in the cellular environment [31,40].  As expected, the GmHsp20 genes were preferentially located in terminal regions of the soybean chromosomes, which have been demonstrated to be enriched in genes in the soybean genome [25]. This localization might contribute to the occurrence of segmental duplications in the soybean Hsp20 family. Similarly, the genome duplications experienced by the soybean genome during its evolution and the high recombination rates between segmental regions of homologous chromosomes might have increased the occurrence of gene duplications [41] and, consequently, favored the expansion and functional diversification of the GmHsp20 family. Based on our analysis and the findings of Schmutz et al. [25], particularly their conclusions about soybean genome evolution and The CCAAT, TATA box, W-box and TA-rich positions identified 500 bp upstream of the GmHsp20 candidates are estimated in relation to the transcription start site. a The results are presented according to MEME, MatInspector and Place software. The TA-rich elements were obtained using the program PlantCare. The abbreviation "n/a" means that the upstream region is not available on Phytozome. Figure 8 Cis-element analysis in the biotic stress-induced genes. Place, Genomatix, MEME and PlantCare were used to analyze the region 500 bp upstream of each of the seven genes that were responsive to biotic stress (six GmHsp20 genes and one GmAcd). Blue boxes indicate TATAbox elements; red, the HSE; gray-green, the CAAT boxes; green the TA-rich elements; and gray, the W-box elements. Genes GmAcd23.1 and GmHsp19.5 were responsive to only biotic stresses, while the others were also heat-responsive.
organization, we suggest that the evolution of the soybean Hsp20 family has involved a total of 23 gene duplications, five of which were segmental on four different chromosomes ( Figure 5). The same number of duplications has been reported for the rice genome, in which 23 OsHsp20 genes were originated via duplication events [30]. These segmental duplications appear to have contributed significantly to increasing the number of members of the soybean CI subclass, located on chromosomes 7, 8, 13 and 14. In rice, the CI members are also distributed in clusters of segmental duplications [30]. Considering the concept of parsimony, the conservation of this pattern of Hsp20 gene duplication within the same chromosome observed in the genomes of rice, Arabidopsis and soybean most likely originated through segmental duplication events that occurred before the divergence of monocots and dicots, in the ancestral species, followed by chromosomal duplications in both the ancestral species and within each species [25,42]. Still, it is notable that three of the six GmHsp20 genes that are responsive to M. javanica and H. glycines infection (GmHsp17.4A, GmHsp17.9B [6] and GmHsp17.6B [7]) are organized in blocks of segmental duplications in the genome. In soybean, expansion of the segmental gene families associated with the basal resistance response is common and has been observed in families including NBS-LRR, F-box and auxin-responsive genes [25]. Such duplications may contribute to the diversification of relevant alleles during plant-pathogen interactions or to the maintenance of similar levels of gene expression within the block, as observed in rice [30,43]. However, unlike the results reported by Ouyang et al. [30], the expression pattern of the tandem duplicated genes, under the stress conditions tested here, was observed to be highly heterogeneous for GmHsp20.
Based on the organization of the soybean genome, the number of Hsp20 paralog gene groups observed between chromosomes 14, 2, 4 and 17 corroborates their high synteny, as described by Schmutz et al. [25]. Furthermore, this organizational characteristic was observed between chromosomes 20 and 10 as well as between 6 and 4, but 7.08% of the length of chromosome 20 is still homologous to fragments of four other chromosomes [25]. The putative interchromosomal duplication observed between GmHsp22.4 and GmHsp22.0, located at the ends of the lower arms of chromosomes 10 and 20, respectively, is an example of the high rate of recombination between homologous chromatids in chromosome arm end regions.
Our expression analysis showed that the regulation of soybean Hsp20 genes is generally associated more with heat stress than with the other tested stresses. A total of 47 GmHsp20 candidates, including all of the organellar genes, were highly induced under heat shock stress in the roots and leaves, showing variation that ranged from four up to 10,000 times higher expression at 42°C compared with the control condition. The Hsp20 chaperone function under heat shock has been elucidated, but the functional roles of these proteins under other stresses or non-stress conditions have not been extensively worked out. The fact that these genes can be induced not only by heat shock but also under other stress conditions, as demonstrated in this study, reflects an interconnected mechanism of induction involving the HSFs. Hsp20 genes are known to be specifically controlled by different HSFs, which is interesting considering that there are 52 soybean HSF genes, while other species have closer to 30 HSFs [44,45].
The expression profiles of subfamily CIV and GmHsp17.7A differed from all of the other clustered nucleocytoplasmic GmHsp20 genes, mainly because they were not altered by HT, even when the leaves were tested. The tissue-specific expression patterns of Hsp20 genes have been reported in different species. In Arabidopsis, the expression profile of some AtHsp20 genes under heat shock shows great variation depending on the tissue tested [46], while in rice, the expression profiles of the OsHsp18.8-CV and OsHsp19.0-CII genes were shown to be regulated differently in flowers and pistils, respectively [12]. In contrast, our results demonstrate very similar expression profiles of the GmHsp20 genes among the tissues analyzed under heat shock treatment (four GmHsp20 and two Acd genes).
The GmHsp22.4, GmHsp17.9B, GmHsp17.9A and GmHsp17.4 genes were induced by M. javanica in the susceptible genotype and have been described by Kandoth et al. [6] as also being responsive to H. glycines infection (Figure 7). Similarly, four OsHsp20 genes were found to be induced under the biotic stress of infection with M. grisea fungus [12]. Similarity analysis revealed that the rice gene Hsp16.9A-CI is homologous to GmHsp17.9B, suggesting that a functional role of this gene, being activated under pathogen infection, might be conserved. Furthermore, two other genes (GmHsp22.4 and GmHsp17.6B) are clearly involved in biotic responses. In earlier attempts, GmHsp17.6B was mapped to a QTL responsible for Meloidogyne javanica resistance and displaying a differential expression profile in resistant and susceptible soybean genotypes [3,7]. In our analyses, GmHsp22.4 was shown to be highly induced in the resistant genotype compared with the susceptible genotype; this gene has been described as being associated with the response to H. glycines infection in soybean [12] and as being located near a biotic resistance QTL (http://soybase. org) (Additional file 4: Table S7).
In silico analysis of the GmHsp20 promoter were in line with previous results that reported the occurrence of putative HSEs within −83 bp from the transcription start site in Hsp20 genes that are responsive to nematodes. Five GmHsp20 genes induced by M. javanica followed this pattern described by Barcala et al. [14] (Table 1 and Figure 8). Only GmAcd23.1 and GmHsp19.5, which were induced by nematodes, did not exhibit this pattern.
In Arabidopsis mutants for Hsp20 genes involved in the responses to nematode infection, the TATAbox element should be preferentially located between 12 and 21 bp upstream of the transcription start site, followed by an HSE at around −83 bp and a CCAAT box between 84 and 141 bp upstream of the transcription start site [14]. The promoter of one Hsp20, a CAAT box element was previously reported in the promoter region of Hs1 pro-1, a gene conferring complete resistance to H. glycines, and appears to be essential for site-specific regulation [29]. The promoters of all Hsp20, which are responsive to nematode infection, also show putative CAAT elements. Moreover, the GmHsp20 biotic stress-responsive genes followed the same pattern observed in Arabidopsis and sunflower and not observed in the others Gmhsp20, where the CAAT box always occurs either between the HSEs or immediately upstream of them, while the W-box, when present, is further upstream of the HSE. However, previous studies have shown the function of these cis-elements in the Hsp20 regulation in Arabidopsis and sunflower, the involvement of them in soybean responses to nematodes need to be checked by in vivo experiments [9,14].
TA-rich elements have been described as being directly involved in the regulation of the expression level of an Hsp20 gene in response to nematode infection in soybean [7], and they appear to act by altering the distances between other cis-elements, interfering with the strength of the promoter [47]. The number of TA repetitions in the promoter region of a soybean genotype resistant to M. javanica appears to be correlated, in a significantly higher level, with GmHsp17.6B expression observed in response to this stress. The resistant plants contain 32 TA repetitions in the GmHsp17.6B promoter region, while the susceptible plants have only nine [7]. Our in silico analysis showed the occurrence of a putative TA region in the promoter regions of GmHsp20 responsive to M. javanica infection (GmHsp17.6B, GmHsp22.4, GmHsp17.9B and GmAcd23.1). It will be now interesting to investigate if these TA rich regions are really GmHsp20 cis-elements i.e., if the number of TA repetitions can be correlated to nematode resistance for these genes and if the deletion of TA region can interfere in gene expression.
Two Acd genes, GmAcd33.0 and GmAcd23.1, were not induced by heat shock in our analyses, and a sequence comparison showed that these genes exhibit high homology to the rice genes OsAcd41.4 and OsAcd31.8, respectively. The cellular roles of the Acd genes are not very well established, but their homologs in rice and Arabidopsis have been shown not to be involved inheat shock responsive (HSR) [12]. These findings, combined with the irregular localization of ACD at the N-terminal ends of the proteins, might suggest that these genes are not real Hsp20 genes [48]. Interestingly, however, both genes present a normal HSE distribution in their promoters, and one of them, GmAcd23.1, was induced under biotic stress in the susceptible genotype. Thus, it appears that the Acd genes might play roles similar to the constitutive Hsp20 genes or could be proteins that are involved in specialized functions.

Conclusions
This study makes a relevant contribution by identifying 51 potential genes that we suggest compose the soybean Hsp20 gene family. The combination of in silico prediction strategies and in vivo expression analyses showed that the applied bioinformatic tools were very efficient in precisely identifying GmHsp20 family members. In addition, the GmHsp20 genes were divided among 13 of the 16 known plant Hsp20 subfamilies and two additional unknown subfamilies that showed unique secondary structures and phylogenetic relationships between the soybean subfamily members and with members of the Hsp20 gene families identified in other species. We have presented evidence of the genomic complexity and diversity of the expression of the soybean Hsp20 gene family. The soybean Hsp20 genes are distributed across 17 chromosomes, where gene duplication events have most likely resulted in expansion of the family, most notably for the CI subfamily. The vast majority of the Hsp20 genes analyzed in vivo (47 genes) were found to be strongly induced under heat shock, but other members of this family could be involved in normal cellular functions, which are unrelated to heat shock. Among the GmHsp20 genes that were HSR, five were also identified as being involved in the soybean response to cold, and five others were responsive to Meloidogyne javanica infection. Furthermore, one predicted GmHsp20 was shown to be responsive to nematode infection, while no change in expression was observed under other conditions. These genes showed a divergent expression pattern between the examined resistant and susceptible soybean genotypes. The promoter region of the GmHsp20 members is minimally defined by the presence of a putative TATAbox and one to three putative HSEs, but results obtained elsewhere suggest that other regulatory elements found in this study are also likely to be important, such as W-box, CCAAT box sequences and TA-rich regions. The promoters that were responsive to biotic stress followed the same in silico predicted ciselement composition and distribution patterns that have been described for other species following nematode infection. Moreover, further investigation is required to obtain clues regarding the functions of the individual genes identified in this study. The results presented here can be further analyzed to reveal candidate genes and promoter structures that will be useful in developing technologies that generate genotypes that are more resistant to the various stresses that affect soybean crops.

Database screening and sequence analyses
The soybean genome annotation database (DB) of Superfamily 1.75 and Phytozome v8.0 (Joint Genome Institute (JGI)) was subjected to Blast searches employing the HMM profile of the Hsp20 domain (PF00011) downloaded from PFam (http://pfam.sanger.ac.uk/) to identify Hsp20 genes with an e-value ≤ 0.001.
An additional search strategy was to use the word "Hsp20" as a keyword to identify gene models annotated as Hsp20 in the soybean genome, which could potentially be missed when only the HMM profile is used due to the presence of incomplete domains. Finally, the redundant sequences obtained from both DBs were removed, and a total of 76 candidate soybean Hsp20 gene models were returned.
The proteins and 500 bp upstream regions of all predicted genes were searched against the Phytozome database. All predicted proteins were examined for the Hsp20 domain using MEME (http://meme.sdsc.edu/ meme/cgi-bin/meme.cgi) [49] with the following parameters: repetitions per sequence = 1; maximum number of motifs found = 1; and an ideal motif size between 80 and 100 amino acids [16]. The motif sequence identity was confirmed via analyses using InterProScan (http://www. ebi.ac.uk/Tools/InterProScan/) and PROSITE (http:// www.expasy.org).
The protein sequences of the GmHsp20 candidate genes were evaluated with EXPASY PROTPARAM (http://www. expasy.org/tools/protparam.html) to obtain their molecular weights, theoretical isoelectric points (IP) and instability indices (with a value >40 considered unstable). The chromosomal locations, intron numbers and sizes (bp) were obtained using the Phytozome DB.
Digital expression analysis of the Hsp20 genes was performed with gene expression evidence search tools against the soybean data available at Genevestigator (https://www.genevestigator.com/gv/plant.jsp) [28], Soybase (http://soybase.org/soyseq/) [51] and the LGE -Soybean Genome Project (http://bioinfo03.ibi.unicamp.br/soja/). Duplications of Hsp20 genes considered parameters 70% identity and 80% coverage. The genes were plotted on chromosomes using MapChart software and physical localization data available at Phytozome. The soybean disease resistance QTLs for nematodes and fungi were retrieved from SoyBase (http://soybase.org, as of Dec. 2011). The physical locations of these QTLs were inferred based on information on the physical locations of markers, which was posted in SoyBase as soybean map version 4.0, and only the QTLs with an associated marker were considered [52].

Growth conditions and stress treatments
Soybean seeds (G. max L, cv BRS 133 and genotype PI 595099, which are susceptible and resistant, respectively, to infection with Meloidogyne javanica obtained from the Embrapa Soja Active Germplasm Bank (AGB] were soaked for three days in sand. After stage V3 was reached, the plants were subjected to abiotic or biotic stress in two independent experiments. In the abiotic stress experiments, BR133 genotype plants were exposed to a temperature of 42 ± 2°C (heat stress), or 4 ± 2°C for 3 hours (cold stress), or were maintained at 25 ± 2°C for 3 hours (control plants). After being subjected to these stresses, the leaves were immediately collected in liquid nitrogen and transferred to −80°C. In the biotic stress experiments, the BRS 133 (susceptible) and PI 595099 (resistant) genotypes were inoculated with 500 infective second-stage juvenile (J2) M. javanica or not inoculated (control plants). The M. javanica eggs were collected as described by Hussey and Barker [54]. The roots were collected at 4 and 8 days post-infection.

Nematode DNA and plant RNA isolation
Each sample from three repetition blocks, with three replicates, was macerated separately using a pestle, mortar and liquid nitrogen. After maceration, the samples were distributed into 1.5 mL microtubes and stored at −80°C.
After the experiment, to obtain molecular verification of nematode infection, we performed DNA extraction and subjected the DNA to PCR, according a protocol described by Rahmanet al. [55] (Additional file 1: Figure S9). A specific oligonucleotide primer set was used to detect M. javanica infection, resulting in an amplicon of 945 bp (foward_5′-CAAAACCACGCGGCTTCGGC-3′ and reverse_5′-TGGGGGTGCCCTTCCGTCAA-3′).
Total RNA (1 μg) was isolated from frozen roots, that were infected with M. javanica or not infected, and from samples subjected to abiotic stress treatment using an RNA extraction kit with the TRIzol® reagent (Invitrogen, Carlsbad, CA, USA). Quantification and quality analysis were performed with an Uniscience NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) at a wavelength of 230 nm or via agarose gel electrophoreses, respectively, and the RNA samples were treated with deoxyribonuclease I (Kit DNaseI, Invitrogen) (Additional file 1: Figure S10).
To synthesize cDNA from treated RNA, we used the SuperScript™ III Kit (Invitrogen) according to the manufacturer's instructions and stored the cDNA at −20°C. Validation of the quality of the cDNA samples was performed using PCR with primers designed to anneal to two different exons of the β-actin gene (forward_ 5′-CCCCTCAACCCAAAGGTCAACAG-3′ and reverse_ 5′-GGAATCTCTCTGCCCCAATTGTG-3′) (Additional file 1: Figure S11).

qRT-PCR
The expression profiles of the GmHsp20 gene models and the Acd genes were evaluated under abiotic and biotic stress conditions using qRT-PCR. Primers specific to each of the 51 candidate gene models and two Acd genes were designed using the software Primer3Plus (http://www. bioinformatics.nl/cgi-bin/primer3plus/primer3plus.cgi) and Vector NTI Advance™ (Invitrogen). The sequences of the primers are listed in Additional file 5: Table S8.
The cDNA samples were amplified with primers specific to each gene model and for the β-actin gene as endogenous control, at a final concentration of 0.1-0.25 μM, with the 1X SYBR Green Master Mix Kit (Applied Biosystems) in a final volume of 12.5 μL. The E = [10 -1 /slope] -1 formula was employed to calculate the reaction efficiency and to adjust the final primer concentration. The calibration curve was established based on the Ct and the log of the cDNA dilutions. The reactions were performed in a 7300 qRT-PCR thermocycler (Applied Biosystems) following the manufacturer's instructions. After initial steps at 50°C for 2 min (UNG activity) and at 95°C for 10 min (activation of the Ampli Taq Gold polymerase), a two-step program of 95°C for 15 s and 62°C for 1 min was run for 40 cycles. Dissociation curves were obtained to guarantee the absence of nonspecific amplification. The data were collected in the log phase, and the results were analyzed with the Sequence Detection program (Perkin Elmer, Waltham, Massachusetts, U.S). The final relative quantification of each gene compared with the control conditions was estimated considering the RQ obtained in each biological replicate, represented by each independent experiment, with three replicates each. Significant differences were determined based on estimates of the standard deviation (SD) and with REST software version 2.0.7 (p < 0.05) (http://gene-quantification.eu/chapter-3-pfaffl.pdf).
Screening for putative TFBS (transcription factor binding site) combinatorial models