Residue correlation networks in nuclear receptors reflect functional specialization and the formation of the nematode-specific P-box
© Lima Afonso et al.; licensee BioMed Central Ltd. 2013
Published: 25 October 2013
Skip to main content
© Lima Afonso et al.; licensee BioMed Central Ltd. 2013
Published: 25 October 2013
Nuclear receptors (NRs) are transcription factors which bind small hormones, whose evolutionary history and the presence of different functional surfaces makes them an interesting target for a correlation based analysis.
Correlation analysis of ligand binding domains shows that correlated residue subsets arise from the differences between functional sites in different nuclear receptor subfamilies. For the DNA binding domain, particularly, the analysis shows that the main source of correlation comes from residues that regulate hormone response element specificity, and one of the conserved residue sub-sets arises due to the presence of an unusual sequence for the DNA binding motif known as P-box in nematodes, suggesting the existence of different DBD-DNA specificities in nuclear receptors.
We conclude that DNA specificity and functional surface specialization has independently driven nuclear receptor evolution, and suggest possible binding modes for the class of divergent nuclear receptors in nematodes.
The large family of nuclear receptors (NRs) comprise regulatory transcription factors that are activated by specific ligands (usually small lipophilic molecules) and regulate a wide range of biological processes in metazoa, and the association of many of them to human diseases make them a current major drug target. The overall architecture of NRs consists of an N-terminal region (A/B domain), a DNA binding domain (DBD, or C domain), a molecular hinge (D domain), a ligand binding domain (LBD) and a C-terminal region (F domain). Given the functional importance of the LBD (where hormone binding is responsible for the structural changes and recruitment of other proteins for transcription initiation) and the DBD (which selectively binds to the DNA sequence known as the hormone response element, or HRE), they are the most conserved domains and are easily alignable (the DBD being the most conserved region). On the other hand, all other domains are highly variable in size and sequence (being even absent in some cases). There are also NRs lacking LBDs (as Knirps) or DBDs (as DAX-1, which has regulatory function by heterodimerization). More recently, nuclear receptors containing two DBDs have also been identified from the flatworm parasite Schistosoma mansoni and database mining have detected this architecture not only in other platyhelminthes but also in mollusks and arthropods . Such studies are possible due to the increasingly high availability of NR sequences - the current PFAM  release lists 4842 DBDs and 4622 LBDs from and 511 and 481 species, respectively. For both domains, more than 90% of the available sequences come from nematoda, chordata and arthropoda.
Phylogenetic analysis divide nuclear receptors in six classes, named NR1-NR6, plus a seventh class (NR0) for the unusual nuclear receptors lacking either the DBD or LBD. Nuclear receptors may also be referred to as Type I-IV: Type I receptors, comprising the NR3 subfamily, are cytosolic and ligand binding causes its transport to the nucleus after dissociation from heat shock proteins and homodimerization. Type II receptors, which correspond to the NR1 class, are kept in the nucleus, usually binding to DNA as heterodimers (RXR is the usual heterodimerization partner). They are inactivated by co-repressors in the absence of ligands, which, when present, cause the dissociation from co-repressors and co-activator recruiting. Type III receptors (NR2) are similar to Type I, the difference relying on the HRE sequence (Type I receptors bind to inversed repeats, while Type III bind to direct repeats). Finally, Type IV receptors are able to bind to DNA as monomers also, and are not restricted to a single class in the NR1-NR6 nomenclature.
The Caenorhabditis elegans (C.elegans) nematode is a particularly interesting model organism for the study of nuclear receptor evolution, due to the fact that its number of predicted NRs is above 300, 75% of them possibly representing functional genes . This is the highest number of NR genes ever found in a single species to date, being considerably higher than the amount found in Drosophila (~20) and even in humans (~50). The Caenorhabditis briggsae species, which also has available sequences, presents such a high amount of potential nuclear receptor genes as well. Although abundant, most C. elegans NR genes are difficult to be grouped in the NR1-NR6 system, which characterize a distinct event of proliferation and diversification for those receptors, probably from a series of duplications of a HNF4 ancestor, an orphan receptor . From all known C. elegans nuclear receptors, only seven have well described functions .
Protein families can be usually identified by well conserved motifs. In the case of nuclear receptors, DBDs can be easily detected by the presence of two highly conserved C4 zinc finger motifs, while LBDs contain a twenty residue long "signature motif" that stabilizes its canonical fold . Aside from positional conservation, which can define such motifs or detect family-wide functional important residues (such as those involved in catalysis in the case of enzymes), another useful way of studying protein families is looking for correlations - i.e., the fact that a given residue in a certain position in a multiple sequence alignment (MSA) increases or decreases the chance of observing another residue in another position. Quantitatively measuring such correlations was made possible due to the availability of a sufficient number of sequences from a given protein family, and many studies based on correlated mutations have appeared, especially during the nineties, following the seminal article of Göbel and co-authors in 1994 . This was followed by other correlation metrics such as mutual information, statistical coupling analysis, explicit likelihood , etc. Statistical coupling analysis have been applied to the study of LBDs , while the alignable portions of full-length NRs were studied by mutual information , which resulted in successfully reporting residues that connect the functionally important surfaces in nuclear receptors  and a set of three residues in the dimer interface that uniquely identify nuclear receptors . We have previously observed that further analysis of correlated positions in a multiple sequence analysis may reveal class-determining patterns - specifically, residues involved in metal selectivity and oligomeric state in Fe/Mn-Superoxide Dismutases , while Halabi et al. independently detected the existence of "sectors" in protein family multiple sequence analysis that may also evolve independently and be responsible for different functions . A new methodology was proposed in order to extract this additionally available and potentially useful information from correlation analysis  - specifically, we adopted the independent calculation of residue-specific correlation (which also enables calculation of the usually overlooked cases of anti-correlation), so that different class-determining groups are not masked by their potential presence in the same position in a multiple sequence alignment, which would be a limitation for metrics that report interpositional correlation (possibly followed by a clustering procedure) only. A correlation network where nodes are residue-position pairs (as in D48, E93 or K93) and connections are (anti-)correlation scores (e.g. a positive link between D48 and K93 means the presence of one of those residues significantly increases the odds of finding the other, while a negative link means they are unlikely to be simultaneously present in a given protein) is built and subsequently decomposed (using, for example, techniques from community structure analysis [16–18]) in order to detect residue groups (with their type explicitly reported) that may be related to a class-specific function or characteristic . The presence of well-defined functional surfaces, very different modes of action and a remarkable evolutionary history  led us to believe that correlation network decomposition could provide useful residue-specific information about the evolution and function of nuclear receptors.
Although nuclear receptor ligand binding domains have already been studied using correlation analysis , new insights can still be found when residue-specific metrics are used. The calculation and subsequent decomposition of the overall correlation network in LBDs resulted in four communities (a community in a network is a set of nodes which are well connected between themselves but not to the rest of the network; in a residue correlation network it consists of a set of residues which tend to appear simultaneously in a sub-set of a protein family). The first one consists of positions Ala283, Glu307, Phe289, Leu301, Pro287 and Gln297 (hRXRα numbering). Glu307 is close to the hormone binding site and all human nuclear receptors have an acidic residue in that position except for the steroid receptors and the two NRs lacking a DBD, DAX1 and SHP. The other residues in this community are close to the co-activator binding interface and also present in most human nuclear receptors.
The same is not true for the second community, consisting of residues Phe353, Gly288, Arg312 and Arg380 - whose individual frequencies are among 22-36% among all receptors. With residues (Gly288, Arg380) that are in contact with those in the co-activator binding site and also other two which are closer to the hormone binding site (Arg312, Phe353), they are not common to most human nuclear receptors, but actually present in specific NRs, possibly due to being involved in ligand specificity: in RAR, for example, a mutation in Arg272 (equivalent to position 312 in hRXRα numbering) affects ligand binding and causes resistance to all-trans retinoic acid , while in VDR the equivalent Arg274 is related to vitamin D binding, which turns undetectable upon its mutation to alanine . Another class-specific characteristic related to residues in this second community is found for Arg380: it is involved in a salt bridge which is specific for Class II nuclear receptors  - in the case of RARα, for example, the equivalent Arg339 binds to Asp267. In this receptor, residue Phe312 (position 353 in RXRα), is involved in retinoic acid expulsion .
The third community presents residues Lys371, Leu276, Trp282 and Trp305, which are present in about one half of all receptors in our final alignment. Positive residues in position 371 are found in a buried salt bridge that is present in some NRs (in RXRα, the charged pair is Arg371 and Glu239), while a leucine in position 354 (equivalent for 276 in RXRα) may be part of the binding site to the metastasis tumor-associated 1estrogen receptor repressor . In silico studies  suggested that the two tryptophans may be an important part of the estrogen receptor region that could bind polyproline-II containing proteins (which could result in the activation of mitogen activated protein kinases). These two tryptophans are extremely high correlated: the presence of a tryptophan in position 305 raises the frequency of tryptophans in position 282 from 50.5% to 84.5% (p < 10-40), in agreement with a possible important role for such residues as suggested by Jacquot et al. .
Finally there is a two-residue community consisting of Lys302 and Leu369, present individually in about 40% of the NR sequences. The role of these two residues has been well characterized for VDR. Using VDR numbering, Lys264 is crucial for ligand-dependent transactivation , while the mutation of Leu332 severely impairs its function, with all functions (ligand binding, heterodimerisation and gene transactivation ) abolished when Leu325 is also mutated .
The residues in group 3 also includes Gln188, which lies on another motif known as Distal Box or simply D-box, a region in the second zinc finger which is perpendicular to the P-box and mediates dimerization. Community 2 includes also Tyr147, at the β-hairpin on the vicinity of the first zinc finger and making important contacts with the major groove backbone; and D140, at the first zinc finger, facing the DNA backbone at the 3' side, as Q188 (but, in opposition to this same residue, at a significant distance of 5-7 Å of the cited backbone). Our structural analysis points to an apparent more controversial role for this residue along the DBD evolution - the apparent nature of the correlations found at the 3 sets of the DBD correlated analysis are detailed at the next topics.
Firstly, considering just the 1st, 2nd and 3rd "canonical" P-box positions at Figure 3 (E153 from community 3; G154 from community 2 and G157, also from community 3), it can be noted that just the first one between them make apparent significant and specific hydrogen bonds with the nitrogenous basis at the third basis-pair and have its aliphatic chain in proximity at the fourth basis-pair. Such apparent discrepancy is taken off, however, when we consider the adjustments of glycines 154 and 157 at the HRE major groove on a topological context. G154 is in close proximity of the DNA backbone wall at the 5' extremity of the DNA anti-sense strand (sequence 5'-TGACCT-3') and near, but in the opposite face, of the sites where the E153 side chain contacts the nitrogenous basis on the major groove (Figure 3). In turn, G157 is closely surrounded by the aliphatic side chains of the basic residues K156 (from set 2 of the correlation network), K160 and R161. From these three, R161 is strongly packed over the C7 atom of the final anti-sense thymidine and interacts with the phosphate-backbone at the great majority of the NR:HRE structures; K156 makes a direct saline bond with the already mentioned E153 side chain (in agreement with the strong anti-correlation between K156 and an E153R substitution, as seen in Figure 1) and hydrogen bonds with the second basis pair at the half-site; while K160 makes moderate interactions with nucleotides at the two central basis-pairs. Such highly packed system combined with the relatively limited space of the major groove causes stereochemical restrictions over the side chains that can be accommodated at the positions 153 and 157 in order to maintain the system interactions on their total integrity. On the study of Nelson et al. , the nature of the substitutions at these positions that were permissive for the recognition of a set of AGNNCA variants were congruent with stereochemical clash at these sites. The apparent concerted way in which the two P-box glycine residues adapt themselves to the stereochemical ambient at the DBD-major groove complex is also suggested by the anti-correlation between G157 (at set 3) and the G154A substitution (at set 1), as recovered from our correlation network (Figure 1).
Correlations between aberrant substitutions at the positions from the sets 2 and 3 of the correlation network and HRE recognition are also particularly evident at the analysis of the structure of the glucocorticoid receptor interacting with its specific HRE (GR:GRE complex, PDB:1GLU) (Figure 4-D). As it is usual for the 3-ketosteroid receptor subclass, GR presents the pattern CGSCKV in its P-box, with substitutions at the three principal positions that significantly modify the stereochemical characteristics of this segment. It is well known that such modifications are accompanied by a change in half-site selectivity, from AGGTCA to AGAACA . A less discussed issue, however, is that 3-ketosteroid receptors present also a subclass conserved substitution at residue 188 of the D-box, from the usual glutamine recovered in this study at the set 3 of the correlation network to a proline residue (Figure 4-D, arrow). Such substitution seems to act in concert with the modified P-box at the specific DBD-HRE fit (see below).
Therefore, the structural effects of the substitutions at the positions of set 3 in 3-ketosteroid receptor DBDs suggest an evolutionary pressure related to the optimization of the correct placing of the domain related to the modified DNA principal axis of its HRE, in order to maintain satisfactory interactions with the half-site and between monomers. It is interesting to note that such divergent pattern of the set 3 and set 2 (considering the position 154) for keto-steroid receptors does not appear at the correlation scheme found in Figure 1, apparently as a direct effect of the evolutionary history of 3-ketosteroid receptors. They are part of the NR3 subfamily, which is exclusive to vertebrates, and the current diversity of this subfamily (two estrogen receptor isoforms, an androgen receptor, a mineralocorticoid receptor, a glucocorticoid receptor and a progesterone receptor) was the result of duplication and diversification from an ancestral estrogen receptor . Being much more recent than the receptors containing a CEGCKG or similar motif, the very high similarity for the DNA binding domains of 3-ketosteroid receptors means that, in order to produce a well sampled final alignment, the procedures described in Materials and methods remove most of those receptors from the DBD alignment, resulting in only five sequences with a CGSCKV motif.
Although the influence of substitutions at the HNF4 R188 or GR P188, or still any other DBD 188 position over the HRE specificity has still not been systematically checked, such structural analysis, taken all together, corroborates that residue substitutions at the set 3 of the correlation network could act in a concerted way to adapt the DBD positioning to aberrant DNA contexts.
What is particularly striking is the appearance of residue set 1 (F147, R153, A154, A156 and A157) in the analysis. The resulting P-box motif (CRACAA) appears exclusively on C. elegans and C. briggsae, even though the final alignment is populated by other sequences from class Chromadorea (Brugia malayi, for example, is present with fourteen sequences). A search using all available sequences shows that this is true both for CRACAA and the physical-chemical similar motif CKACAA, with 178 and 18 available sequences, all of them from the Caenorhabditis genus. These results show that even though the explosive expansion of the nuclear receptor family in nematodes was accompanied by enough diversification in order to keep a statistically significant sequence number after the cutoffs described in Materials and methods (which were strict enough to reduce 3-ketosteroid receptors to a minimum), it is still noticeable that many of them kept conserved a specific P-box motif that is only found in this clade - which suggests that, analogous to what happens at the 3-ketosteroid receptors subclass, this group of nematode nuclear receptor may have evolved a different binding mode for DNA response elements.
The more intriguing substitution at the first set of the correlation network however is, without a doubt, the E153R one. Such substitution changes, simultaneous and drastically, the nature of the charge (from negative to positive), the pattern of interaction at hydrogen bonds (from acceptor to donor) and the extension of the side chain (from a four sections chain - two aliphatic and a carbonyl carbons plus the terminal acid oxigens - to a six sections one - three aliphatic carbons plus a nitrogen and a carbon from the basis plus the two terminal nitrogens); beyond to promote an enlargement of the terminal bidentate group (due to the addition of two protons at each terminal nitrogen on the basis, compared to the non-protonated oxygens at the acid group). In face of such accentuated physical-chemical and stereochemical changes, it is equally expected drastic changes at the topology and physical-chemical ambient of the complementary DNA sequence, implying on additional modifications on half site specificity. The co-substitution of the Gly residues at positions 154 and 157 by Ala is concurrent with the necessity of global topological changes on the major groove (as previously discussed, see Figure 3) for accommodation of the substitutions at set 1. In particular, R153 presents both a correlation with an Ala at position 157 as well as an anti-correlation with a Gly at the same position (Figure 1). Due the localization of the 157 residue near the center of the recognition helix (projecting itself near the center of the half-site major groove) (Figures 2, 3 and 7-A), it is expected that the side chain at this position presents more profound correlations with the global DNA topology (see, for instance, the global topological modifications that are accompanied by a G157A substitution on the ER:ERE complex, in Figure 3-C). As already discussed, the E153R substitution is anti-correlated with the presence of the K156 residue (Figure 1), which is justified by the physical proximity of the two residues, establishing an important saline bond when a Glu and a Lys are present at the respective positions (Figures 2 and 4). So the presence of the positively charged and long chain R153 on set 1 is accompanied by a K156A substitution that removes the positive charge and shortens the side chain at the site 156. A second inference that can be taken about such co-substitution is that the R153:DNA contacts should compensate the loss of the K156:DNA ones, i.e. R153 must interact with the first basis step of the half site (see Figure 2). The G154A bulky substitution at the opposite side of the half-site could, in turn, contribute to "push" the DBD mass center in the upstream direction (Figure 7) in order to facilitate the interaction of the R153 with these upstream basis, and even the packing of the R160, on the other H-A extremity, against the F147 residue (Figure 7), in an analogous way that the G154S substitution on 3-keto-steroid receptors seems to facilitate the positioning of the substituted G157V over the kink region on Figure 2-D. It is also interesting to note that the set of alanine residues at positions 154, 156 and 157, and the phenylalanine at position 147 form a hydrophobic "belt" between the recognition helix and the backbone wall, which could be helpful to adjust the H-A at the surface of the half site, while the three "deeper" arginine residues at the helix extremities (R153, R160 and R161) mediate the specific protein-basis contacts in a symmetric way (Figure 7).
Hence, the half site specific adaptations inferred for the E153R substitutions (i.e., global topological alterations and possible modifications on the 5'-NN first basis step) are, in principle, permissive for the adaptation individually inferred due the simultaneous T147F and K160R substitutions (i.e., a GT-TT dinucleotide step substitution at the center of the half site). Although the sequence based prediction of protein:DNA specificity is still a hard, and in a certain way risky, task (due the common superposition of long range topological terms over the local features[35, 36]), a goal for future studies would be to test the affinity of representative C. elegans DBDs for the 16 possibilities of the 5'-NNTTCA-3' half site and for more canonical ones (AGGTCA, AGAACA and AGTCCA), as well as the search for such sequences on C. elegans promoters (and from related organisms).
Finally, the statistics of residue prevalence at positions 140 and 188 (that do not appear at the set 1 of the correlation network) in the DBDs whose sequences fall at this same set, reveals interesting aspects of an apparent co-evolution of these two positions and from them and the rest of the correlation network. The great majority of the PFAM sequences used in our correlation analysis present the residues corresponding to sets 2 and 3, hence, with a glutamine at position 188 and an aspartate at position 140. As previously discussed and considering just DBDs that do not share the divergent set 1, the position 188 present some less frequent modifications, or as part of an alternative group (the case for the P188 of the 3-keto-steroid receptors) or as an individual discrepancy (as for the R188 of HNF4). Surprisingly, D140, which does not interact with DNA and that is repulsive for the backbone phosphates, is significantly more conserved between the DBDs that do not share the set 1. For the C. elegans DBDs presenting the set 1 however, while the position 188 present almost exclusively basic residues (46.6 % of Arg and 42.9 % of Lys), the position 140 has an accentuate variance, allowing almost all of the 20 aminoacids except prolines and tryptophans. The more prevalent residues at this position on C. elegans are Glutamine (25.7 %), Serine (12.5 %) and Glutamate (11.0 %), ranging, so, significantly, in chain size, charge and hydrogen bonding pattern even between the three more representative residues.
Another indication that the counter-balancing of interactions along the DNA and the DBD principal axis seems to be a driving force along the evolution is the pattern at which the residues from set 2 and 3 distribute themselves along the major groove interaction (Figure 8C and D). Both sets of residues are located at different diagonals crossing the DNA principal axis (red vertical axis at Figures 8-C and D); in order to stay, mostly, at different sides considering also an axis perpendicular to the first and passing through the H-A center (therefore dividing the major groove approximately by the half, blue horizontal axis at Figures 8-C and D). In this sense, each set controls interaction networks that, in principle, modulate the positioning of the DBD at each one of these respective diagonals (double arrow at Figures 8-C and D) and that, taken all together, can regulate small translation and rotation movements for the DBD around the entire major groove's inclined plane.
Taken all together, the correlation analysis for the DBDs indicates that evolutionary correlations for this domain are implicated on global fitting at the DNA topology, superposing such long range effects over the local interactions.
The correlation analysis of nuclear receptors presented here made possible the discussion of many residue-specific features in this protein family that could not be easily achieved from methods using just positional conservation. While LBDs were previously analyzed by such methods, the residue-specific approach revealed a much detailed picture for the relations between the different functional surfaces and class-specific residues. For DBDs, there is a strong clade-specific element in the correlation analysis, which, when analyzed in the light of the many structures of DNA bound DBDs, opens interesting possibilities for the understanding of hormone response elements specificity. We expect that such effort will be helpful in understanding the functional evolution of NRs, as well as to broaden the knowledge on the mechanisms of specific protein:DNA co-adaptation, an also on the molecular evolution of nuclear receptors.
Multiple sequence alignments of DNA binding domains (DBDs) were downloaded from the PFAM database (PFAM Code: PF00105). To avoid the presence of fragments or too divergent sequences, a minimum alignment coverage of 80% and minimum identity of 15% were imposed to all sequences, using human RXR-α as reference sequence. In order to reduce phylogenetic bias, an identity cutoff of 80% was applied. These procedures reduced the alignment size from 3702 to 508 sequences. Pairwise correlation scores were calculated for all residue+position pairs which were present on at least 25% of the sequences (minimum sub-alignment size was calculated as described in ). The correlation score was measured using -log(P) for correlation and log(P) for anti-correlation, where P is the p-value corresponding to the binomial probability of observing the corresponding frequency shift . Twelve residue-position pairs can be arranged in a network where a connection is added where a significant correlation or anticorrelation (minimum score = 10, Δf = 0.30, see  for details) is present. The resulting network is seen in Figure 1.
Ligand binding domain sequences were obtained from the PFAM database (PFAM code: PF00104), and given their higher variability when compared to DBDs, were filtered using 50% alignment coverage, but the same values for minimum (15%) and maximum (80%) identity. The final alignment consisted of 1042 sequences, with correlations being calculated with a cutoff score of 10, minimum alignment size of 20% and Δf = 0.30. The resulting network was considerably larger (49 nodes, 61 connections) than the one found for DBDs, requiring the use of heuristics for network decomposition by community detection [16–18]. Nineteen residue-position pairs were successfully grouped into four communities, while the remaining thirty residues remained isolated (due to presenting anti-correlation connections only). Software used for conservation and correlation calculations is available to academics under request.
Protein figures were prepared using PyMol (Delano Scientific), which was also used for manual amino acid substitutions in protein models. The DNA major axis was calculated using Curves . Canonical B-form models for DNA half sites (used at Figure 8) were built using Coot and the DBD was manually aligned in the complex form's position at such models using Pymol (http://www.pymol.org).
The final alignments for DBDs and LBDs are included as supporting files in PFAM format. A list of the proteins with available three dimensional structures is also provided.
LB acknowledges Fapemig (grant APQ-01840-11) for financial support.
Publication for this article has been funded by the Brazilian funding agency Conselho de Aperfeiçoamento de Pessoal de Ensino Superior (Capes).
This article has been published as part of BMC Genomics Volume 14 Supplement 6, 2013: Proceedings of the International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S6.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.