Conservation, diversification and expansion of C2H2 zinc finger proteins in the Arabidopsis thaliana genome

Background: The classical C2H2 zinc finger domain is involved in a wide range of functions and can bind to DNA, RNA and proteins. The comparison of zinc finger proteins in several eukaryotes has shown that there is a lot of lineage specific diversification and expansion. Although the number of characterized plant proteins that carry the classical C2H2 zinc finger motifs is growing, a systematic classification and analysis of a plant genome zinc finger gene set is lacking.
Results: We found through in silico analysis 176 zinc finger proteins in Arabidopsis thaliana that hence constitute the most abundant family of putative transcriptional regulators in this plant. Only a minority of 33 A. thaliana zinc finger proteins are conserved in other eukaryotes. In contrast, the majority of these proteins (81%) are plant specific. They are derived from extensive duplication events and form expanded families. We assigned the proteins to different subgroups and families and focused specifically on the two largest and evolutionarily youngest families (A1 and C1) that are suggested to be primarily involved in transcriptional regulation. The newly defined family A1 (24 members) comprises proteins with tandemly arranged zinc finger domains. Family C1 (64 members), earlier described as the EPF-family in Petunia, comprises proteins with one isolated or two to five dispersed fingers and a mostly invariant QALGGH motif in the zinc finger helices. Based on the amino acid pattern in these helices we could describe five different signature sequences prevalent in C1 zinc finger domains. We also found a number of non-finger domains that are conserved in these families.
Conclusions: Our analysis of the few evolutionarily conserved zinc finger proteins of A. thaliana suggests that most of them could be involved in ancient biological processes like RNA metabolism and chromatin-remodeling. In contrast, the majority of the unique A. thaliana zinc finger proteins are known or suggested to be involved in transcriptional regulation. They exhibit remarkable differences in the features of their zinc finger sequences and zinc finger arrangements compared to animal zinc finger proteins. The different zinc finger helix signatures we found in family C1 may have important implications for the sequence specific DNA recognition and allow inferences about the evolution of the members in this family.


Background
C2H2 zinc finger proteins (ZFPs) constitute an abundant family of nucleic acid binding proteins in the genomes of higher and lower eukaryotes. The number of ZFPs identified by in silico analysis corresponds to ~2.3 and ~3% of all genes in diptera and mammalia, respectively [1,2]. Approximately 0.8% of the proteins in Saccharomyces cerevisiae [3] have C2H2 zinc finger domains and about 0.7% in Arabidopsis thaliana (this paper). C2H2 zinc fingers (ZF) display a wide range of functions, from DNA or RNA binding to the involvement in protein-protein interactions. Therefore ZFPs not only act in transcriptional regulation, either directly or through site-specific modification and/or regulation of chromatin, but also participate in RNA metabolism and in other cellular functions that probably require specific protein contacts of the ZF domain. In addition, the comparison of the whole ZFP sets in major eukaryotic lineages has revealed a remarkable level of complexity through lineage specific diversification and expansion. These expansions often include ZFPs that contain conserved lineage specific non-finger domains like the vertebrate specific KRAB domain (reviewed in [4]) or the ZAD domain specific to diptera [1]. These domains are protein interaction domains with known or suggested repressor functions. Several ZFPs in plants, e.g. Arabidopsis and Petunia, have already been functionally characterized. They are involved in a variety of processes such as the regulation of floral organogenesis, leaf initiation, lateral shoot initiation, gametogenesis and stress response. Former reviews on plant-ZFPs [5] have been limited to approximately 30 proteins. But the systematic analysis of a complete ZFP set of a plant genome with the aim to predict some basic molecular functions is lacking. The genome annotation of the model plant Arabidopsis thaliana has reached high quality and allows comprehensive computational analyses. 
Here we describe the classification of the full set of ZFPs in the Arabidopsis genome including a genome-wide comparative analysis based on the in silico analysis of the whole proteome of this plant.

General classification and characterization of ZFPs in the Arabidopsis genome
In A. thaliana we found altogether 176 proteins that contain one or more ZF domains (Table S1 [see Additional File 1]), which by far exceeds the numbers previously reported by Riechmann et al. [6,7]. Therefore, according to our estimate, A. thaliana ZFPs (AT-ZFPs) constitute the most abundant family of putative transcriptional regulators in A. thaliana. So far most studies on the DNA recognition of ZFPs have been carried out on proteins with tandem arrays of fingers (reviewed in [8]). In the genomes of animals, these types of ZFPs (classified as sets A and B, see Methods) constitute the majority (about two thirds) of all ZFPs (S.B. unpublished data). In striking contrast to the animal kingdom, only a minority of 33 AT-ZFPs (about 20%) contain tandem ZF arrays, with 32 proteins in set A, containing up to five ZFs, and one protein (TF3A) in set B (Table 1), containing nine ZFs in more than one array. The vast majority of AT-ZFPs (about 80%) contain a single ZF or several dispersed ZFs (classified as set C). Set C can be further classified into three clearly distinguishable subsets, C1, C2 and C3 (Table 1). These subsets are characterized by ZF types that differ in the spacing between the two invariant zinc coordinating histidine residues, which is three (C1), four (C2) or five (C3) amino acid residues. A complete list of all 176 classified AT-ZFPs is shown in the supplementary Table S1 [see Additional File 1], including additional data and links to database information (MatDB, TIGR). Pairwise sequence comparisons of all ZF domains found showed that pairwise distances (PAM Dayhoff matrix, see Methods) among domains of subsets A1, C1, C2 and C3 varied and were lowest in A1 and C1 (1.02 and 1.27) and about twice as high in C2 and C3 (2.35 and 2.57). This suggests that ZF domains of subsets A1 and C1 are younger than those of C2 and C3 and result from a more recent expansion.
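The spacing rule that separates subsets C1, C2 and C3 can be illustrated with a short sketch. It is not the procedure used in this study (which relied on Pfam/HMMer searches and manual inspection); the regular expression, the function name `classify_spacing` and the example sequences are assumptions made only for illustration.

```python
import re

# Simplified C2H2 pattern (assumption): C-X2-4-C-X12-H-X3-5-[H/C].
# Group 1 captures the residues between the two zinc-coordinating positions;
# the lazy quantifier prefers the shortest valid spacing.
ZF_PATTERN = re.compile(r"C.{2,4}C.{12}H(.{3,5}?)[HC]")

# Spacing between the two zinc-coordinating histidines -> subset of set C
SUBSET_BY_SPACING = {3: "C1", 4: "C2", 5: "C3"}


def classify_spacing(finger_seq):
    """Return 'C1', 'C2' or 'C3' for a single finger sequence, or None if
    the simplified pattern does not match."""
    match = ZF_PATTERN.search(finger_seq)
    if match is None:
        return None
    return SUBSET_BY_SPACING[len(match.group(1))]


# Illustrative (hypothetical) finger with HX3H spacing
print(classify_spacing("YSCSFCKREFRSAQALGGHMNVH"))
```

The sketch only looks at the histidine spacing of one finger; whether a protein belongs to set C at all depends on the arrangement of its fingers, as described in Methods.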

Conserved AT-ZFPs
BLAST analysis against the non-redundant database of NCBI resulted in 33 AT-ZFPs that are conserved in other taxa. The proportion of conserved AT-ZFPs varies remarkably between the different subsets (Table 1), with minor proportions in sets A/B (6 of 33), C1 (2 of 77) and C2 (9 of 44), but a major proportion in set C3 (16 of 22).

Sets A and B
The first TF3A homolog in a plant (At1g72050) has recently been cloned and characterized [9]. AT-TF3A is highly diverged in sequence from TF3As in vertebrates, but it binds specifically to 5S rRNA and 5S rDNA as shown in [9]. Interestingly, the nine fingers of AT-TF3A are arranged differently from animal TF3A. Instead of having the nine fingers in a single array, the first finger is isolated and fingers 2-4 and 5-9 are arranged in two separate tandem arrays. We found two other plant orthologs of AT-TF3A in Oryza sativa and Medicago truncatula (data not shown). One additional conserved AT-ZFP has tandem ZFs with the typical characteristics of a DNA binding ZFP of animals. This ZFP, At4g06634, contains four tandem ZFs and is, like its putative ortholog TRM1 from Zea mays, a member of the YY1-family (Table 2). TRM1 was recently cloned and characterized as a suppressor of rbcS-m3 [10], a gene involved in photosynthetic CO2 fixation. It was shown in [10] that TRM1 binds to a YY1-like DNA site and to two other regions with no homology to the YY1 site. Since TRM1 and At4g06634 are highly conserved in sequence, we suggest that At4g06634 probably has similar DNA target sites. No functions have been described so far for any of the remaining four conserved AT-ZFPs (Table S1 [see Additional File 1], newSF1 and 2, ZP207-SF) in set A or for their homologs in other eukaryotes. They contain tandem fingers with rare HX4H spacing and have unusually short linkers (one or no residue), features that are conserved in the eukaryotic homologs.

Subset C1
Subset C1 comprises 77 ZFPs containing ZFs with HX3H or HX3C spacing. Only two of them, At5g09740 and At5g64610, are also conserved in other kingdoms. They arose through duplication and belong to the SAS-MOZ family (Table 2). Both ZFPs contain the conserved combination of a single ZF with CHROMO and SAS domains (Table 2). Therefore we suggest that they have a function in histone acetylation (HAT), a key process in chromatin-remodeling (reviewed in [11]).

Subset C2
In subset C2, there are more ZFPs that are involved in chromatin-remodeling processes and are conserved between plants and animals. These are VERNALIZATION 2 (VRN2), EMBRYONIC FLOWER 2 (EMF2) and FERTILIZATION-INDEPENDENT SEED (FIS2) [12,13,14,15]. They belong to the Polycomb group (PcG) and were given the name VEF family. As first described in [16], the ZF domains and other non-finger parts of these three AT-ZFPs are conserved in the Su(z)12 proteins of Drosophila and human. PcG proteins are required to maintain the transcriptionally repressed state of homeotic genes throughout development. The molecular function(s) of their single ZF are unknown, but data from Drosophila Su(z)12 suggest that it is involved in specific protein contacts rather than in DNA binding. FIS2 [12,13], VRN2 [14] and EMF2 [15] act as repressors in different developmental stages of Arabidopsis (reviewed in [11]). Another conserved AT-ZFP in subset C2 is the protein SERRATE (SE, At2g27100) [17]. The phenotype of the SE mutant reveals a role of the affected protein in the early steps of organ elaboration, and a role in the regulation of gene expression via chromatin modification was also suggested [17]. We assigned the proteins At5g01160 and At3g12270 to the evolutionarily conserved E7 and PRMT families (Table 2), respectively, based on their conserved combination of a single ZF with a RING and a PRMT3 domain, respectively.

Subset C3
Sixteen of the 22 ZFPs in subset C3 are conserved in other eukaryotes. There is no information available regarding the function of these proteins. Interestingly, we found that eleven of them are predicted to have a U1 type ZF (Table S1 [see Additional File 1], Table 2), which has HX5H spacing and conserved extensions on both sides of the ZF. U1 type ZFs are known or suggested to be involved in RNA binding, and therefore we suggest that these conserved AT-ZFPs with U1 type ZFs could be involved in RNA metabolic processes, e.g. splicing, like the spliceosome associated proteins of the SAP62 family (Table 2), corresponding to At2g32600 in Arabidopsis. According to our prediction, other conserved AT-ZFPs with U1 type ZFs are combined with domains like DNAJ (At1g74250), KOW (At1g55460) or G-patch (At5g26610) (Table 2), which are also known to be involved in different RNA metabolic processes (reviewed in [18]). Like chromatin-remodeling, many pathways of RNA metabolism are ancient, conserved processes in eukaryotes, which is reflected by our finding that several of the ZFPs described above are evolutionarily conserved in all eukaryotic taxa from Protozoa to Mammalia (data not shown).
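The proportions of conserved proteins quoted for the individual sets can be tallied in a few lines. The sketch below simply restates the counts given in the text for Table 1; the dictionary layout and the function name `conserved_summary` are illustrative assumptions.

```python
# (conserved, total) per set, as quoted in the text for Table 1
counts = {"A/B": (6, 33), "C1": (2, 77), "C2": (9, 44), "C3": (16, 22)}


def conserved_summary(counts):
    """Return the number of conserved proteins, the total number of proteins
    and the conserved fraction per set."""
    total_conserved = sum(conserved for conserved, _ in counts.values())
    total = sum(size for _, size in counts.values())
    fractions = {name: conserved / size
                 for name, (conserved, size) in counts.items()}
    return total_conserved, total, fractions


total_conserved, total, fractions = conserved_summary(counts)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.0%} conserved")
print(f"overall: {total_conserved} of {total}")
```

The per-set counts add up to the 33 conserved proteins among the 176 AT-ZFPs, and subset C3 is the only one in which conserved proteins form the majority.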

AT-specific ZFPs and their families
BLAST analysis resulted in the assignment of the 143 AT-specific ZFPs to eight families, six pairs and five single occurrences. This distribution reflects the high incidence of duplication events in the Arabidopsis genome. The two largest families, named A1 and C1, contain 24 and 64 members, respectively. Together they constitute about 60% of all AT-specific ZFPs. Additional data on other AT-specific ZFP families and pairs with uncharacterized members are given in Table S1 [see Additional File 1]. The C1 family is part of subset C1. We investigated the two biggest families in more detail.

The A1 family

General description
The A1 family, with 24 members, represents by far the most expanded AT-ZFP family of set A. All ZFPs of the A1 family contain four similar ZFs with a conserved arrangement as shown in Figures 1 and 3. We classified them into four different subgroups named A1a, A1b, A1c and A1d with 13, three, two and six members, respectively (Figures 1 and 2). These assignments are new and bring together plant ZFPs that were described in the past as belonging to different groups or families. Among them are StPCP1 from Solanum tuberosum [19] and ZmID1 from Zea mays [20,21], which are members of the A1a subgroup. Family A1 also includes the ZFPs of the recently described WIP-subfamily [22], which is identical to subgroup A1d with its six members. So far three members of family A1 have been studied in more detail and shown to be quite variable in function. StPCP1 was identified by its ability to confer growth on sucrose as the sole carbon source upon a sucrose uptake-deficient yeast strain [19]. ZmID1, encoded by the indeterminate1 (id1) gene, was the first example of a gene other than photoreceptors that is involved in the production or transmission of a flowering signal [20,21]. In the id1 maize mutant the terminal shoot meristem continues to display vegetative (i.e. indeterminate) growth. One characterized member of the WIP-subfamily or subgroup A1d is transparent testa 1 (tt1, At1g34790), a gene that has been found to be involved in seed coat development in Arabidopsis [22]. All three studied representatives of family A1 have been suggested to be transcriptional regulators. However, no regulatory DNA sites or target genes have been reported for any of them.

ZF arrays and subgroups
All 24 members of the A1 family contain four ZFs with the conserved arrangement F1 isolated, F2-F3-F4 in tandem (Figure 3). In former reports [19,20] only the first and the third ZF, F1 and F3, have been considered, but in [22] the existence of two possible additional fingers (F2 and F4) was mentioned though not discussed in detail. Both ZFs reported earlier match the consensus ZF pattern X2CX2CX12HX3H/C and have high scores in Pfam searches. In contrast, the "new" fingers F2 and F4 have unusual ZF patterns (Figure 3), resulting in low scores. Pairwise comparisons (not shown) indicate highest similarities between members of the same subgroup, with identities of 77-96% for the subgroups A1a, A1b and A1d and 62.5% for the two members of subgroup A1c. Comparisons between subgroups showed similarities of about 30-40%. From the neighbor-joining tree (Figure 2) and the domain architecture (Figure 3) one can infer that A1a and A1b are more similar to each other than either of them is to A1c or A1d. Likewise, A1c and A1d are more closely related to each other than either is to A1a or A1b. Remarkably, amino acid residues in the ZF helix positions -1, 3 and 6, known as primary DNA recognition positions, are among the most conserved positions. Moreover, about half of the conserved amino acids are residues with high specificity to particular DNA bases according to a proposed ZF-DNA "recognition code" [8], which allows for the prediction of possible core bases of a DNA site (Figure 4). We emphasize that this prediction is in part speculative because the ZF-DNA recognition rules are more complex than previously assumed.
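To make the helix numbering concrete, the following sketch extracts the residues at helix positions -1, 3 and 6 from a single finger. It assumes the numbering implied in this paper, in which QALGGH occupies helix positions 2-7, so that the first zinc-coordinating histidine is position 7 and there is no position 0. The regular expression, the function names and the example sequence are hypothetical, and the sketch does not attempt to apply any recognition code.

```python
import re

# Consensus quoted in the text: C-X2-C-X12-H-X3-H/C.
# Group 1 spans the 12 residues preceding the first histidine.
ZF_PATTERN = re.compile(r"C.{2}C(.{12})H.{3}[HC]")


def helix_position(finger_seq, position):
    """Return the residue at a helix position (-1 or 1, 2, 3, ...), taking
    the first zinc-coordinating histidine as helix position 7."""
    if position == 0 or position < -1:
        raise ValueError("helix positions are -1, 1, 2, 3, ...")
    match = ZF_PATTERN.search(finger_seq)
    if match is None:
        raise ValueError("no match to the C2H2 consensus pattern")
    first_his = match.end(1)  # index of the histidine at position 7
    # There is no position 0, so -1 lies seven residues before the histidine.
    offset = position - 7 if position > 0 else -7
    index = first_his + offset
    if index >= len(finger_seq):
        raise ValueError("position lies beyond the end of the sequence")
    return finger_seq[index]


def recognition_residues(finger_seq):
    """Residues at the primary DNA recognition positions -1, 3 and 6."""
    return {p: helix_position(finger_seq, p) for p in (-1, 3, 6)}


# Illustrative (hypothetical) finger carrying a QALGGH helix
print(recognition_residues("YSCSFCKREFRSAQALGGHMNVH"))
```

In the example, positions 2-7 read QALGGH, which is a quick consistency check on the numbering.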

Conserved non-finger regions in the A1 family
Using the expectation maximization search tool MEME (see Methods) we found additional conserved sequence parts in the regions outside of the ZF domains (Figure 3). These conserved regions vary in length and are not shared among the four subgroups, with the exception of a conserved N-terminal region of 29 residues that starts with an R/K rich stretch and is common to subgroups A1a and A1b. The basic amino acids could represent a nuclear localization signal [23]. In addition, all members of subgroup A1a contain two other conserved sequences at their C-terminus. The consensus of the first is 'SATALLQKAAQMGS'; the second is characterized by a pattern shown in Figure 3. These patterns could be necessary for protein interactions or localization.

The C1 family
With 64 members, the C1 family is by far the most expanded AT-ZFP family and includes about 85% of all ZFPs in subset C1. ZFPs in the C1 family are characterized by either a single finger or a varying number (2-5) of dispersed ZFs, most of them with the conserved QALGGH sequence in their alpha-helix positions 2-7. A part of this conserved plant family was investigated in petunia and named the EPF family [24]. Based on 21 petunia ZFPs with two, three or four dispersed ZFs, a first systematic classification of their ZF types was described in [24]. About 20 members of the Arabidopsis C1 family have also been described regarding their biological functions and expression characteristics [25,26,27,28]. We have subdivided family C1 according to the number of fingers into the subclasses C1-1i (N = 33), C1-2i (N = 20), C1-3i (N = 8), C1-4i (N = 2) and C1-5i (N = 1).

Representatives of the different C1 subclasses
Subclass C1-1i has 33 members exclusively found in plants or specifically A. thaliana (Table S1 [see Additional File 1]). 28 members show the invariant motif QALGGH in the alpha-helix (Figure 5a). The best studied member of this subclass is the SUPERMAN protein (SUP, At3g23130, AtZFP9) [28]. One suggested function of SUP is the coordination of stamen- and carpel-specific meristematic cells and hence the maintenance of the boundary between whorls three and four of the floral organ [29]. The SUP gene encodes a transcription factor with specific DNA binding properties [30]. It was shown in [30] that the minimal region required for specific DNA binding includes the single zinc finger and two basic regions on either side of the ZF domain. Based on these DNA binding studies and on the NMR structure of the ZF domain of SUP [31], a peculiar DNA recognition mechanism was proposed by the authors and will be discussed below.
Recently two further SUPERMAN-like representatives of subclass C1-1i were described for A. thaliana. RABBIT EARS (RBE, At5g06070) regulates petal development and is probably required for the early development of the organ primordia of the second whorl [32]. Additionally, the expression of At2g42410 (ZFP11) was investigated [33]. It has very low expression levels in flowers, axillary meristems, roots and stems. Interestingly, the deletion of the R/K rich stretch at the C-terminus of the ZF, that was shown to be important for DNA binding in SUP, revealed that this stretch is necessary for the nuclear localization of the protein.
Subclass C1-2i comprises 20 members (Table S1 [see Additional File 1]). Among them are a few with known biological functions. In [27] the expression of four proteins, namely STZ/ZAT10 (At1g27730), AZF1 (At5g67450), AZF2 (At3g19580) and AZF3 (At5g43170) was investigated. The authors showed that all four genes are involved in the plant's water-stress response. Our analysis assigned them to one subgroup (C1-2iD).
Subclass C1-3i contains eight ZFPs with three dispersed ZFs (Table S1 [see Additional File 1]), with an alignment of their ZF helices in positions -1 to 11. The only characterized ZFP of this subclass is ZAT1 (At1g02030), a member of subgroup C1-3iA [26]. Its sequence is very similar to the segmentally duplicated At2g45120 and At3g60580.
The only two proteins we found with four dispersed ZFs (C1-4i), At1g49900 and At5g56200, are very different in length and in their sequences. The sequences of the ZF helices are also shown in Figure 5c. Only one protein with five dispersed zinc fingers (C1-5i), At3g29340, occurs in the Arabidopsis genome. Nothing is known about the function of the four and five fingered proteins.

Classification of C1 ZFPs and evaluation of different ZF helix types
Based on multiple sequence alignments and tree analyses of the complete sequences of the C1 family members, we further assigned ZFPs of different subclasses to several subgroups as illustrated in Table S1 [see Additional File 1] and Figure 5. These assignments remain the same if we only compare the corresponding ZF sequences and their flanks (Figure 5). The comparison of the ZF helices only (position -1 to 10) resulted in five main signature types (Figures 5 and 6).
28 ZFPs of subclass C1-1i contain an invariant QALGGH sequence in their ZF-helix. The alignment shown in Figure 5a reveals several sequence features of the ZFs and their flanks which are unique to this subclass, e.g. the invariant N residue in helix position 9 and the conservation of the C-terminal R/K rich flank in five positions. We will refer to significant and apparently conserved positions as signature positions (e.g. the helix positions -1, 1, 8, 9 and 10). We call the most prevalent signature found in subclass C1-1i Q2-1 (Q2 refers to the glutamine in position 2 of the helix) (Figure 6). Comparison with the ZFPs that have two or more fingers showed that Q2-1 is unique to subclass C1-1i. 18 ZFPs of the C1-2i subclass contain the invariant QALGGH in both ZFs (Figure 5b). The signatures of the two fingers differ from Q2-1, and we refer to the signatures of finger number one and two as Q2-2 and Q2-3, respectively (Figure 6). In contrast to the Q2-1 type, they contain an A/T (Q2-2) or an R (Q2-3) residue at helix position 9. Furthermore, Q2-2 has an aromatic amino acid (F, Y or H) in helix position 1 whereas Q2-3 shows a conserved G. This observation agrees with the ZF domains of two fingered petunia ZFPs [24], where Q2-2 and Q2-3 are named ZF types A and B, respectively. The signatures of the three fingered ZFPs (C1-3i) are more varied (Figure 6). Based on the signature and arrangement of the three fingers we could subdivide the eight proteins into four groups, C1-3iA to C1-3iD (Figure 5c). In the first finger of all eight proteins and in the second finger of subgroup D we found signatures that we had encountered previously in a few representatives of C1-1i (C1-1iCa and C1-1iCb) and C1-2i (C1-2iX2). Their most prominent feature is the replacement of the Q of QALGGH with K or R. We classified the new signatures again according to their residues in helix positions 1 and 9. Signature K2-1 has a G in position 1 and an R/K in position 9, whereas K2-2 has L, W, M or S in position 1 and R or A in position 9. We also investigated the signatures of the four and five fingered ZFPs (C1-4i, C1-5i). The four fingered At1g49900 has the signature Q2-2/Q2-3/Q2-2/Q2-3 (Figure 5c). The second member of subclass C1-4i, At5g56200, shows the combination K2-1/K2-2/Q2-2/Q2-3 (Figure 5c). None of the signatures we found to this point match any of the ZF domains of the only five fingered ZFP (At3g29340). The signatures we found are summarized in Figure 6 and all domain arrangements in class C1 are presented schematically in Figure 7. The classification of the signatures is clearly reflected in the neighbor-joining tree of all C1 domains (Figure 8).
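The residue rules given above for the five main signatures can be written down as a small rule-based classifier. This is a deliberately reduced sketch: it uses only helix positions 1, 2 and 9, whereas the signatures in Figure 6 involve further positions, and the function name `classify_signature` and the input convention (an 11-residue string covering helix positions -1 and 1 to 10) are assumptions for illustration.

```python
def classify_signature(helix):
    """Assign one of the five main C1 signatures from an 11-residue string
    covering helix positions -1, 1, 2, ..., 10.

    Because there is no position 0, helix[0] is position -1 and helix[p] is
    position p for p >= 1. Returns None if no rule applies.
    """
    if len(helix) != 11:
        raise ValueError("expected 11 residues (helix positions -1 to 10)")
    pos1, pos2, pos9 = helix[1], helix[2], helix[9]
    if pos2 == "Q":
        if pos9 == "N":
            return "Q2-1"   # invariant N in helix position 9
        if pos1 in "FYH" and pos9 in "AT":
            return "Q2-2"   # aromatic residue in position 1, A/T in position 9
        if pos1 == "G" and pos9 == "R":
            return "Q2-3"   # G in position 1, R in position 9
    elif pos2 in "KR":
        if pos1 == "G" and pos9 in "RK":
            return "K2-1"   # G in position 1, R/K in position 9
        if pos1 in "LWMS" and pos9 in "RA":
            return "K2-2"   # L, W, M or S in position 1, R or A in position 9
    return None


# Hypothetical helix with Q in position 2 and N in position 9
print(classify_signature("SAQALGGHMNV"))
```

Helices that satisfy none of the rules are returned as unclassified rather than forced into the nearest type, which mirrors the observation that the five fingered protein matches none of the main signatures.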

Other conserved patterns
The motif search tool MEME revealed a number of other conserved patterns in family C1 (Figure 7). Most remarkable is a leucine rich stretch at the C-terminus of almost all ZFPs in this group. This was also described for homologs in petunia [24]. This stretch is called ERF-associated amphiphilic repression motif and is essential for repression activity [34,35]. We found a basic stretch adjacent to all Q2-1 ZF domains of subclass C1-1i. For SUP, RBE and ZFP11 this basic stretch was already described and suggested to be either involved in DNA binding or nuclear localization (see above). We suggest that it could have a dual function and may be important for both, as it was shown for other proteins [36,37]. Furthermore we found the motifs 'CLMLL' and 'KRKSTKR' N-terminal of Q2-2 in C1-2i and in varying places in C1-3i. The conserved stretches of basic amino acids found in different positions in ZFPs of subclasses C1-2i, 3i and 4i may also serve as nuclear localization signal. The location of all conserved patterns is shown in Figure 7.

Evolution
The majority of ZFPs in A. thaliana are plant specific and not conserved in other eukaryotes. Comparisons of pairwise distances revealed that ZF domains of subsets C2 and C3 show greater pairwise distances than those of the families C1 and A1. Therefore we can conclude that ZFPs of C2 and C3 are evolutionarily older than A1 and C1 which is supported by our finding that the proportion of conserved proteins is highest in subset C3 followed by subset C2 and that many of them are involved in ancient processes such as RNA metabolism and chromatin-remodeling. Families A1 and C1 are probably the result of a recent expansion. Both families almost exclusively contain plant and AT-specific proteins which supports the notion that they are younger families.
Kubo and coworkers [24] investigated members of the family C1 with two, three and four fingers and suggested, based on the distribution of domains, that multi-fingered proteins in petunia are probably older than those with two fingers. Based on our more comprehensive analysis of family C1 that includes also ZFPs with a single finger, we favor the alternative hypothesis that the single and two fingered proteins are older and the three and four fingered are derived. Q2-2 and Q2-3 are conserved between Arabidopsis and petunia (where so far only one single fingered protein has been reported) and Q2-1, Q2-2 and Q2-3 are highly conserved between single and two fingered ZFPs of rice and Arabidopsis. The K-types vary between rice and Arabidopsis which could indicate that there is less selective pressure on this type of zinc finger. We suggest the following scenario: the ancestor domains evolved to Q2-1 domains and duplicated to evolve into Q2-2 and Q2-3 domains, respectively, leading to the C1-2i ZFPs. Another duplication (probably Q2-3) led to three fingered ZFPs (C1-3i) and the domains K2-1 and K2-2. Recombination and also loss of domains could have led to the different three and four fingered types we see today, but can also explain the rare occurrence of one and two fingered proteins with K-type or similar domains (Figure 7). Based on the signatures of At1g49900 (Figures 5c and 6) we conclude that it arose from the duplication of a C1-2i protein.
The second four fingered protein At5g56200 probably arose from recombining proteins of the subset C1-3i. The only five fingered ZFP we found is too diverged in sequence to allow inferences about its evolution. We think that the number of the members of the respective subgroups, the distribution of Q2-2 and Q2-3 as well as the distribution of non-finger conserved motifs (Figure 7) favor the assumption of evolution from a low number to a higher number of domains and not vice versa. All main signatures we found seem to be conserved in the plant kingdom. The conservation of the signatures, especially the Q-type implies that ZF types with the same signature may recognize similar DNA sites.

DNA recognition by C1 family zinc fingers
We found five main signatures that are prevalent in the ZF helices of the C1 family, which suggests variability in the DNA recognition sites. DNA binding assays determined binding sites with an AGT core sequence for SUP (C1-1i) [30], as in earlier reports for ZPT2-1 (or EPF1) and ZPT2-2 of petunia (C1-2i) [38]. However, the ZF-DNA recognition mechanism of this family is not entirely understood. Experimental data published so far for SUP [30,31] and for the two petunia proteins [24,38,39] revealed a peculiar DNA recognition mechanism that is only partially in line with the canonical binding mode of tandem ZFs [8]. It was suggested for SUP [30,31] that all or some of the amino acid residues in positions -1, 2 and 3 of the alpha-helix and/or residues at the C-terminus of the helix are responsible for the base specific DNA recognition. Similar conclusions were drawn from amino acid mutation studies of ZPT2-2 [24]. These results emphasized the importance of both the invariant QALGGH sequence of the helix and the C-terminal flanking residues for DNA binding. In extension of their previous reports, Yoshioka and coworkers showed recently [40] that the optimal binding sites of petunia ZPT2-2 are slightly different, with AGC(T) and CAGT for the first and second ZF, respectively, which is in accord with the observation of different signature sequences (Q2-2 and Q2-3 or A and B) for the two fingers [24]. We suggest that differences outside the invariant QALGGH could be responsible for the slightly different optimal DNA binding sites of the first and second finger of ZPT2-2. Additionally, the optimal DNA binding site of SUP might also vary in comparison with the two DNA core sites of ZPT2-2. This is supported by an extended basic C-terminal flank of the ZF of SUP (Q2-1) which is characteristic for many other Q2-1 type fingers (Figure 5a), but is shorter or lacking in the two ZFs of ZPT2-2. In the case of SUP the flank contains five arginine residues, and it was shown in [30] that mutations from R to A in three of them abolish the DNA binding activity. In Figure 9 we compare the ZF helices and the C-terminal flanks of SUP and ZPT2-2 and suggest positions that may contact DNA. We emphasize that detailed conclusions concerning DNA recognition mechanisms of the C1-family ZFs, i.e. which amino acid residues make direct base contacts and which make stabilizing interactions with the phosphate backbone, cannot be drawn until the structure of a ZF/DNA complex is solved by NMR or X-ray analysis.

Conclusions
We showed that only a minority of AT-ZFPs are evolutionarily conserved, and our analysis further suggests that most of them could be involved in ancient biological processes like RNA metabolism and chromatin-remodeling. The majority of AT (plant)-specific ZFPs are known or suggested to be involved in transcriptional regulation and exhibit remarkable differences in the features of their ZF sequences and ZF arrangements compared to animal ZFPs. In A. thaliana we found two major families with recent expansions, one with zinc fingers arranged in tandem (A1), the other with a varying number of dispersed zinc fingers and the plant-specific invariant QALGGH motif in the alpha-helix (C1). However, our studies showed that most ZFPs of A. thaliana have their domains arranged in a dispersed manner and not in tandem. Additionally, novel plant specific ZFP-associated domains were detected that may be involved in DNA binding or repressor functions. Our results reflect the diversity of the transcriptional regulation guided by ZFPs in plants compared to animals. Our findings on signatures in zinc finger domains of the largest family C1, and on conserved non-finger motifs, give insight into the evolution of the ZFPs and will help to understand their DNA binding function.

Methods

Identification of ZFPs and of conserved ZFP-associated motifs
For the identification of the ZFPs we searched the Arabidopsis proteome (MatDB_v110103) using the HMMer package 2.1.1 [41] and the Pfam domain ZF-C2H2 (PF00096) [2]. The minimal cut-off for the search was chosen at a score of 0. The choice of this rather low threshold permits the detection of all ZFs/ZFPs, but also results in the detection of many false-positives. Therefore all identified ZFs/ZFPs subsequently were checked for overlaps with other protein motifs by manual inspection with Pfam and SMART [42] search tools and by BLAST search [43]. Putative C2H2 hits that overlapped with more significant hits of other motifs were eliminated. Usually, questionable C2H2 hits have very low scores and do not exactly fit the spacing of the Pfam C2H2 pattern. Pfam and SMART were also used for the identification of conserved non-finger domains in the AT-ZFPs. In addition, we have applied the program "Multiple Expectation Maximization for Motif Elicitation" (MEME) [44] for the detection of short conserved sequence parts that have not been described yet as Pfam and/or SMART motif. The program MEME detects conserved domains (with unknown sequence) in unaligned sequences. It starts with an initial alignment which provides an estimate of the amino acid composition at each position of the respective conserved stretch that is found. The two steps that follow, the expectation and maximization steps, are applied repeatedly to finally converge to a solution that offers the best location of the motif in each sequence and an estimate of the amino acid composition of each position of the motif.
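The filtering logic described above (a permissive score cut-off followed by removal of C2H2 hits that overlap more significant hits of other motifs) can be sketched as follows. In this study the overlap check was done by manual inspection; the automated version below, the `Hit` record, the use of a higher score as a stand-in for "more significant" and all example values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A domain hit on a protein (hypothetical record layout)."""
    protein: str
    domain: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float


def overlaps(a, b):
    """True if two hits lie on the same protein and share at least one residue."""
    return a.protein == b.protein and a.start <= b.end and b.start <= a.end


def filter_c2h2_hits(c2h2_hits, other_hits, min_score=0.0):
    """Keep C2H2 hits that reach the score cut-off and do not overlap a
    higher-scoring hit of another motif."""
    kept = []
    for hit in c2h2_hits:
        if hit.score < min_score:
            continue  # below the permissive cut-off
        if any(overlaps(hit, other) and other.score > hit.score
               for other in other_hits):
            continue  # explained better by another motif
        kept.append(hit)
    return kept


# Hypothetical hits: only the first C2H2 hit survives both filters
c2h2 = [Hit("P1", "zf-C2H2", 10, 32, 15.2),
        Hit("P2", "zf-C2H2", 40, 62, 1.5),
        Hit("P3", "zf-C2H2", 5, 27, -3.0)]
others = [Hit("P2", "RING", 35, 75, 30.0)]
print([hit.protein for hit in filter_c2h2_hits(c2h2, others)])
```

Comparing raw scores across different domain models is a simplification; a real pipeline would more likely compare E-values or rely on curation, as was done here.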

Classification of AT-ZFPs into families and subgroups
The identified AT-ZFPs were compared to the NR protein database of the NCBI in order to find evolutionarily conserved proteins. Furthermore, all against all BLAST searches of the AT-ZFPs were performed to define families and subgroups and the number of their members. ZFPs in any genome can be classified first into a few main sets based on the number, types and arrangements of their fingers as proposed earlier by us for ZFPs of the yeast genome [3]. All ZFPs containing tandem ZFs in one array or in more than one array are assigned accordingly to sets A and B, respectively, and all ZFPs containing a single ZF or dispersed ZFs are assigned to set C. Based on the results of our statistical analysis of linker lengths in ZFPs (S.B. unpublished data) we have defined tandem ZFs as fingers linked by zero to ten amino acid residues, with five residues as the most frequent (consensus) linker length. ZFs separated by longer spacers of eleven or more residues are considered as dispersed ZFs. Our choice of the upper and lower limits of ten and eleven linker/spacer residues for tandem and dispersed ZFs may seem somewhat arbitrary. However, it reflects experimental data on DNA binding ZFPs from literature, where a range of two to seven residues for the linker is given, but most frequently a consensus linker with five residues and the conserved sequence 'TGEK/RP'. ZF domains of large subfamilies were also subjected to phylogenetic analyses using ClustalX [45] for alignments and the PHYLIP package [46] for pairwise sequence distance (PAM Dayhoff matrix) and neighbor-joining analyses.
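The assignment to sets A, B and C from linker lengths can be sketched as below. The thresholds follow the definition above (tandem: zero to ten linker residues; dispersed: eleven or more); the function name `classify_set`, the coordinate convention and the example coordinates are hypothetical.

```python
def classify_set(fingers, max_linker=10):
    """Assign a ZFP to set A, B or C from its finger coordinates.

    fingers: list of (start, end) tuples, 1-based and inclusive.
    A tandem array is a run of two or more fingers joined by linkers of at
    most max_linker residues. One array gives set A, more than one array
    gives set B, and no array (single or dispersed fingers) gives set C.
    """
    if not fingers:
        raise ValueError("a ZFP needs at least one finger")
    fingers = sorted(fingers)
    arrays = 0
    run = 1  # number of fingers in the current tandem run
    for (_, prev_end), (next_start, _) in zip(fingers, fingers[1:]):
        linker = next_start - prev_end - 1
        if linker <= max_linker:
            run += 1
        else:
            if run >= 2:
                arrays += 1
            run = 1
    if run >= 2:
        arrays += 1
    if arrays == 0:
        return "C"
    return "A" if arrays == 1 else "B"


# Hypothetical coordinates mimicking the A1 arrangement:
# F1 isolated, F2-F3-F4 in tandem
print(classify_set([(10, 32), (80, 102), (108, 130), (136, 158)]))
```

Under this rule a protein with an isolated first finger followed by a single tandem array still falls into set A, which matches the description of family A1, whereas two separate arrays, as described for AT-TF3A, give set B.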

List of abbreviations
ZFPs: C2H2 zinc finger proteins
ZF: C2H2 zinc finger