### Transcription factor numbers

We focused on ten major super-families of transcription factors: lambda repressor-like, C-terminal effector domain of the bipartite response regulators, winged helix, srf-like domains, DNA binding domain (GCC box), helix-loop-helix, Zn2/Cys6, glucocorticoid receptor-like (hormone receptors), C2H2 and C2HC zinc fingers and homeodomains (Table 1). We used the superfamily database (version 1.69) to obtain the numbers of TFs from each super-family in different organisms. The superfamily database contains extensive annotations of structural domains of proteins in 250 sequenced organisms using Hidden Markov Model profiles [64]. The database contains 1439 super-families. We focused on the major super-families of transcription factors studied in [15], and added all super-families that contained the terms "DNA binding" or "transcription". This resulted in 32 super-families. We further filtered super-families in which the maximal number of predicted proteins in a single organism was smaller than 50. For the remaining 10 super-families, we determined the maximum number of TFs as the maximal number of proteins from each super-family after discarding organisms with less than 5 proteins and discarding the top 1% of the remaining organisms. It is important to note that the super-family domain assignment may contain predicted transcription factors due to the appearance of the relevant structural domain, which are in fact not functional, or which have other roles in the cell [65]. Thus, the maximal numbers presently found may be an overestimate.

The coding arguments presented here apply to the number of distinct DNA binding domains. Since the vast majority of known binding sites correspond to only one TF each in a given organism, one may assume that the TF family size is approximately equal to the number of DNA binding domains used by that family.

### Binding sequence databases

Position-Specific Score Matrices (PSSM) for 46 *E. coli* transcription factors, were constructed based on the RegulonDB database [66]. The set of transcription factor binding sequences for each TF were searched for aligned motifs using AlignACE [67]. We chose the top-scoring motif, and considered only TFs with four or more aligned sequences contributing to that motif. Finally we removed the non-specific DNA binding factors FIS, HNS and IHF [21].

For the yeast *S. cerevisiae*, we used 94 PSSMs based on a set of 102 PSSMs constructed by Harbison et al [9]. We filtered out proteins that either do not bind DNA directly or always bind as a complex: Gal80, DIG1, STB1, Met4, HAP2, 3, 4, 5. All PSSMs were converted to a probability representation, where the sum of each PSSM column is 1.

For humans we used the PSSMs in the JASPAR database [68]. This data set consisted of 49 PSSMs.

### Measurement of sequence similarity

To measure the similarity between binding sequences of a pair of factors we assessed the distances between their PSSMs. To compare pairs of PSSMs, we use a distance measure related to the one used by Wang et al [69]. The present measure, described below, is stringent in the sense that it scores bases according to their conservation within the PSSM, and compares to randomized PSSMs that preserve these conservation profiles. It is more appropriate for the present purpose than simpler methods such as direct comparison of sequence Hamming distances, because the present interest is in the active base pairs in the site, rather than base pair differences that have little functional impact on binding.

We denote the length of the binding sequence for TF_{i} as *n*
_{
i
}, and its PSSM by *M*
_{
i
}. The PSSM is an 4**n*
_{
i
} matrix in which each column, *p*
_{i,k} is a vector of length 4 holding the probability of observing letters A,C,G,T at position *k* in the set of aligned binding sequences of TF_{i}. For each PSSM we created an information profile [7] denoted by *I*
_{
i
}. The information profile is a *n*
_{
i
} length vector whose *k*'th element quantifies the conservation of position *k* in the PSSM:

*I*_{i,k} = 2 - *H*(*p*_{i,k}) (1)

where

is the entropy of *p*
_{i,k}. *I*
_{i,k} has a minimum of 0 when all four bases have equal probability of appearing at position *k*, and a maximum of 2 when all aligned binding sequences have the same base at position *k* (the small sample size correction of [7] was applied). As the PSSMs in our database have different lengths, and the 'important' positions for the TF-target recognition are the conserved positions, we used the information profiles *I* as weight vectors when comparing two PSSMs.

We define the similarity between TF_{i} and TF_{j} as:

The maximum is taken over all relative shifts *A* of the two PSSMs, with a minimum of 5-positions of overlap. *d*
_{ij,k} is a similarity of the *k*'th position of the relatively shifted PSSMs. We used two different measures for *d*
_{
ij
} : One minus the Shannon-Jensen distance [70]:

and a simple correlation between the two probability vectors *p*
_{i,k} and *p*
_{j,k}. Both measures gave very similar results. For each pair we computed the similarity with the reverse complement as well and took the maximal similarity. To detect pairs with significant similarity we first chose only pairs for:

*D*_{
ij
} > *f** min(*D*_{
ii
}, *D*_{
jj
}) (5)

Where *f* is a numerical factor (we used *f* = 0.75). This amounts to requiring that the similarity between the sets of binding sequences of two TFs comparable to the similarity between the sequences of each TF. For the pairs that passed this criterion, we created an ensemble of 1000 random PSSM pairs and computed a distribution of similarities *D*
^{
r
}
_{
ij
}. The random PSSMs were created by randomly exchanging the A-T and C-G positions in each column of the original PSSMs. This operation preserves the information profile, as well as the GC content, and therefore forms a stringent ensemble. Similar sequences were sequences which had a p-value<0.005 for *D*
_{
ij
} using both similarity measures.

### Measurement of similarity of biological function of TFs

We defined two pairs of transcription factors as functionally similar if they jointly co-regulate a significant number of target genes. This information was obtained from transcription regulation networks: For yeast we used the network of [35, 36]. For E. coli we used the network based on the data in the Ecocyc database [34]. It is important to note that these networks are based on direct experimental measurements, and not on putative interactions based on binding site predictions. For each pair of TFs we used a hyper-geometric test to assess whether the number of genes regulated by both TFs is significantly larger than expected from the fraction of target genes of each TF alone. This measure of functional similarity normalizes for the variable number of target genes for different TFs. We used a hyper-geometric test to detect enrichment of TF pairs with significant target co-regulation in the group of TF pairs with similar sequences.

A second measure of biological similarity was based on the similarity of the functional annotation of the gene targets of each factor. We used functional annotations[33, 34] for the top tier of the annotation tree (except sub-cellular localization and general annotations such as "protein with binding function"). For yeast we used the following functional categories from the FunCat database [33]: metabolism, energy, cell cycle and DNA processing, cell rescue, defence and virulence, interaction with the cellular environment, interaction with the environment, transposable elements, viral and plasmid proteins, cell fate, development, biogenesis of cellular components, cell type differentiation.

For *E. coli* we used the following functional categories from the Ecocyc physiological roles annotation [34]: carbon utilization, degradation of macro-molecules, energy metabolism - carbon, energy production/transport, biosynthesis of building blocks, biosynthesis of macromolecules (cellular constituents), central intermediary metabolism, metabolism of other compounds, cell division, cell cycle physiology, motility, chemotaxis, energytaxis (aerotaxis, redoxtaxis etc), genetic exchange/recombination, adaptations, protection, defense/survival, DNA uptake. Each TF was assigned a profile vector in which each position holds the fraction of its target genes with the respective functional annotation. We used a Student t-test to compare the average of the Euclidean distances between profile vectors of TF pairs with significant sequence similarity to the average of the Euclidean distances between the profile vectors of all pairs. Lack of data in humans currently prohibits a systematic measure of functional similarity of TFs.

### Assessment of the number of possible sequences

We considered three features in the binding mechanism of each super-family which contribute to the number of possible sequences: the number of variable positions in each half-site, the relative spacing and orientation between them, and whether the sites are identical (for homo-dimeric TFs) or not (hetero-dimeric TFs). The number of possible sequences for a super-family with P variable positions in each half-site, O possible half-site orientations and S possible half-site spacings is 4^{P*H}*O*S/2, where H = 1(2) if the super-family binds as homo-dimers (hetero-dimers) respectively. In our calculations we divide by 2 to account for reverse complementary sequences. The present study presents these estimates for six of the ten TF super-families (Table 1) for which data on these three features were available.

For the Lambda repressor-like super-family, previous work suggested 3 variable positions in each half site of the homo-dimer [21, 22] and 2 relative orientations of the two monomers [71], resulting in 4^{3}*2/2 = 64 possible sequnces. Helix-loop-helix proteins can bind as either homo-dimers or hetro-dimers, and have two variable positions at each half-site, resulting in an estimated number of possible sequences of 4^{4}/2 = 128. Zn2/Cys6 proteins, such as the yeast GAL4 protein, bind two sequences of length three, with variable spacing (0–12 nucleotides) and three possible orientations (direct, inverted or everted repeats) [43], resulting in 4^{3}*3*13/2 = 1248 possible sequences. The glucocorticoid receptor-like proteins have two variable positions at each half site [21], and can bind as hetero-dimers in three different orientations (direct, inverted or everted repeats) and variable spacing ranging from 0–8 [72], resulting in 4^{4}*3*9/2 = 3456 possible sequences. Winged helix and homeo-domain recognize 6 positions as either homo-dimers or hetro-dimers respectively, with a constant relative orientation and half-site spacing, resulting in 4096/2 and 4096^{2}/2 possible sequences, respectively. An estimation of the number of possible sequences for multi-domain C2H2 zinc fingers is not given, as these proteins can contain a variable number of finger domains (between 2 and more than 30) per protein.

### Coding theory bounds - coloring number bound

We treated the mapping between transcription factors and binding sequences as a coding problem, where the code-words are short DNA sequences of a given length and the messages are TFs. The code-words are abstracted as nodes of a graph, where edges connect any two code-words which differ by one base-pair (Hamming distance 1). This scenario has some similarities with the genetic code, where the code-words or "codons" are 3-base-pair strings, and the messages are the amino acids encoded by these codons. The error-load is the reduction in organism fitness due to erroneous binding of factor A to the code-word assigned to factor B. At one extreme, minimal error-load can be achieved by mapping all code-words to a single transcription factor. At the other extreme, maximal diversity is achieved by mapping each code-word to a different transcription factor, resulting in the same number of TFs as possible code-words.

If one assigns a biological function to the TFs that bind each code-word, it is clear that minimization of error-load would tend to smoothen this mapping. That is, TFs that are likely to bind similar sequences should have a similar biological function. For every mapping of binding sequences to factors, one can assign an error load, which measures the average impact of erroneous binding. It has recently been proposed [29] that in the limit of large errors, the maximal number of coded messages is bounded by the coloring number of the minimal surface which can embed the code-word graph [29]. Heawood's formula states that the coloring number is:

where int [x] denotes the largest integer not greater than x. The coloring number depends on γ, the genus of the surface embedding the graph:

Here V = q^{n}/2 is the number of possible code words encoded by a q-letter string of length n, assuming a code-word which is the reverse complement of another is not available for independent assignment. E = V*(d/2) is the number of graph edges, and F = V*(d/4) is an estimate of the number of faces of the surface embedding the graph. d = (q-1)n is the number of neighbors of each code-word (by identifying sequences and their reverse complements each code-word has twice as many neighboring code-words as a code without the reverse-complement constraint but about half of these code-words are not available for independent assignment). Using this we get the following estimate for the coloring number:

The bound for codes with different n are shown in Table 2.

### Coding theory bounds - sphere packing bound

An alternative possibility for the code mapping is one in which every sequence is assigned to only one transcription factor [6], and the probability of a misread error is thus negligible. The target DNA binding sequences of each transcription factor can be represented as a sphere in the code-word space (Fig 3). The center of each sphere is the consensus sequence, and all sequences differing from the consensus sequence by *e* positions (Hamming distance *e* from the consensus sequence) are assumed to be bound with a non-negligible probability by the TF [2, 8, 10]. Here we assumed *e = 1*.

Unlike the smooth code, the spheres here are non-overlapping. The volume of a "sphere" of radius *e*, which contains all code-words with Hamming distance of *e* or less from a given code-word, is [73]:

The number *N* of non-overlapping spheres of radius 1 is bounded by:

where:

The upper bound in (11) is called the sphere-packing bound [27]. Codes which achieve this bound are called "perfect codes". In such codes the code-word space is fully covered by non-overlapping spheres of Hamming radius 1. Generally, the number of non-overlapping spheres is smaller, as some code-words remain uncovered by any sphere. The factor of one half in Eq. (10) stems from the fact that each sequence effectively represents also its reverse complementary sequence [27].