The method of optimization proposed in [18] used three starting input elements to build the new PWM, namely, an existing PWM, a database of promoter sequences and a set of experimentally verified TF sites. Consensus motifs can also be used instead of existing PWM and the considered database should be well populated with TFBS of interest. We have adopted this method with slight improvements.

### Building initial PWM

To build the initial PWM form the training set of experimentally defined binding sites, we used 63 experimentally defined motifs for human GATA-3 from Jaspar database [

40,

48,

49]. The matrix information was adopted from [

44]. As described in [

16,

18,

50] to build the PWM we first find out the frequencies for each base at each position of the aligned known 63 GATA-3 binding sites. The frequencies are further converted into odds scores by dividing the observed frequency by expected frequency or the background frequency of each nucleotide at each position, averaged over the proximal promoters [

50]. We have derived the expected frequencies using the formula described in [

18]:

where *b* is one of 4 nucleotides (A,C,G or T) at position *i*, *n*
_{
bi
} is the number of times base *b* occurs at the *i*
^{th} position of the motif and *L* is the length of the sequence.

The expected frequencies were derived from the human promoter sequences from Eukaryotic Promoter Database (EPD) [31–33] release 105 (http://www.epd.isb-sib.ch/).

The database contains 1870 non-redundant experimentally verified human promoter sequences. We extracted 600 bp promoter sequences from this database, which comprise up to -499 positions upstream of the transcription start site (TSS) to position +100 downstream with TSS at 0. The promoters are aligned with respect to the TSS. Therefore the value for L in our case is 600, the length of the promoter area.

The positional distribution of the GATA-3 motif derived from the above database is also compared with Database of Transcription Start Sites (DBTSS) [34, 35]. This database contains 32102 promoter sequences aligned with respect to the TSS. The rationale of using both databases (EPD and DBTSS) is that EPD is manually curated and is highly reliable, whereas DBTSS contains more promoters.

The weight for each position of the matrix is derived using the formula described in [

18] which is a modification of Bucher’s formula:

Here *b* is one of the 4 nucleotides, *n*
_{
bi
} is the number of times base *b* occurs at the *i*
^{
th
} position of the motif, *c*
_{
i
} is a constant providing column maximum value to be zero, *s*
_{
i
} is a smoothing parameter preventing the logarithm of zero (or too small a value).

(The parameter

*S*
_{
i
} in Bucher’s formulae is used as the smoothing percentage.) We adopted the criteria as described in [

18]:

*S*
_{
i
} = 0 if the first term under logarithm in Formula 2 is larger than

and

otherwise, where

To calculate weights for the di-nucleotide matrix we used the same Formula 2. In this case

*b* represents one of the 16 di-nucleotides,

*n*
_{
bi
} is the number of times di-nucleotide

*b* is present in position

*i* of the motif and

*e*
_{
bi
} is the expected frequency of the di-nucleotide

*b* at the

*i*
^{
th
} position,

*c*
_{
i
} and

*s*
_{
i
} have the same meaning as for the mono-nucleotide PWM,

*s*
_{
i
} = 0 if the first term under logarithm in Formula 2 is larger than

and

otherwise, where

The mono-nucleotide matrix thus built has 4 rows where each row represents each nucleotide and the columns represent positions inside the motif. The di-nucleotide matrix has 16 rows, with each row representing each di-nucleotide. The number of columns of the matrix represents the length of the motifs which is less by one for di-nucleotide PWM comparing to those for mono-nucleotide.

To calculate the weight score

*S* for a specific sequence we use the formula:

where *L*
_{
m
} is the length of PWM, *w*
_{
bi
} is the weight of nucleotide *b* at position *i* in the PWM. For di-nucleotide matrix we use the same formula with *L*
_{
m
}
*= L*
_{
m
} -1 instead of *L*
_{
m
} and *w*
_{
bi
} represents weight of di-nucleotide.

### Finding the functional window and optimization of the matrix

To obtain the positional distribution of the GATA-3 motif we compare the observed occurrence frequency of the GATA-3 motif with its background or expected frequency along the promoter sequences. The background frequency is determined by shuffling each sequence from the promoter database which results in a randomized DNA sequence with the same nucleotide content.

Shuffling of the sequences was done by cutting each sequence in randomly chosen positions into randomly chosen smaller fragments and rearranging these fragments. The sequences were fragmented with segment lengths from 1 bp to 10 bp. This step was repeated 100 times and the whole process was repeated 100 times. The EPD database was used for the shuffling; therefore the shuffled sequence database contained sequences of same number and length. Since we exclusively considered the promoter region for shuffling, we thereby preserved the proportion of all the nucleotides in the shuffled sequences as that in the promoter datasets. The reason to preserve the proportion is to retain the GC-rich property of the promoter in the shuffled sequences. The GC-proportion was checked with the help of program called “geecee” from the collection of program suite EMBOSS [51] after shuffling the promoter sequences and found to be similar. After shuffling the sequences to compare the proportion of CpG di-nucleotide for first-order dependencies we have used the Promoter Classifier [39] to see how many sequences in both the original and the shuffled promoter sequences are CpG-rich (contain CpG islands). We found that both the datasets have similar number of CpG rich sequences. Moreover, we have further examined the shuffled sequences to see if we reduced the number of CpG while shuffling the sequences. To do this we have used the “cgpreport” program from EMBOSS [51] and calculated the average CpG per sequences in both shuffled and original promoter sequences. From this we found that the average number of CpG in the shuffled sequences is consistent with those in the original promoters.

To identify the area where the GATA-3 binding site motif is over-represented along the aligned promoter sequences, we looked into the distribution of the z-score derived as

where *Obs* is the observed occurrence frequency of GATA-3 element in the promoter sequences and *Exp* is the expected occurrence frequency as found in the shuffled promoter sequences.

The occurrence frequencies were calculated as
where *n*
_{
i
} is the number of promoters containing considered motif starting at position *i* and *N*
_{
s
} is the number of sequences.

The area where the occurrence of the GATA-3 motif is statistically higher than expected, which is represented by z-scores ~3 or higher, is regarded as the initial “functional window” (Figure 1). This region shows the occurrence frequency of GATA-3 binding sites much higher than the background or expected frequency.

We assume that statistically significant occurrence of the sites in the “functional window” reflects importance of this window in biological function. The functional window thus obtained is the initial approximate interval from where the new sites can be incorporated to build a new PWM. The final matrix after optimization would define the exact functional window.

### Calculation of a new GATA-3 PWM from the existing PWM

PWMs are routinely used for prediction of the binding affinities for TFs to a segment of DNA sequence in prokaryotes and eukaryotes [52–56]. TRANSFAC database holds numerous PWMs for large variety of TFs. However the majority of the existing PWMs provide a low level of both sensitivity and specificity [55]. Therefore the need to optimize PWM parameters in order to improve its performance is essential. The method developed by Gershenzon et al. [18] iteratively modifies the PWM by incorporating new putative binding sites located inside the identified functional window from the promoter sequence database until the best possible correlation coefficient is achieved as described below. We follow this method in the present study.

We start with the initial PWM built from the 63 GATA-3 experimentally defined motifs adopted from Jaspar, as described in the previous section. As a control set of sites, we use the 26 unique motifs from the 63 experimentally defined GATA-3 binding sites in human genes from Jaspar. The removal of redundancy from the experimental motifs was important to avoid any biasness toward any motif. The method also utilizes a database to incorporate new putative binding sites of interest to build new PWM. We have used the EPD for this purpose.

Since the initial PWM is not provided with a given cutoff we determine the cutoff value from the correlation coefficient (CC) distribution. The CC is calculated as:

CC is calculated for each cutoff starting from a very stringent threshold and relaxing the threshold until we get the maximal CC. To calculate CC here, we designate TP as the number of sites from the experimentally defined dataset positively identified by the matrix with a given cutoff. FN is defined as the difference between the total number of sites in the experimental dataset and TP. We designate negative sites as all possible sites from the shuffled sequence datasets, which can be calculated as 594 × 1870 where 594 is length of the shuffled promoter sequence (600) minus the length of the matrix and 1870 is the number of sequences in the shuffled dataset. We define FP as the sites picked up as positive from the total negative sites and TN as the difference between the total negative sites and FP.

The method starts with extracting putative binding sites for GATA-3 based on existing PWM with the cutoff determined at the above step. The PWM extracts putative binding sites from inside the identified initial functional window. The functional window was defined comparing the occurrence frequency distribution of the GATA-3 binding sites against the shuffled sequences. A new matrix is built from these aligned sites using the formulae described in [18]. These sites do not need any additional alignment as they are of similar length. The new matrix is obtained from new sites extracted by the original matrix with the given cutoff and sensitivity, from the functional window.

Sensitivity was calculated as:

The optimization is done in three levels, as follows: cutoff value, then motif length and finally functional window.

The objective function that is optimized in this method is the correlation coefficient (Formula 7). This criterion utilizes all the four parameters: true positives, false positives, true negatives and false negatives.

The definitions of the TP, FP, TN and FN for the optimization procedure are slightly different from the previously described.

The TP here is defined as the number of sites positively identified by the new matrix from the given functional window identical to the sites extracted by the original matrix. FP is defined as the difference between the total number of sites identified as positive by the new matrix and the number of sites identified as positive by the original matrix. FN is defined as the difference between the total number of sites from the experimental dataset recognized by the original matrix as positive and the TP. And TN is the total number of possible sites from the functional window subtracting TP, FP and FN:

CC is calculated every time after building a new matrix by changing any parameters. (See the flow chart of the adopted optimization process in [18] [Figure 2 for details.)

First the optimal cutoff value is obtained for the given position and size of the functional window and for the given motif length. This is attained by calculating CC parameter for every changed cutoff value. The cutoff value varies around the initial given cutoff value. The range we have used is from -0.5 to -4.0 with the increment of 0.1. The cutoff value is considered to be optimal where the CC reaches the maximum. Next, the length of the matrix is varied while the optimal cutoff value is kept. If the CC reaches higher value than at the previous step, the modified length is considered as optimal at the current stage. Thus we obtain a modified matrix with optimal modified length and cutoff values. This modified matrix is regarded as the initial matrix for further process of optimization. The optimization cycle continues with this new initial matrix and all the aforementioned steps are repeated. This continues until we reach the maximal CC = 1. Usually it takes 6 to 12 cycles for the matrix to converge, which is consistent with the previous work [18]. After that we proceed to a new functional window by increasing or decreasing its length by 1 bp from either side. The CC is maximized again for each length and position of the functional window. This process is repeated for each considered functional window. Finally we would get a PWM with optimized cutoff value and matrix length from optimal functional window. From this pool of new matrices we use the criteria of sensitivity and specificity to select the best.

The parameters TP, TN, FP and FN described earlier are internal parameters of the procedure, and they are not used to evaluate the sensitivity and specificity of the final PWM. The sensitivity of the optimized PWM is calculated as the number of experimentally confirmed sites recognized by the new matrix. And to calculate the specificity we use the occurrence frequencies of predicted TFBS in the randomized sequences. We assume that the sites recognized as positive from the randomized sequences are the false positives. We calculate occurrence frequency as the average number of positive predictions per bp in the random shuffled dataset:

where *N* is the total number of sequences in the shuffled sequences database and *L* is the length of the sequence subtracting the length of PWM. We will use the notation *OF*
_{
r
} to designate occurrence frequency calculated from shuffled sequence dataset. Therefore higher the occurrence frequencies from the shuffled sequences are, lower is the specificity. Now we choose the matrix resulted from the process of optimization that has sensitivity and specificity higher than the initial PWM.