Artificial ants deposit pheromone to search for regulatory DNA elements
© Liu and Yokota; licensee BioMed Central Ltd. 2006
Received: 07 March 2006
Accepted: 30 August 2006
Published: 30 August 2006
Identification of transcription-factor binding motifs (DNA sequences) can be formulated as a combinatorial problem, where an efficient algorithm is indispensable to predict the role of multiple binding motifs. An ant algorithm is a biology-inspired computational technique, through which a combinatorial problem is solved by mimicking the behavior of social insects such as ants. We developed a version of the ant algorithm that selects a set of binding motifs by considering the potential contribution of every possible DNA sequence of 4 to 7 bp in length.
Human chondrogenesis was used as a model system. The results revealed that the ant algorithm was able to identify biologically known binding motifs in chondrogenesis such as AP-1, NFκB, and Sox9. Some of the predicted motifs were identical to those previously derived with the genetic algorithm. Unlike the genetic algorithm, however, the ant algorithm was able to evaluate the contribution of individual binding motifs as a spectrum of distributed information and predict core consensus motifs from a wider DNA pool.
The ant algorithm offers an efficient, reproducible procedure to predict the role of individual transcription-factor binding motifs using a unique definition of artificial ants.
Extracting biologically meaningful information from large amounts of gene expression data and genomic information is one of the most challenging tasks in the post-Human Genome Project era [1–3]. Since transcriptional processes are regulated combinatorially by multiple regulatory elements, computational modeling is necessary, and an efficient algorithm capable of solving combinatorial optimization problems is desirable. Swarm intelligence is a computational technique that mimics the collective behavior of social insects such as ants and bees [5–7]. Although there is no centralized module that dictates the behavior of individuals, local interactions cause a global optimization pattern to emerge [8–10]. Algorithms based on swarm intelligence have been applied successfully to a large number of hard discrete optimization problems, including the traveling salesman problem, quadratic assignment, and routing in telecommunications networks [6, 8].
The ant algorithm is designed to predict a set of transcription-factor binding motifs using array-derived gene expression data. Based on the idea that gene expression levels are regulated by the combinatorial actions of multiple transcription-factor binding motifs, we formulated a linear relationship between the observed gene expression patterns and the number of occurrences of potential transcription-factor binding sites in the regulatory region of each gene. Thus, the ant algorithm seeks the motifs whose occurrences in the regulatory region collectively correlate with the global expression levels.
We examined whether the version of the ant algorithm presented here would predict the potential contribution of all conceivable transcription-factor binding motifs. Most of the existing methods evaluate either a single oligo sequence at a time [11–14], or a fixed set of multiple binding motifs with little information on the role of individual motifs [14, 15]. We previously utilized the genetic algorithm as well as a particle swarm optimization procedure, and searched for a suboptimal combination of binding motifs. Such approaches, however, were not ideal because of the astronomical number of possibilities in the solution space (motif combinations) and their limited scope in the presence of redundant regulatory mechanisms. Here, we describe a novel application of ant algorithms to evaluate the role of each individual binding motif from a population of random DNA sequences. Unlike the genetic algorithm or particle swarm optimization, the ant algorithm uses an artificial pheromone spectrum to evaluate individual binding sites rather than searching for the single best motif combination.
In the current study, prediction of transcription-factor binding motifs was formulated as a foraging problem of artificial ants using human chondrogenesis in vitro as a model system. Each ant represented a set of random DNA sequences and deposited a varying amount of artificial pheromone depending on the deviation from the array-derived data, which is defined in Eq. 4 in the Methods section. Pheromones were constantly renewed by deposition as well as partial evaporation, and ants were attracted to DNA sequences with a high concentration of pheromones. This reinforcement process was intended to select the most desirable group of transcription-factor binding motifs from random DNA sequences.
The microarray-derived mRNA expression data used throughout the study were published by Sekiya et al. (see additional file 1). The prediction of transcription-factor binding motifs was conducted previously with the genetic algorithm, and the results were experimentally validated using genes specific to chondrogenesis such as type II collagen. Here, we extended our analysis by developing a version of the ant algorithm and evaluated the potential role of all DNA sequences in the solution space. All conceivable random sequences from 4 bp to 7 bp in length were analyzed, although the results with 5-bp DNA sequences are mainly described, in order to validate the novel ant algorithm against the previous results obtained with the genetic algorithm. The TRANSFAC database was used to derive biologically known consensus sequences (ranging from 5 to 30 bp).
The overall strategy of the ant algorithm is to evaluate individual motif candidates while considering their combinatorial effects with other motifs. Briefly, each ant is defined to represent a potential combination of multiple transcription-factor binding motifs, where the number of ants was determined based on statistical studies with the Akaike information criterion [15, 18]. Using a mathematical model for expression profiles (Eqs. 1–3), each ant is associated with a model error (Eq. 4). A high pheromone concentration is assigned if the model error of the ant is small, and vice versa (Eq. 5). In the next iteration of ant migration, the probability of a particular motif being selected by any ant is affected by the previous pheromone concentration (Eq. 6). The pheromone concentration deposited on each motif candidate is additive and evaporative (Eq. 7).
In the current study, 100 ants ($N_A = 100$) were utilized with 1000 computational iterations for deposition and evaporation of pheromones. The final pheromone concentrations assigned to individual DNA sequences were plotted in the form of a pheromone spectrum (see Methods). Two key parameters here were a "pheromone preference factor (ε)" and a "pheromone evaporation factor (δ)": the pheromone preference factor regulated the affinity of ants towards artificial pheromone, and the pheromone evaporation factor was used to reduce the influence of earlier computational outcomes. Using these two factors, "reproducibility (r)" and "selectivity (s)" of the motif selections were analyzed. Reproducibility was defined as the cross-correlation among the pheromone spectra, and selectivity was defined as "1 – information entropy" (see Methods).
Reproducibility and selectivity
Motif length analysis
Comparison of the predicted binding motifs to the biologically known motifs
Known binding sites
This study described a novel application of the ant algorithm in predicting the role of each random DNA sequence as a transcription-factor binding motif in human chondrogenesis. The prediction procedure was formulated as a combinatorial problem of selecting a set of multiple motifs, followed by a histogram analysis to build a spectrum of potential contributions among all conceivable motifs. Using human chondrogenesis in vitro as a model system, we demonstrate that the ant algorithm is capable of identifying DNA sequences found in biologically known motifs such as AP-1, CREB, Sox9, NFκB, Egr-1, AP-2, Stat, Smad, E47, and Oct-1, as well as unknown candidates. We discuss the described ant algorithm focusing on its characteristic formulation, selectivity and reproducibility, computational efficiency, and biological relevance.
The first feature of the described algorithm is the definition of an artificial ant as a set of m transcription-factor binding motifs (m = 10 in this study). In this algorithm, each ant was assigned its own set of m binding motifs and deposited an equal amount of pheromone on each motif in the assigned set. A pheromone spectrum was then built from the sum of the pheromones deposited by the $N_A$ ants ($N_A = 100$). A value of m > 1 allowed us not only to evaluate the combinatorial effect of multiple binding motifs but also to reduce the number of computational iterations.
The second feature, distinctive among biology-inspired algorithms, is that selectivity and reproducibility are controlled in a well-characterized way by two key parameters: the pheromone preference factor, ε, and the pheromone evaporation factor, δ. The value of ε determines the preference for pheromones, and the value of δ regulates the fading of the previous pheromone concentration. Selectivity was defined as "1 – informational entropy" to evaluate the distance from randomness, while reproducibility was defined as the correlation among the spectra. A higher selectivity in general implies that a limited number of motif candidates receive a significantly higher pheromone concentration than most of the other candidates. The extreme cases of ε ~ 0 or ε > 100 yielded low reproducibility, since the final pheromone spectrum was predominantly influenced by the ants in the first or the last generation. The value of δ controlled the evaporation of previous information, and a large value of δ placed more emphasis on recent decisions. The values of ε and δ can be selected depending on the purpose of a particular study. In this study we chose these parameters to reproducibly obtain ~10 peaks in the spectrum. The number of peaks was pre-determined using the Akaike information criterion.
A clear advantage of the ant algorithm is a computational efficiency to reach a stable solution compared to the other evolutionary algorithms such as the genetic algorithm. The major difference between the ant algorithm and the genetic algorithm is their solution space. The ant algorithm searches for a group of the best motifs in a space of individual binding motifs, while the genetic algorithm seeks the best combination. The former space is apparently more restricted than the latter combinatorial space. Therefore, the ant algorithm has a clear advantage to terminate the search. Furthermore, the advantage of the ant algorithm includes identification of redundant transcription-factor binding motifs in eukaryotic gene regulation. Neither the genetic algorithm nor particle swarm optimization is well suited to include redundant motifs in a final solution.
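To give a sense of scale for the two solution spaces, the following sketch compares the number of individual 5-bp candidates (M = 512 after merging reverse complements) with the number of possible combinations of m = 10 motifs. Treating a combination as an unordered set of distinct motifs is an assumption made here for illustration, and the variable names are hypothetical.

```python
import math

M = 512  # 5-bp candidates after merging reverse complements
m = 10   # motifs per combination, as used in this study

# Space scored by the pheromone spectrum: one entry per individual motif.
single_space = M

# Space of combinations (assumed here: unordered sets of m distinct motifs).
combo_space = math.comb(M, m)

print(f"individual motifs: {single_space}")
print(f"combinations of {m}: roughly 10^{len(str(combo_space)) - 1}")
```

The gap of many orders of magnitude is the reason a spectrum over individual motifs can stabilize with far fewer evaluations than a search over combinations.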
The ant algorithm also differs from other model-based approaches such as REDUCE and principal component analysis. In REDUCE, for instance, motifs are selected recursively in an add-on manner, so that each selection reduces the model error by the largest degree. This selection strategy makes later selections strongly dependent on earlier ones. The ant algorithm, on the other hand, can avoid such a potential conflict in selection. In order to compare these two algorithms, numerical simulations were conducted using a benchmark dataset of 200 artificial genes (see additional file 1). The simulation results suggested that REDUCE and the ant algorithm offer similar power in predicting the correct number of embedded motifs. Although the advantage of REDUCE is its superior reproducibility with a shorter computational time, the ant algorithm appears better suited to predicting a motif longer than 5 bp, including a motif consisting of a dimeric binding site. Biologists are usually advised to use several complementary computational tools to identify regulatory elements from microarray data. The ant algorithm seems to complement analytical approaches such as REDUCE and the principal component-based method in identifying longer motifs and dimeric binding sites.
The ant algorithm can still evaluate combinatorial effects among multiple factors like other model-based approaches reported previously [11, 14–16]. Defining a background model is a general approach in searching for over-representation of DNA words within a sequence set. The ant algorithm, however, aims at searching for the DNA sequences whose occurrences in the regulatory region correlate with the observed expression levels in the context of combinations of multiple motifs. Therefore, the described ant algorithm is in principle not overly sensitive to over-represented DNA words. The expression levels are described as a linear combination of the role of individual motifs with different functions. Therefore, it is possible that some motifs, predicted to be a stimulator by the ant algorithm, may appear in the regulatory region of the gene whose expression level is down-regulated.
A motif length analysis and sequence comparisons supported, at least in part, the statistical and biological significance of the selected transcription-factor binding motifs. The 25 predicted motifs matched the sequences of 10 binding motifs known to be involved in human chondrogenesis. These binding motifs include AP-1, CREB, Sox9, NFκB, Egr-1, AP-2, Stat, Smad, E47, and Oct-1. Interestingly, two of the 5-bp motifs (GCCCA and ACGCA) together with a 6-bp motif (GCCCAC) and a 7-bp motif (CGCCCAC) constituted a contiguous 10-bp binding motif of Egr-1 ([A/T]TGCGTGGGCG[G/T]), confirming a strong involvement of Egr-1 in chondrogenesis. Furthermore, two 5-bp motifs (GATCC and AGGGG) coincided with a 9-bp consensus sequence of NFκB (p50) binding sites (GGGGAT[C/T]CCCC[A/T]NTC[C/G]). It is possible to evaluate a pool of candidates of varying length simultaneously by including them together in the simulation. Note that the predictions of the ant algorithm should be treated as a set of hypotheses, and experimental validation remains necessary.
In summary, the described procedure is the first application of ant algorithms for prediction of transcription-factor binding motifs. Other definitions of artificial ants and pheromones are possible. For instance, a group of heterogeneous ants could behave like transcription factors or RNA polymerases and deposit different kinds of pheromones directly onto genomic DNA sequences. We believe that this application will be advanced by further studies for improving computational efficiency and biological relevance.
We developed one form of ant algorithm for the prediction of transcription-factor binding motifs. The consensus sequences of 10 biologically known binding motifs have significant similarities with the predicted motifs. Unlike bone, which heals readily, joint tissues such as articular cartilage hardly regenerate, and in vitro chondrogenesis is therefore an extremely challenging subject in tissue engineering. The transcriptional mechanism of human chondrogenesis remains largely unknown. With its efficient search procedure and its controllable reproducibility and selectivity, the described version of the ant algorithm allows us to provide a set of known and novel molecular targets for biological verification.
Biological model system
We focused our analysis on 55 genes whose alterations in mRNA were statistically significant during human chondrogenesis, as published by Sekiya et al. in Table 1, and whose 5'-end flanking DNA sequences were identifiable with the UCSC genome browser (see additional file 1). The logarithmic ratios of the gene expression levels on days 1, 7, 14, and 21 relative to day 0 were used in the model:
$$z_i(t) = \log \frac{y_i(t)}{y_i(0)} \qquad (1)$$
where $y_i(t)$ = mean mRNA level of the i-th gene on day 1, 7, 14 or 21, and $y_i(0)$ = mean mRNA level of the i-th gene on day 0. Positive and negative ratios indicate upregulation and downregulation relative to day 0, respectively.
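As a small worked example of the logarithmic ratio described above, the sketch below converts a time course of mean mRNA levels into ratios relative to day 0. The base of the logarithm is not stated in the text; base 2 is assumed here, and the function name and the sample values are hypothetical.

```python
import math

def log_ratios(levels_by_day):
    """Map {day: mean mRNA level} to {day: log2(level / day-0 level)}.

    Base 2 is an assumption; the source does not state the base.
    """
    day0 = levels_by_day[0]
    return {
        day: math.log2(level / day0)
        for day, level in levels_by_day.items()
        if day != 0
    }

# Hypothetical gene: doubled on day 1, halved on day 7.
example = log_ratios({0: 100.0, 1: 200.0, 7: 50.0})
print(example)
```

A positive value marks upregulation relative to day 0 and a negative value marks downregulation, matching the sign convention in the text.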
The global gene expression patterns were modeled using the number of occurrences of the potential transcription-factor binding sites in the 5'-flanking regulatory region of the gene:
$$Z_n(t) = H_{n \times m} X_m(t) \qquad (2)$$
where $Z_n(t)$ represents the logarithmic ratios of differential gene expression levels derived in Eq. 1, the element $h_{ij}$ in $H_{n \times m}$ denotes the number of occurrences of the j-th transcription-factor binding motif in the regulatory region of the i-th gene, and $X_m(t)$ corresponds to the functional levels of the m predicted binding sites at time t. Positive and negative values in $X_m(t)$ suggest stimulatory and inhibitory roles of the corresponding transcription-factor binding motif, respectively. Here, n is the number of genes. The upstream regulatory sequence of each gene was acquired from the UCSC genome browser. Based on the results of our previous studies [15, 16] and others [20, 21], we used the 1000-bp region upstream of the transcription start site (see additional file 1). This region, however, should be considered a parameter to be chosen, and the model can include further upstream regions, downstream regions, or untranslated regions. As potential transcription-factor binding motifs, a complete set of random DNA sequences of 4 to 7 bp in length (AAAA, AAAC, AAAG, AAAT, etc.) was considered, and their distribution was identified in the 5'-end flanking region of the genes in the model. In the formulation of Eq. 2, reverse complementary motifs were combined with their counterparts. Namely, we counted the appearances of both the forward and the reverse motifs and treated them together as a single motif candidate. Therefore, the total numbers of 4-, 5-, 6-, and 7-bp motifs are 136, 512, 2080, and 8192, respectively.
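The candidate counts quoted above can be reproduced by enumerating every sequence of a given length and merging each one with its reverse complement. The sketch below does this and also builds an occurrence matrix in the spirit of H. The function names are hypothetical, and counting overlapping hits on either strand is an assumption; the text does not specify how overlapping or palindromic hits were handled.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def canonical(seq):
    """Represent a motif and its reverse complement by one shared key."""
    return min(seq, reverse_complement(seq))

def candidate_motifs(length):
    """All motifs of a given length, with reverse complements merged."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=length)})

def count_occurrences(region, motif):
    """Count overlapping hits of a motif or its reverse complement."""
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(region[i:i + k] in targets for i in range(len(region) - k + 1))

def occurrence_matrix(regions, motifs):
    """One row per regulatory region (gene), one column per motif."""
    return [[count_occurrences(r, m) for m in motifs] for r in regions]

counts = {k: len(candidate_motifs(k)) for k in (4, 5, 6, 7)}
print(counts)
```

Odd lengths have no self-complementary sequences, so their counts are exactly half of 4^k; even lengths keep their palindromes as separate candidates, which is why 4-bp and 6-bp give 136 and 2080 rather than 128 and 2048.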
We evaluated a set of random sequences in two steps. First, the functional level of each motif (x value) was estimated using a least-squares procedure (Eq. 3). Second, the cost function of each set was defined as the sum of squared differences between the experimental and the predicted gene expression levels (Eq. 4):
where $e_k$ is the cost function of the k-th ant, and $z_i(t)$ and $\hat{z}_i(t)$ are the observed and the predicted gene expression levels in $Z_n(t)$ and $\hat{Z}_n(t)$, respectively.
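The two-step evaluation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the least-squares estimate is taken to be the ordinary normal-equation solution of Eq. 2, the cost is summed over all genes and time points, and the helper names (`solve`, `fit_and_cost`) are hypothetical. Only the standard library is used.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ x = b (a square, b a matrix) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: motif columns are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_and_cost(h, z):
    """h: n x m occurrence matrix, z: n x T observed log ratios.

    Returns the estimated functional levels (m x T) and the sum of
    squared errors between observed and predicted expression.
    """
    ht = transpose(h)
    x_hat = solve(matmul(ht, h), matmul(ht, z))   # assumed form of Eq. 3
    z_hat = matmul(h, x_hat)                      # prediction via Eq. 2
    cost = sum((obs - pred) ** 2
               for z_row, p_row in zip(z, z_hat)
               for obs, pred in zip(z_row, p_row))
    return x_hat, cost

# Toy case: 3 genes, 2 motifs, 1 time point.
H = [[1, 0], [0, 1], [1, 1]]
x_exact, cost_exact = fit_and_cost(H, [[2.0], [-1.0], [1.0]])
x_noisy, cost_noisy = fit_and_cost(H, [[2.0], [-1.0], [2.0]])
```

In the first toy case the data are generated exactly by levels (2, -1), so the error vanishes; in the second the third gene deviates and the residual error is what an ant would carry forward as its cost.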
From the group of random DNA sequences, the ant algorithm with $N_A$ ants was used to select the set of m transcription-factor binding motifs that would minimize the cost function. The ant algorithm in this study comprised three steps: deposition of pheromones, pheromone-guided selection, and evaporation of pheromones.
Deposition of pheromones – First, each ant was assigned m random DNA sequences as potential transcription-factor binding motifs and was evaluated by its cost function in Eq. 4 [22, 23]. Based on its cost in Eq. 4, each of the $N_A$ ants deposited the same amount of pheromone on each of its assigned DNA sequences. At the i-th iteration, for instance, the amount of deposition on each motif was defined as:
where $f_{j,i}$ = amount of pheromone deposited on the j-th potential binding motif in the i-th iteration, $K_{j,i}$ = the set of all ants that host the j-th motif in the i-th iteration, $e_k$ = cost function of the k-th ant derived from Eq. 4, and α = power factor for error evaluation (α > 1).
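A minimal sketch of the deposition step is given below. The exact form of Eq. 5 is not reproduced in this text, so the rule used here, in which each ant adds $(1/e_k)^{\alpha}$ to every motif it hosts, is an assumption chosen to match the description (smaller error, more pheromone; deposits summed over the ants in $K_{j,i}$). The function name and the default α = 2 are hypothetical.

```python
def deposit(ants, errors, n_motifs, alpha=2.0):
    """Pheromone deposited on each motif in one iteration.

    ants:   list of motif-index lists, one list per ant
    errors: model error e_k of each ant (must be positive)
    alpha:  power factor for error evaluation (alpha > 1)
    """
    deposited = [0.0] * n_motifs
    for motifs, e_k in zip(ants, errors):
        amount = (1.0 / e_k) ** alpha   # assumed form: low error, high deposit
        for j in set(motifs):           # same amount on each hosted motif
            deposited[j] += amount
    return deposited

# Two ants share motif 1; the first ant has the smaller error.
f = deposit(ants=[[0, 1], [1, 2]], errors=[1.0, 2.0], n_motifs=4)
print(f)
```

Because α > 1, halving an ant's error more than doubles its deposit, which sharpens the contrast between good and poor motif sets.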
Pheromone-guided selection – Based on the pheromone concentration assigned to each of the potential transcription-factor binding motifs, each of the $N_A$ ants selected m DNA sequences at the (i+1)-th iteration:
where $p_{j,i+1}$ = probability of selecting the j-th binding motif, ε = pheromone preference factor (ε > 0), and $F_{j,i}$ = cumulative pheromone concentration of the j-th binding motif. Note that M is the total number of potential transcription-factor binding motifs, which is 136, 512, 2080, and 8192 for 4-, 5-, 6-, and 7-bp selections, respectively. In the limit ε = 0, the selection of DNA sequences would be conducted randomly without any preference for pheromones.
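The selection step can be sketched as a weighted draw. Eq. 6 is not reproduced in this text; the form assumed here, $p_j \propto F_j^{\varepsilon}$ normalized over all M motifs, is consistent with the statement that ε = 0 gives a uniform random selection. Drawing the m motifs of one ant without replacement is a further assumption, and the function names are hypothetical. Pheromone values are assumed to be positive.

```python
import random

def selection_probabilities(pheromone, epsilon):
    """Assumed form of Eq. 6: p_j = F_j**eps / sum_l F_l**eps."""
    weights = [f ** epsilon for f in pheromone]
    total = sum(weights)
    return [w / total for w in weights]

def select_motifs(pheromone, epsilon, m, rng):
    """Draw m distinct motif indices, one at a time, with weight F**eps."""
    weights = [f ** epsilon for f in pheromone]
    chosen = []
    for _ in range(m):
        j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        chosen.append(j)
        weights[j] = 0.0   # a motif is hosted at most once per ant (assumption)
    return chosen

uniform = selection_probabilities([1.0, 2.0, 3.0], epsilon=0.0)
biased = selection_probabilities([1.0, 2.0, 3.0], epsilon=1.0)
picks = select_motifs([1.0, 2.0, 3.0, 4.0], epsilon=1.0, m=3, rng=random.Random(0))
```

Raising ε concentrates the probability mass on the motifs with the most pheromone, which is the mechanism behind the trade-off between selectivity and reproducibility discussed in the text.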
Evaporation of pheromones – The pheromone concentration, $F_{j,i}$, was updated at each iteration step:
$$F_{j,i+1} = (1 - \delta) F_{j,i} + f_{j,i} \qquad (7)$$
where δ = pheromone evaporation factor (0 ≤ δ ≤ 1). When δ = 0, the pheromone is preserved without evaporation. Conversely, with δ = 1 the previous pheromone information is completely lost.
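The update in Eq. 7 and its two limiting cases can be checked with a few lines of code. The function name and the sample values are hypothetical.

```python
def evaporate_and_deposit(pheromone, deposits, delta):
    """Eq. 7: F_{j,i+1} = (1 - delta) * F_{j,i} + f_{j,i} for every motif j."""
    return [(1.0 - delta) * F + f for F, f in zip(pheromone, deposits)]

current = [4.0, 2.0]
new_deposit = [1.0, 0.0]

kept = evaporate_and_deposit(current, new_deposit, delta=0.0)   # no evaporation
half = evaporate_and_deposit(current, new_deposit, delta=0.5)   # partial fading
lost = evaporate_and_deposit(current, new_deposit, delta=1.0)   # history erased
```

With δ = 0 the spectrum is a running sum over all iterations, whereas with δ = 1 it reflects only the most recent generation of ants.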
Evaluation of reproducibility and selectivity
Two key parameters in the ant algorithm were ε (pheromone preference factor), and δ (pheromone evaporation factor). In order to evaluate the role of these two parameters in reproducibility and selectivity of transcription-factor binding motifs, we defined reproducibility, r (0 ≤ r ≤ 1), and selectivity, s (0 ≤ s ≤ 1):
$$r = \rho(\Phi, \Phi') \qquad (8)$$
where ρ = correlation coefficient between two pheromone spectra Φ and Φ', M = total number of DNA sequences in the model, and $\phi_j$ = final pheromone concentration of the j-th potential transcription-factor binding motif. With r = 1, two pheromone spectra become identical. The selectivity parameter was defined as "1 – informational entropy," and with s = 1 only one binding motif received pheromones, with no pheromone deposition on the others. Note that a similar definition of informational entropy was first employed to evaluate variations in expression profiles.
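Both measures can be sketched directly from their verbal definitions. Reproducibility is computed below as the Pearson correlation between two spectra. For selectivity, the expression for the entropy is not reproduced in this text, so the version here is an assumption: the spectrum is normalized to sum to one and the Shannon entropy is divided by log M, which gives s = 0 for a flat spectrum and s = 1 when a single motif holds all the pheromone. Function names are hypothetical.

```python
import math

def reproducibility(phi_a, phi_b):
    """Pearson correlation between two pheromone spectra (Eq. 8)."""
    n = len(phi_a)
    mean_a = sum(phi_a) / n
    mean_b = sum(phi_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(phi_a, phi_b))
    var_a = sum((a - mean_a) ** 2 for a in phi_a)
    var_b = sum((b - mean_b) ** 2 for b in phi_b)
    return cov / math.sqrt(var_a * var_b)

def selectivity(phi):
    """1 - normalized Shannon entropy of the spectrum (assumed form)."""
    total = sum(phi)
    probs = [v / total for v in phi if v > 0]   # zero entries contribute nothing
    entropy = -sum(p * math.log(p) for p in probs) / math.log(len(phi))
    return 1.0 - entropy

r_same = reproducibility([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
s_flat = selectivity([1.0, 1.0, 1.0, 1.0])
s_peak = selectivity([0.0, 5.0, 0.0, 0.0])
```

A Pearson coefficient can in principle be negative, whereas the text states 0 ≤ r ≤ 1; how negative correlations were treated is not specified, so the sketch returns the raw coefficient.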
Monte Carlo simulation and comparison to TRANSFAC database
The independent models using different lengths of DNA sequences as potential transcription-factor binding motifs resulted in several common core DNA sequences. In order to evaluate the statistical significance of identifying 4-bp core sequences, a Monte Carlo simulation was conducted. First, 25 DNA sequences were randomly selected in each of the models with 4-, 5-, 6-, or 7-bp binding motifs. Then, the number of 4-bp DNA sequences common to the four models was counted. This procedure was repeated 1,000 times, and a p-value for finding a particular number of 4-bp core sequences was evaluated. The predicted motifs were compared with the biologically known motifs in the TRANSFAC database using the previously published procedure.
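The Monte Carlo procedure can be sketched as below. Several details are assumptions made for illustration: a 4-bp sequence is counted as "common" when it occurs as a substring (on either strand) of at least one selected motif in every model, and the p-value is the fraction of random trials with at least as many common cores as observed. All function names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def canonical(seq):
    """Shared key for a sequence and its reverse complement."""
    rc = "".join(COMPLEMENT[b] for b in reversed(seq))
    return min(seq, rc)

def random_motifs(length, count, rng):
    """Draw `count` distinct candidates uniformly from the merged motif set."""
    chosen = set()
    while len(chosen) < count:
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        if seq == canonical(seq):   # keep one representative per strand pair
            chosen.add(seq)
    return chosen

def cores(motifs, core_len=4):
    """All 4-bp substrings of the given motifs, strand-merged."""
    return {canonical(m[i:i + core_len])
            for m in motifs for i in range(len(m) - core_len + 1)}

def common_core_count(rng, lengths=(4, 5, 6, 7), per_model=25):
    """Number of 4-bp cores shared by random selections from all models."""
    shared = None
    for length in lengths:
        found = cores(random_motifs(length, per_model, rng))
        shared = found if shared is None else shared & found
    return len(shared)

def p_value(observed, trials=1000, seed=0):
    """Fraction of random trials with at least `observed` common cores."""
    rng = random.Random(seed)
    hits = sum(common_core_count(rng) >= observed for _ in range(trials))
    return hits / trials
```

For example, `p_value(observed=5)` would estimate how often five or more shared 4-bp cores arise by chance; the observed count itself has to come from the actual model results.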
Comparison between the ant algorithm and REDUCE
In order to compare the capabilities of the ant algorithm with those of REDUCE, a benchmark dataset consisting of 200 artificial genes was generated and numerical simulations were conducted. In the dataset, promoter sequences of 1000 bp in length were randomly generated, and a set of hypothetical binding motifs (6–10 bp long) with known functional levels was embedded arbitrarily in the promoter sequences. The expression levels of the 200 genes were modelled using Eq. 2, and the expression levels predicted using the ant algorithm and REDUCE were evaluated (see additional file 1).
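A benchmark of this kind could be generated along the following lines. This is a simplified sketch, not the dataset used in the study: the embedding probability, the example motifs and their functional levels are hypothetical, motifs are embedded on the forward strand only, and occurrences are counted with `str.count`, which ignores overlapping and reverse-strand hits.

```python
import random

def make_benchmark(motif_levels, n_genes=200, promoter_len=1000,
                   embed_prob=0.3, seed=0):
    """Return a list of (promoter, expression) pairs for artificial genes.

    motif_levels: {motif: functional level}, all values hypothetical.
    Expression follows the linear model of Eq. 2: the sum over motifs of
    (functional level x number of occurrences in the promoter).
    """
    rng = random.Random(seed)
    genes = []
    for _ in range(n_genes):
        seq = [rng.choice("ACGT") for _ in range(promoter_len)]
        for motif in motif_levels:
            if rng.random() < embed_prob:
                start = rng.randrange(promoter_len - len(motif) + 1)
                seq[start:start + len(motif)] = motif   # overwrite in place
        promoter = "".join(seq)
        expression = sum(level * promoter.count(motif)
                         for motif, level in motif_levels.items())
        genes.append((promoter, expression))
    return genes

# Hypothetical 6- and 8-bp motifs with one stimulatory and one inhibitory level.
levels = {"TGACTC": 1.5, "GGGACTTT": -1.0}
benchmark = make_benchmark(levels)
```

Because the expression values are computed from the final promoter strings, the dataset stays consistent with the linear model even when one embedded motif happens to overwrite part of another.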
The authors appreciate Hui Zhao and Andy Chen for technical support. This study was in part supported by NIH R01 AR50008.
- Collins FS, Green ED, Guttmacher AE, Guyer MS: A vision for the future of genomics research. Nature. 2003, 422 (6934): 835-847. 10.1038/nature01626.
- Ideker T, Galitski T, Hood L: A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet. 2001, 2: 343-372. 10.1146/annurev.genom.2.1.343.
- Kitano H: Standards for modeling. Nat Biotechnol. 2002, 20 (4): 337. 10.1038/nbt0402-337a.
- de Jong H: Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol. 2002, 9 (1): 67-103. 10.1089/10665270252833208.
- Bonabeau E, Dorigo M, Theraulaz G: Inspiration for optimization from social insect behaviour. Nature. 2000, 406 (6791): 39-42. 10.1038/35017500.
- Kennedy J, Eberhart R: Particle swarm optimization: November 1995. 1995, 4: 1942-1948.
- Robinson J, Rahmat-Samii Y: Particle swarm optimization in electromagnetics. IEEE Transactions on Antennas and Propagation. 2004, 52 (2): 397-407. 10.1109/TAP.2004.823969.
- Dorigo M, Gambardella LM: Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Transactions on Evolutionary Computation. 1997, 1 (1): 53-66. 10.1109/4235.585892.
- Dorigo M, Maniezzo V, Colorni A: Ant system: optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics. 1996, 26 (1): 167-171.
- Stützle T, Dorigo M: A short convergence proof for a class of ACO algorithms. IEEE Transactions on Evolutionary Computation. 2002, 6 (4): 358-365. 10.1109/TEVC.2002.802444.
- Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using correlation with expression. Nat Genet. 2001, 27 (2): 167-171. 10.1038/84792.
- Pilpel Y, Sudarsanam P, Church GM: Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet. 2001, 29 (2): 153-159. 10.1038/ng724.
- Xu XL, Olson JM, Zhao LP: A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model. Hum Mol Genet. 2002, 11 (17): 1977-1985. 10.1093/hmg/11.17.1977.
- Wen X, Li Y, Hu K, Dai C, Liu Y: Hepatocyte growth factor receptor signaling mediates the anti-fibrotic action of 9-cis-retinoic acid in glomerular mesangial cells. Am J Pathol. 2005, 167 (4): 947-957.
- Liu Y, Yokota H: Modelling and identification of transcription-factor binding motifs in human chondrogenesis. Systems Biology. 2004, 1 (1): 85-92. 10.1049/sb:20045012.
- Sekiya I, Vuoristo JT, Larson BL, Prockop DJ: In vitro cartilage formation by human adult stem cells from bone marrow stroma defines the sequence of cellular and molecular events during chondrogenesis. Proc Natl Acad Sci U S A. 2002, 99 (7): 4397-4402. 10.1073/pnas.052716199.
- Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 2000, 28 (1): 316-319. 10.1093/nar/28.1.316.
- Akaike H: A new look at the statistical model identification. IEEE Trans Automatic Control. 1974, AC-19 (6): 716-723. 10.1109/TAC.1974.1100705.
- Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23 (1): 137-144. 10.1038/nbt1053.
- Davuluri RV, Grosse I, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nat Genet. 2001, 29 (4): 412-417. 10.1038/ng780.
- Mach V: PRESTA: associating promoter sequences with information on gene expression. Genome Biol. 2002, 3 (9): research0050. 10.1186/gb-2002-3-9-research0050.
- Qian L, Liu Y, Sun HB, Yokota H: Systems analysis of matrix metalloproteinase mRNA expression in skeletal tissues. Front Biosci. 2002, 7: a126-34.
- Sun HB, Liu Y, Qian L, Yokota H: Model-based analysis of matrix metalloproteinase expression under mechanical shear. Ann Biomed Eng. 2003, 31 (2): 171-180. 10.1114/1.1540635.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.