- Research article
- Open Access
Inference of transcriptional regulation using gene expression data from the bovine and human genomes
© Zadissa et al; licensee BioMed Central Ltd. 2007
- Received: 29 August 2006
- Accepted: 03 August 2007
- Published: 03 August 2007
Gene expression is in part regulated by sequences in promoters that bind transcription factors. Thus, co-expressed genes may have shared sequence motifs representing putative transcription factor binding sites (TFBSs). However, for agriculturally important animals the genomic sequence is often incomplete. The more complete human genome may be able to be used for this prediction by taking advantage of the expected evolutionary conservation in TFBSs between the species.
A method of de novo TFBS prediction based on MEME was implemented, tested, and validated on a muscle-specific dataset.
Muscle specific expression data from EST library analysis from cattle was used to predict sets of genes whose expression was enriched in muscle and cardiac tissues. The upstream 1500 bases from calculated orthologous genes were extracted from the human reference set. A set of common motifs were discovered in these promoters. Slightly over one third of these motifs were identified as known TFBSs including known muscle specific binding sites. This analysis also predicted several highly statistically significantly overrepresented sites that may be novel TFBS.
An independent analysis of the equivalent bovine genomic sequences was also done, this gave less detailed results than the human analysis due to both the quality of orthologue prediction and assembly in promoter regions. However, the most common motifs could be detected in both sets.
Using promoter sequences from human genes is a useful approach when studying gene expression in species with limited or non-existing genomic sequence. As the bovine genome becomes better annotated it can in turn serve as the reference genome for other agriculturally important ruminants, such as sheep, goat and deer.
- Bovine Genome
- Motif Prediction
- Unique Motif
- Dissimilarity Score
- Human RefSeq
To date, many evolutionary studies have focused on conservation of protein coding sequences between species, however, gene expression patterns across species can also be used for evolutionary studies . The focus of this study is on the proximal promoter . Proximal promoters consist of a heterogeneous collection of smaller regulatory elements including transcription factor binding sites (TFBSs). TFBSs are short DNA sequences which bind transcription factors that modulate the level of expression of their cognate gene(s) . Many stimulatory TFBSs are positioned near the transcription start site (TSS), within the proximal promoter region . In some cases, co-regulated genes or tissue-specific genes contain a common set of TFBSs and are controlled by same transcription factors. Analysis of promoter regions is one of the major approaches in understanding the transcriptional regulatory mechanisms. Hence, by identifying the binding sites in the promoters, the pattern of transcriptional regulation may be inferred . Tissue specific promoters, such as those examined in this study often have a single dominant peak and clustering of sequence specific factor binding sites around the TSS [6, 7].
The region analysed in this study is 1500 bases prior to the TSS. Deletion assays of the region 1000 bases prior to the TSS of 45 ENCODE promoters  indicated positional preference, relative to the TSS, for elements that contribute negatively and positively to the promoter activity. Negative elements resided between -1000 to -500 bp upstream of the TSS, while the positive elements were placed closer to the TSS, between -350 to -40 bp upstream of the TSS in 55% of the genes tested . These conclusions supported previous analyses of functional regions in promoter sequences where it has been established that most (91%) of promoters spanning 550 bp upstream of the TSS had significant transcriptional activity, indicating that the region contains active binding sites . This study does not examine cis-regulatory elements, particularly in enhancers and silencers outside this region .
Co-expressed genes can be determined by a number of techniques, such as microarray experiments or EST profiling. These methods usually examine a subset of genes in the genome. However, identifying common TFBSs in these genes allows one to infer the mechanism of co-regulation, establishing genome-wide frequencies of the TFBSs, and detect additional co-regulated genes not present on the array or detected in the original EST library. This approach also has the major benefit that it is sequence-based and is less dependent on prior knowledge on which many other methods of interpreting genomic studies depend. The current challenge is to extend this technique from well-characterised model species, such as human, to those with limited genomic annotation information, such as cattle.
In order to test the feasibility of using the human genome as a reference set for motif prediction, we used a set of bovine and human expression data from cardiac tissue. Initially, we used the homologous human genes and identified a set of motifs common in the promoter regions of the genes. To determine which transcription factors might bind to the predicted motifs and regulate the genes, the motifs were compared with previously described TFBSs and their best hit among known binding sites was identified. These results were then compared with the orthologous bovine promoter sequences where available, applying the same procedure. Finally, we combined and contrasted the various approaches and data sets and examined the underlying substructure of the results using clustering of motif frequencies in the promoter sequences.
We propose and test a general method to deduce regulatory motifs in promoter regions of mammalian species with restricted genomic sequence by using a well-characterised reference genome.
Analysis of a de novo motif prediction and identification method
Common motifs in promoter sequences were detected using the motif prediction programme MEME . In a recent survey of tools for promoter motif prediction MEME performed well among the thirteen different tools that were assessed . Sequences were analysed on both strands for common cis-regulatory elements of length 8 – 12 bp. The predicted elements were subsequently compared to the TFBS databases TRANSFAC  and JASPAR , for identification of the motifs as potential regulatory elements. For each predicted motif, a series of matches were produced, each representing the best hit from their corresponding length category by a scoring mechanism (see Methods). To determine significant matches, a permutation test was also performed.
Summary of the identified promoter motifs in Wasserman-Fickett muscle genes of group a
Motif (Seqs; Sites)
a1 (42; 31)
5.00 × 10-3
a2 (42; 29)
3.60 × 10-2
3.70 × 10-1
3.64 × 10-1
a4 (11; 7)
2.00 × 10-2
a5 (7; 9)
3.00 × 10-3
a15 (6; 6)
Motifs a4 and a15 had significant hits to known TFBS which were different to the expected binding sites. The a4 motif matched the complementary strand to POU6F1_01. The corresponding factor for this matrix is the POU domain, class 6, transcription factor 1. The gene encoding this protein has been shown to be expressed in muscle tissues and is involved in transcriptional regulation of several genes in cardiac tissue . The a15 motif had a high similarity to TEL2_Q6 and its related matrix ETS_Q4, both belonging to the v-ets erythroblastosis virus E26 oncogene gene family.
None of the remaining motifs had matches that met the selection criteria and, therefore, they may represent novel regulatory elements. To estimate false positive rates one thousand randomisations of the original 46 sequences were done and motifs predicted. Among the original set of predicted motifs, seven had no similar hits in the 1000 randomisation sets (indicated in Figure 2) [see Additional file 1]. Four of these are novel a3, a7, a13, although a7 has some similarity to the MYOGNF1_01 site, the myogenin binding site.
Screening all 24030 equivalent RefSeq promoters using MAST gave the genome-wide frequencies of the predicted motifs. Motif a2 is the most frequent among the promoters (4847 occurrences) [see Additional file 2]. The other four motifs similar to known sites, that is a1, a4, a5 and a15, had less than 500 occurrences in the promoters, suggesting that they may be restricted to a subset of genes.
Comparison of the de novo predicted motifs in the muscle specific set (a) with three sets of known TFBSs demonstrates that the methodology adopted for the analysis found both known and novel motifs. Five of the 14 unique motifs (36%) could be identified as resembling known TFBSs. The most significant novel motifs were a3 and a7 (Figure 2). These two motifs were common in the dataset with 9 and 16 occurrences but absent in the control randomisation sets [see Additional file 1].
Analysis of genes expressed in bovine cardiac cDNA libraries and homologous human genes (group b)
Human genes in group b and their bovine orthologues
Human RefSeq ID
Bovine orthologue Ensembl ID
Myosin binding protein C
Myosin, heavy polypeptide 7, beta
Myosin, light polypeptide 3, alkali
Troponin I type 3
Myosin, light polypeptide 2, regulatory, slow
Troponin T type 2
p21(CDKN1A)-activated kinase 4
FERM domain containing 5
Aetyl-Coenzyme A carboxylase beta
Cadherin 13, H-cadherin
Acyl-CoA synthetase long-chain family member 1
Myosin, heavy polypeptide 6, alpha
Ornithine aminotransferase-like 1
N-ethylmaleimide-sensitive factor attachment protein, alpha
Heat shock 27 kDa protein family, member 7
Zinc finger, FYVE domain containing 26
Glioma tumor suppressor candidate region gene 2
Zinc finger protein 358
Collagen, type VI, alpha 2
Collectin sub-family member 12
Cardiomyopathy associated 5
Hypothetical protein MGC50559
Cardiomyopathy associated 1
Human genes in group c
Human RefSeq ID
Troponin C type 1, slow
Myosin, heavy polypeptide 7, beta
Cysteine and glycine-rich protein 3
Myosin, light polypeptide 7, regulatory
Alpha myosin heavy chain
Myosin, light polypeptide 3, alkali
Myosin light chain 2
Troponin I type 3
Troponin T type 2
Actin, alpha, cardiac
Heat shock 27 kDa protein family, member 7
Nebulin-related anchoring protein
GUP1 glycerol uptake/transporter homolog
Ras-related associated with diabetes
Cardiomyopathy associated 1
Solute carrier family 25
Cysteine-rich protein 2
Evolutionarily conserved signaling intermediate in Toll pathway
Ankyrin repeat and SOCS box-containing 2
Patatin-like phospholipase domain containing 2
The EST sequences from 48 selected bovine tissue libraries were assembled into contigs and those contigs containing six or more ESTs were retained. These contigs were then analysed for cardiac tissue specific expression and 36 were selected. Putative human homologues for the contigs were obtained through BLASTN analysis , by comparing the bovine contig consensus sequences to the human RefSeq database (RefSeq release as of 02/09/2005), giving rise to 23 unique human RefSeq genes (Figure 1b, Table 2). The genes include well-known muscle-specific genes encoding sarcomeric or sarcomeric-associated proteins, such as Troponin T type 2 (TNNT2) and Troponin I type 3 (TNNI3), myosin binding protein C (MYBPC3) and other myosin-related genes.
A segment of 1500 bp upstream of the TSS for each of the genes was analysed for common cis-regulatory elements. This region is expected to contain most of the binding sites that are involved in transcription initiation [7, 8].
Summary of the identified promoter motifs in the 23 human RefSeq genes of group b
Motif (Seqs; Sites)
b1 h (17; 45)
5.71 × 10-1
7.82 × 10-1
b3 h (7; 10)
4.00 × 10-3
b5 h (15; 29)
1.10 × 10-2
b8 h (4; 6)
9.00 × 10-3
b11 h (6; 6)
MZF 5–13* (10)
5.00 × 10-3
Analysis of genes expressed in bovine muscle cDNA libraries and human cardiac tissue (group c)
Comparison with the human genome identified 73 unique human RefSeq genes [see Additional file 5]. These were later filtered for expression in human cardiac tissue using microarray expression ratios  extracted from the UCSC genome browser. Of the 73 human sequences, 25 met the cutoff of an expression ratio of > 2 in cardiac tissue. Therefore, this approach also employs gene expression information from the reference genome. The methodology is useful when the equivalent information in the original genome is not available.
Identified motifs in promoters of 25 human RefSeq cardiac genes in group c
Motif (Seqs; Sites)
c1 h (20; 49)
4.54 × 10-1
6.02 × 10-1
c2 h (15; 36)
2.40 × 10-2
c4 h (11; 30)
c9 h (6; 7)
3.00 × 10-3
c10 h (5; 6)
1.80 × 10-2
c11 h (3; 8)
1.70 × 10-2
c12 h (12; 8)
4.00 × 10-2
c14 h (8; 12)
4.00 × 10-3
c15 h (12; 17)
2.70 × 10-2
Analysis of the motifs predicted in group c shows the presence of several potentially muscle-specific regulatory elements found in the promoters of human genes. Compared to genes in group b, there was far less redundancy in the predicted motifs, which may be due to slightly larger data sets and also the expression specificity of the selected data, as expression array information may more accurately predict expression than EST profiling.
Frequency of the predicted motifs in the combined cardiac specific promoters
The unique MEME motifs, seven predicted in group b (Figure 3) and all the 15 motifs in group c (Figure 4), were combined together (indicated with red arrow in Figure 1). Similar motifs were identified and collapsed using MAST .
An examination of the motif frequency patterns in the promoters shows that five of the seven unique motifs (71%) are in common between the two sets [see Additional file 7]. This is an indication that using the inherent bovine EST frequencies in tissue libraries is sufficient for the analysis of regulatory patterns.
The 40 promoters predicted by either methods (b and c) to be cardiac specific were examined for overlap with CpG islands , only 11 overlapped. None of the eight promoters predicted by both methods had CpG islands within the 1500 bases upstream of the TSS only one had a TATA box within 50 bases. This is consistent with our finding of c1 h_b1 h/GC/Sp1 sites in these promoters (Figure 5).
Analysis of bovine genome sequences
The bovine genome is the first of the artidactyl genomes to be sequenced. The Btau2.0, 6.2 fold coverage assembly comprises 1.7 Gb with only 24% in contigs > 1000 bases. The small contig size creates problems for extraction of regions upstream of the TSS or first coding exon. Orthologous bovine promoters from the 23 human RefSeqs in group b, were retrieved using Ensembl's gene orthology criteria . Fifteen orthologues could be retrieved (Table 1).
The 15 sequences were analysed as before. Of the six unique motifs, three were in common with motifs predicted in their human orthologues corresponding to AP2 (b11b group, b15b, b11b) and MEF2 (b2b group) [see Additional file 8]. Two novel motifs (b1b, and b3b) occurred frequently with 19 and 25 occurrences, in 8 and 11 of 15 sequences [see Additional file 7].
A common problem in agricultural science is incomplete genomic sequence for the species of interest, and therefore limited or no access to the promoter sequences. The proposed solution is to identify orthologous genes in related species that have sequenced genomes and extract the relevant promoter regions. These promoters can then be used to search for common regulatory motifs. The approach used here was to examine the promoter regions of co-expressed genes for shared motifs. Such a method can identify both known and novel TFBSs, it can also potentially identify genes that may be co-expressed but that have not been measured as part of the experiment.
In this paper we present a simple approach for deducing regulatory information in mammalian genomes with restricted sequence by using a reference genome. We evaluate this approach using an independent data set. We used MEME  for predicting motifs in promoter sequences of the genes. MEME performed well in a recent assessment of tools employed for prediction of TFBSs  and offers a flexible set of parameters.
Wasserman and Fickett  compiled and analysed a group of experimentally verified muscle regulatory regions and discovered the presence of five specific binding sites in these sequences. This human subset (group a) of this collection was used as the training set to evaluate the proposed approach. Promoter analysis of sequences in the training set identified three of the five expected binding sites for muscle-specific transcription factors. We identified MEF2, Sp1 and Myf binding sites significantly in the set. The SRF binding site also matched to the same motif identified as a MEF2 site (a1). This result suggests that MEME may be clustering these two binding sites together due to the similarity at the centre of both sites. The randomisation of the training data also supported this observation. The composition, that is number of sequences and sites that contribute to the structure of a motif, of several low ranking motifs, such as a7, a9, a10 and a13, were very distinct from motifs generated from the random sequences. Hence, the approach used was to examine all 15 predicted motifs regardless of their MEME reported e-values in the subsequent analyses. We conclude from this initial comparison that our approach and selection criteria identified the majority of the motifs present.
The predicted motifs in group a were also compared to known TFBSs in TRANSFAC  and JASPAR . These databases are the most comprehensive collection of known TFBSs available. The TRANSFAC database is the largest available set of known TFBSs. The core matrices of many profiles have also been computationally extended – introducing potential redundancy into the data. Many of the longer binding sites are contributing to non-specific matches. The size of the JASPAR database is much smaller than TRANSFAC but it consists of experimentally verified and highly curated profiles. Overall in this study more of the TRANSFAC hits had p-values < 0.05 compared to those in JASPAR.
The MEF2 binding site was identified. The MEF2 gene family members are expressed during early embryogenesis and throughout the developing myocardium and are well-characterised for their involvement in muscle differentiation [25–27]. The conserved DNA binding domain in these regulators recognises an AT-rich consensus sequence, present in the regulatory regions of many muscle-specific genes [28, 29]. The MEF2 motifs in the TRANSFAC and JASPAR databases are largely based on in vitro SELEX data and differ slightly from each other and from that in .
Sp1 like motifs were also be identified, including the CHCH_01 motif in group a, that is a partial motif of the Sp1 binding site. The POU6F1 binding site identified in the set may be a novel muscle element. POU6F1 is a member of the homeobox protein family, involved in developmental processes, and has been shown to be expressed in muscle tissue . Another potentially novel factor for transcriptional regulation of muscle genes is TEL2, a member of the ETS family. However, this same motif matched the ETS-1 motif itself, which has been shown to be involved in cardiac morphogenesis [30, 31] and the TEF-1 motif in JASPAR also matched the same motif, indicating a similar composition in the two motifs.
Once we had evaluated our method, we subsequently utilised bovine cDNA expression data for the study. Contigs resulting from assembling the initial ESTs were analysed for muscle and cardiac tissue-specific expression using two methods. The first method used only the bovine EST frequency per library data. The second method employed both bovine and human expression data. Fourteen motifs (47%) of the initially 30 predicted motifs in the human promoters were identified as known TFBSs. Comparison of the unique motifs detected in the human promoters of groups b and c show that 71% in group b are in common between the two sets. These results indicate that even though the genes in the two sets were selected partially independently from each other, they produce similar results.
Once promoter sequences sharing common regulatory elements were identified, the combination of motifs present in the promoters could then be used to determine additional substructure within the results using clustering based on the number TFBSs present in the promoters. This suggests that the number and order of TFBSs present in a promoter encodes information that can be easily extracted from the results of this approach. The eight common promoters contained multiple Sp1 like sites (Figure 5) and generally lacked CpG islands and TATA boxes near to the TSS. Promoters with these characteristics generally had a single dominant peak of TSS  in other tissues, but potential alternative promoter starts were not examined in this study.
For the bovine genome two assemblies are currently in use (Btau2.0) used here and Btau3.1. The latest assembly has increased the size of the contigs to 61% > 1000 (rather than 24%) . However, this is still much lower than that of human at 94% > 5000. Therefore the assembly in regulatory regions, particularly the link between these and the coding regions remains much better for the human genome, and the human genome will remain a useful reference.
However, the current method suffers from some caveats. It does not use all available information, such as co-localisation and order of the TFBSs. Additionally, the method is not well-suited for determining gene regulation in tissues not shared between species, such as the rumen in artiodactyl mammals. This will therefore impose restrictions on the kind of analyses that may be performed. This barrier will, however, be overcome with a comprehensive annotation of bovine genome.
In the latest version of MEME, it is now possible to compare the produced PFMs to known motifs in JASPAR , a procedure similar to this study. It produces similar results but is also limited by the incompleteness of the JASPAR database as found here.
Examining the evolutionary conservation of the motifs can give further information about individual motifs and their inferred involvement in transcriptional regulation. Using available tools for phylogenetic footprinting studies, such as multi-species sequence alignments from public databases, e.g. the UCSC or Ensembl genome browsers, can aid in this task. The recently developed PhyMe  and PhyloGibbs  programmes address this proposed approach, using motif over-representation coupled with phylogenetic comparison to calculate significance of the predicted motifs. These will become more powerful as the number of genomes sequenced increases and the coverage and quality of their assembly improve.
By comparing two methods, both initially based on bovine EST data, we show that using human promoter regions as a reference platform in interpreting ruminant expression studies is a viable solution for the analysis of gene regulation patterns in the bovine genome.
The proposed method is simple and easy to implement with existing software and is robust when sufficient co-expressed (co-regulated) sequences can be identified. Finally, as the bovine genome becomes better annotated, it can serve as an interim platform for many other agriculturally important animals, such as sheep and goat, until their genome sequences become available.
Collection of data sets
In this paper, we analyse three data sets. The first set consists of regulatory regions involved in muscle-specific gene expression. This group was compiled by Wasserman and Fickett to study transcription factors associated with skeletal muscle gene regulation . They constructed a set of five position frequency matrices based on binding sites for MEF2, Myf, SRF, TEF-1 and Sp1 transcription factors which were found to bind within these sequences. The regulatory regions for the human subset and the matrices served as the training set to validate our approach. The sequences used were analysed with the MEME programme  (v3, release date: 02/04/2002) to identify common motifs. Frequency matrices of predicted motifs were subsequently compared to the known binding sites above using the matrix similarity implementation of mutual entropy described below. This group is referred to as group a in the paper. The collection of elements was retrieved from .
The second and third data sets (groups b and c) were based on bovine contigs assembled using bovine sequences – a collection of sequences from AgResearch bovine EST libraries along with all bovine cDNAs submitted to NCBI sequence repositories at the time (September 2005). The contigs were generated from the ESTs using CAP3  and resulted in 6223 contigs, each containing ≥ 6 ESTs. In order to distinguish cardiac-specific genes, the contigs were subjected to the following criterion: more than 90% of the ESTs in the contig should be present in the bovine cardiac library. The contigs were then used for retrieving their human homologues. Annotation of the sequences was performed using their corresponding human RefSeq sequence (RefSeq release as of 02/09/2005). The BLASTN  sequence analysis was employed to identify sequence homology and each bovine contig was annotated with the best human RefSeq homologue. If several bovine contigs matched the same human gene, the gene was reported only once, resulting in a list of unique human RefSeq genes. The human RefSeq sequences comprised the second data set for the analysis, group b. In parallel to this approach, human orthologues of contigs in four bovine muscle libraries (NCBI dbEST ID numbers: 18993, 18997, 18987 and 19022) were examined for their expression levels in human cardiac tissue using publicly available human gene expression results. The human RefSeqs were submitted to the University of California-Santa Cruz (UCSC) Human Gene sorter  using the GNF Gene Expression Atlas 2 . Genes with an expression log ratio of 2 or higher in heart tissue were selected. These human genes comprised group c. Figure 1 displays a diagram of the different approaches and the data selection procedure.
A region of 1500 bp upstream of the transcription start site for all the human RefSeq genes was extracted using the UCSC human genome browser . The sequences were masked for repetitive elements during the retrieval process. Intersect with CpG island prediction (cpgIslandExt) was done using the UCSC table browser. The promoters of the orthologous bovine genes were retrieved from Ensembl.
All data sets, groups a – c, were separately examined to identify common and potentially functional elements present in the promoter regions using MEME. Based on the assumption that transcription factor binding sites would likely consist of small and highly conserved motifs, the programme was set to output the top 15 motifs from each set, with motif length ranging between 8 and 12 bases long. We also allowed for each motif to occur any number of times in a single sequence. The minimum and maximum number of sites were allowed to vary between 6 and 50 respectively. The reverse complement strand of each sequence was also considered in the analysis. Sequence logos for all the motifs were generated using WebLogo . Using the MAST programme, promoter sequences for all human RefSeq genes were screened for the presence of all unique motifs from each data set.
The promoters of the human genes in groups b and c were combined together, as were the motifs predicted in the two sets, for identification of all motifs shared between the sets. Duplicated sequences were removed. Any motifs with a Pearson's correlation coefficient > 0.6 were identified and collapsed. The motif frequencies in the different promoters were analysed using hierarchical clustering in R , where the dissimilarities between the frequencies were taken into account. The resulting matrix was then used to generate a colour map. The same procedure was also applied to the bovine promoter sequences from groups b and c.
Identification of known motifs
To test for known biological relevance of the MEME motifs, the transcription factor databases TRANSFAC v8.1  and JASPAR  were used. The flat file for the TRANSFAC database was accessible through a commercial licence while the JASPAR matrix files were freely available from . The database flat files were parsed using the TFBS Perl modules  to generate position frequency matrices (PFMs). Only the vertebrate matrices from the databases were employed, resulting in 522 TRANSFAC and 83 JASPAR matrices respectively.
where M1 and M2 are assumed to have the same number of columns, l. As the comparison is asymmetric, the average between S(M1, M2) and S(M2, M1) is selected as the score for comparing two matrices. The dissimilarity score (S), ranging between 0 and 6 in this study, is an indication of the degree of dissimilarity between the two matrices – i.e. a value of zero indicates a perfect match while higher values indicate less homology between the two matrices.
Given two sets of PFMs, one comprising the predicted matrices and the other a database of known matrices for TFBSs, a Perl  script was used to carry out matrix comparisons – comparing each predicted matrix to all the entries in the database. To compare two PFMs of different lengths, a sliding window with length equal to the shorter PFM was used. At each comparison instant the following criteria has to be satisfied: the length of the comparison window has to be at least six bases long, and the overlapping segment of at least one of the matrices has to have a minimum of 60% information content in total. The final score for a comparison is normalised by dividing by the length of the comparing window. As transcription factors have the ability to bind to the DNA on either strand, the Perl script also tests the reverse complement of the predicted MEME matrices. The database matrices are divided into categories depending on their length. Best matches from each length category are reported, resulting in a series of hits for each of the MEME matrices. The final best match for each predicted matrix was selected from these by using random permutations of each MEME matrix to estimate statistical significance (see below).
In order to evaluate the statistical significance of the obtained matrix matches, each MEME matrix was randomly permuted 1000 times to obtain a p-value. The permutations re-sort the base composition of the matrix while keeping other associations such as the GC content and the depth of the matrix, i.e. the number of sites, unchanged. The false discovery rate (FDR) of the procedure was estimated by adjusting the p-values resulting from the permutations for the number of hypotheses tested, i.e. 15 matrix comparisons. The p.adjust programme in the R stats package  was used for the purpose. Sequences in the training set were shuffled 1000 times to generate 1000 random sequence sets. These were subsequently examined by MEME where in total 15000 motifs were generated. The motifs were used to obtain background distributions for number of sequences and number of sites that comprised the motifs.
Software is available at http://guinevere.otago.ac.nz/TFBS_Zadissa/.
We thank Ken Dodds and Mik Black for statistical advice, Alan McCulloch for assembling the bovine ESTs used in this analysis, Craig Miskell for help with permutation runs on the Linux computer farm, Orla Keane, Andrew Firth and Theresa Wilson for proofreading and invaluable comments on the manuscript and AgResearch for providing access to the ESTs before public release. A.Z. was supported by a New Zealand Foundation for Research, Science and Technology Bright Future Enterprise Scholarship.
- Khaitovich P, Pääbo S, Weiss G: Toward a neutral evolutionary model of gene expression. Genetics. 2005, 170 (2): 929-939. 10.1534/genetics.104.037135.PubMed CentralPubMedView ArticleGoogle Scholar
- Maston GA, Evans SK, Green MR: Transcriptional Regulatory Elements in the Human Genome. Annu Rev Genomics Hum Genet. 2006, 7: 29-59. 10.1146/annurev.genom.7.080505.115623.PubMedView ArticleGoogle Scholar
- Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D: Ultraconserved elements in the human genome. Science. 2004, 304 (5675): 1321-1325. 10.1126/science.1098119.PubMedView ArticleGoogle Scholar
- Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM: Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 2006, 16: 1-10. 10.1101/gr.4222606.PubMed CentralPubMedView ArticleGoogle Scholar
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005, 434 (7031): 338-345. 10.1038/nature03441.PubMed CentralPubMedView ArticleGoogle Scholar
- Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, Kodzius R, Shimokawa K, Bajic VB, Brenner SE, Batalov S, Forrest AR, Zavolan M, Davis MJ, Wilming LG, Aidinis V, Allen JE, FANTOM Consortium, RIKEN Genome Exploration Research Group, Genome Science Group (Genome Network Project Core Group) m: The transcriptional landscape of the mammalian genome. Science. 2005, 309 (5740): 1559-1563. 10.1126/science.1112014.PubMedView ArticleGoogle Scholar
- ENCODE Project Consortium: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. 10.1038/nature05874.View ArticleGoogle Scholar
- Trinklein ND, Aldred SJ, Saldanha AJ, Myers RM: Identification and functional analysis of human transcriptional promoters. Genome Res. 2003, 13: 308-312. 10.1101/gr.794803.PubMed CentralPubMedView ArticleGoogle Scholar
- Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Frick I, Akiyama J, De Val S, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM: In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006, 444 (7118): 499-502. 10.1038/nature05295.PubMedView ArticleGoogle Scholar
- Bailey T, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994, AAAI Press, 28-36.Google Scholar
- Tompa M, Li N, Bailey T, Church G, De Moor B, Eskin E, Favorov A, Frith M, Fu Y, Kent W, Makeev V, Mironov A, Noble W, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23: 137-144. 10.1038/nbt1053.PubMedView ArticleGoogle Scholar
- Matys V, Fricke E, Geffers R, Gössling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel A, Kel-Margoulis O, Kloos D, Land S, Lewicki-Potapov B, Michael H, Münch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003, 31: 374-378. 10.1093/nar/gkg108.PubMed CentralPubMedView ArticleGoogle Scholar
- Sandelin A, Alkema W, Engström P, Wasserman W, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004, 91-94. 10.1093/nar/gkh012. 32 DatabaseGoogle Scholar
- Wasserman W, Fickett J: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol. 1998, 278: 167-181. 10.1006/jmbi.1998.1700.PubMedView ArticleGoogle Scholar
- Zhao C, Meng A: Sp1-like transcription factors are regulators of embryonic development in vertebrates. Dev Growth Differ. 2005, 47 (4): 201-211. 10.1111/j.1440-169X.2005.00797.x.PubMedView ArticleGoogle Scholar
- Wey E, Lyons GE, Schäfer BW: A human POU domain gene, mPOU, is expressed in developing brain and specific adult tissues. Eur J Biochem. 1994, 220 (3): 753-762. 10.1111/j.1432-1033.1994.tb18676.x.PubMedView ArticleGoogle Scholar
- Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.PubMedView ArticleGoogle Scholar
- Lingbeck JM, Trausch-Azar JS, Ciechanover A, Schwartz AL: E12 and E47 modulate cellular localization and proteasome-mediated degradation of MyoD and Id1. Oncogene. 2005, 24 (42): 6376-6384.PubMedGoogle Scholar
- Bragança J, Eloranta JJ, Bamforth SD, Ibbitt JC, Hurst HC, Bhattacharya S: Physical and functional interactions among AP-2 transcription factors, p300/CREB-binding protein, and CITED2. J Biol Chem. 2003, 278 (18): 16021-16029. 10.1074/jbc.M208144200.PubMedView ArticleGoogle Scholar
- Su A, Wiltshire T, Batalov S, Lapp H, Ching K, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke M, Walker J, Hogenesch J: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004, 101 (16): 6062-6067. 10.1073/pnas.0400782101.PubMed CentralPubMedView ArticleGoogle Scholar
- Peterkin T, Gibson A, Loose M, Patient R: The roles of GATA-4, -5 and -6 in vertebrate heart development. Semin Cell Dev Biol. 2005, 16: 83-94. 10.1016/j.semcdb.2004.10.003.PubMedView ArticleGoogle Scholar
- Bailey T, Gribskov M: Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998, 14: 48-54. 10.1093/bioinformatics/14.1.48.PubMedView ArticleGoogle Scholar
- UCSC genome browser. [http://genome.ucsc.edu/]
- Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Flicek P, Gräf S, Hammond M, Herrero J, Howe K, Iyer V, Jekosch K, Kähäri A, Kasprzyk A, Keefe D, Kokocinski F, Kulesha E, London D, Longden I, Melsopp C, Meidl P, Overduin B, Parker A, Proctor G, Prlic A, Rae M, Rios D, Redmond S, Schuster M, Sealy I, Searle S, Severin J, Slater G, Smedley D, Smith J, Stabenau A, Stalker J, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Hubbard TJ: Ensembl 2006. Nucleic Acids Res. 2006, 556-561. 10.1093/nar/gkj133. 34 DatabaseGoogle Scholar
- Black BL, Olson EN: Transcriptional control of muscle development by myocyte enhancer factor-2 (MEF2) proteins. Annu Rev Cell Dev Biol. 1998, 14: 167-196. 10.1146/annurev.cellbio.14.1.167.PubMedView ArticleGoogle Scholar
- Anderson JP, Dodou E, Heidt AB, Val SJD, Jaehnig EJ, Greene SB, Olson EN, Black BL: HRC is a direct transcriptional target of MEF2 during cardiac, skeletal, and arterial smooth muscle development in vivo. Mol Cell Biol. 2004, 24 (9): 3757-3768. 10.1128/MCB.24.9.3757-3768.2004.PubMed CentralPubMedView ArticleGoogle Scholar
- Molkentin JD, Firulli AB, Black BL, Martin JF, Hustad CM, Copeland N, Jenkins N, Lyons G, Olson EN: MEF2B is a potent transactivator expressed in early myogenic lineages. Mol Cell Biol. 1996, 16 (7): 3814-3824.PubMed CentralPubMedView ArticleGoogle Scholar
- Edmondson DG, Lyons GE, Martin JF, Olson EN: Mef2 gene expression marks the cardiac and skeletal muscle lineages during mouse embryogenesis. Development. 1994, 120 (5): 1251-1263.PubMedGoogle Scholar
- Lin Q, Schwarz J, Bucana C, Olson EN: Control of mouse cardiac morphogenesis and myogenesis by transcription factor MEF2C. Science. 1997, 276 (5317): 1404-1407. 10.1126/science.276.5317.1404.PubMed CentralPubMedView ArticleGoogle Scholar
- Macías D, Pérez-Pomares JM, García-Garrido L, Carmona R, Muñnoz-Chápuli R: Immunoreactivity of the ets-1 transcription factor correlates with areas of epithelial-mesenchymal transition in the developing avian heart. Anat Embryol (Berl). 1998, 198 (4): 307-315. 10.1007/s004290050186.View ArticleGoogle Scholar
- Lie-Venema H, de Groot ACG, van Empel LJP, Boot MJ, Kerkdijk H, de Kant E, DeRuiter MC: Ets-1 and Ets-2 transcription factors are essential for normal coronary and myocardial development in chicken embryos. Circ Res. 2003, 92 (7): 749-756. 10.1161/01.RES.0000066662.70010.DB.PubMedView ArticleGoogle Scholar
- Bos taurus genome: Statistics – Build 3.1 (based on Btau_3.1). [Accessed on 1 May 2007., [http://www.ncbi.nlm.nih.gov/mapview/stats/BuildStats.cgi?taxid=9913&build=3&ver=1]
- Bailey TL, Williams N, Misleh C, Li WW: MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006, W369-W373. 10.1093/nar/gkl198. 34 Web ServerGoogle Scholar
- Sinha S, Blanchette M, Tompa M: PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics. 2004, 5: 170-10.1186/1471-2105-5-170.PubMed CentralPubMedView ArticleGoogle Scholar
- Siddharthan R, Siggia ED, Nimwegen Ev: PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol. 2005, 1 (7): e67-10.1371/journal.pcbi.0010067.PubMed CentralPubMedView ArticleGoogle Scholar
- Repository of functional regulatory elements. [http://pipmaker.bx.psu.edu/mousegroup/Reg_annotations/]
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-877. 10.1101/gr.9.9.868.PubMed CentralPubMedView ArticleGoogle Scholar
- Karolchik D, Baertsch R, Diekhans M, Furey T, Hinrichs A, Lu Y, Roskin K, Schwartz M, Sugnet C, Thomas D, Weber R, Haussler D, Kent W: The UCSC Genome Browser Database. Nucleic Acids Res. 2003, 31: 51-54. 10.1093/nar/gkg129.PubMed CentralPubMedView ArticleGoogle Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14 (6): 1188-1190. 10.1101/gr.849004.PubMed CentralPubMedView ArticleGoogle Scholar
- R Development Core Team, R Foundation for Statistical Computing: R: A Language and Environment for Statistical Computing. 2006, Vienna, Austria, [ISBN 3-900051-07-0].Google Scholar
- JASPAR matrix sites. [http://jaspar.cgb.ki.se/DOWNLOAD/SITES/]
- Lenhard B, Wasserman W: TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics. 2002, 18 (8): 1135-1136. 10.1093/bioinformatics/18.8.1135.PubMedView ArticleGoogle Scholar
- Perl programming language. [http://www.perl.com]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.