Comparative promoter region analysis powered by CORG
BMC Genomics volume 6, Article number: 24 (2005)
Promoters are key players in gene regulation. They receive signals from various sources (e.g. cell surface receptors) and control the level of transcription initiation, which largely determines gene expression. In vertebrates, transcription start sites and surrounding regulatory elements are often poorly defined. To support promoter analysis, we present CORG http://corg.molgen.mpg.de, a framework for studying upstream regions including untranslated exons (5' UTR).
The automated annotation of promoter regions integrates information of two kinds. First, statistically significant cross-species conservation within upstream regions of orthologous genes is detected. Pairwise as well as multiple sequence comparisons are computed. Second, binding site descriptions (position-weight matrices) are employed to predict conserved regulatory elements with a novel approach. Assembled EST sequences and verified transcription start sites are incorporated to distinguish exonic from other sequences.
As of now, we have included 5 species in our analysis pipeline (man, mouse, rat, fugu and zebrafish). We characterized promoter regions of 16,127 groups of orthologous genes. All data are presented in an intuitive way via our web site. Users are free to export data for single genes or access larger data sets via our DAS server http://tomcat.molgen.mpg.de:8080/das. The benefits of our framework are illustrated by phylogenetic profiling of transcription factor binding sites and by the detection of microRNAs close to the transcription start sites of our gene set.
The CORG platform is a versatile tool to support analyses of gene regulation in vertebrate promoter regions. Applications for CORG cover a broad range from studying evolution of DNA binding sites and promoter constitution to the discovery of new regulatory sequence elements (e.g. microRNAs and binding sites).
Comparative sequence analysis has been a powerful tool in bioinformatics for addressing a variety of issues. Applications range from grouping of sequences (e.g. protein sequences into families) to de novo pattern discovery of functional signatures.
Speaking of gene regulation, it has been known for a long time that there is considerable sequence conservation between species in non-coding regions of the genome. A comprehensive explanation of this observation is still elusive. However, sequence conservation within promoter regions of genes often stems from transcription factor binding sites that are under selective pressure (see  for a review and  for a systematic assessment of binding site conservation in man and mouse comparisons).
Conserved sequence elements of other types have recently caught much attention. Not all non-coding conserved DNA in the vicinity of a gene's transcription start site necessarily functions at the level of transcriptional regulation. For example, most known methylation-guide snoRNAs are intron-encoded and processed from transcripts of housekeeping genes. A few microRNAs are apparently linked to protein coding genes, most notably mir-10 and mir-196, which are located in the (short) intergenic regions in the Hox gene clusters of vertebrates [4–7].
A second class of conserved sequence elements exerts its function as regulatory motifs in the untranslated region (UTR) of the primary transcript or the mature mRNA. The UTRsite database, for example, lists about 30 distinct functional motifs including the Histone 3'UTR stem-loop structure (HSL3), the Iron Responsive Element (IRE), the Selenocysteine Insertion Sequences (SECIS), and the Internal Ribosome Entry Sites (IRES). Most of these elements are contained in CORG since short intergenic regions or introns upstream of the translation start site are entirely covered by our definition of an upstream region.
The CORG framework aims at detecting and describing regulatory elements that are proximal to the transcription start site. In this context, the comparison of upstream regions of orthologous genes is particularly valuable. This concept is called "phylogenetic footprinting" and an overview of this approach can be found in .
Phylogenetic footprinting in a strict sense is carried out on orthologous promoter regions. Local sequence similarities can then be directly interpreted as related regions harboring conserved functional elements. We denote these similarities as Conserved Non-coding Blocks (CNBs).
Multi-species sequence conservation
Comparative approaches gain power from the inclusion of sequences from more than two species. Multi-species comparisons help to increase specificity at the expense of intra-species sensitivity since supporting evidence (conservation) stems from many observations. For example, man-mouse-rat comparisons enhance the detection of transcription factor binding sites since the rat genome is more divergent from the mouse genome than anticipated. A useful property of vertebrate microRNAs is their high degree of sequence conservation, which is found in alignments of man, mouse and fish microRNAs. Both types of comparisons are available in CORG. We consider cross-species conservation between promoter regions from 5 vertebrate genomes, namely Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio and Fugu rubripes. Multiple alignments are built from pairwise CNBs as described in the subsequent section.
Construction and content
Groups of orthologous genes
In this work, we take a gene-centered view of phylogeny. Homology among proteins and thus genes is often concluded on the basis of sequence similarity. The EnsEMBL database makes it possible to distinguish orthologous from merely homologous genes by taking information on conserved synteny into account. We employed single linkage clustering on the graph of EnsEMBL orthologous gene pairs to define the CORG gene groups.
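Single linkage clustering on a graph of gene pairs amounts to finding its connected components. The following is a minimal sketch of that idea using a union-find structure; the function name `ortholog_groups` and the gene identifiers in the usage note are hypothetical and do not reflect the actual CORG pipeline code.

```python
def ortholog_groups(pairs):
    """Group genes by single linkage: two genes share a group if a chain
    of orthologous pairs connects them (connected components)."""
    parent = {}

    def find(gene):
        # Register unseen genes as their own root, then follow parent
        # links to the root, shortening the path along the way.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = {}
    for gene in parent:
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in groups.values())
```

For instance, the made-up pairs `("hsSRF", "mmSrf")` and `("mmSrf", "rnSrf")` end up in one group even though no direct man-rat pair was supplied, which is the defining property of single linkage.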
Genomic mapping of validated promoter regions
Various recent experimental efforts supply information about the position of transcriptional start sites in the human and mouse genome. Table 1 gives an overview on the resources that were employed in CORG.
Some repositories offer genomic coordinates for their start site entries. Existing genomic mapping information was incorporated unless the underlying genome assembly build differed. The remaining data were projected onto the genome with SSAHA (Sequence Search and Alignment by Hashing Algorithm), a rapid near-exact alignment algorithm .
The notion of "promoter region" deserves some further explanation in the context of our approach. Typically, though not exclusively, we expect conserved regulatory regions to appear in the vicinity of the transcription start site of a gene. Since we do not know the precise location of the start of transcription for each and every gene, we chose to compare the sequence regions upstream of the start of translation from orthologous genes. If verified transcription start sites are known, we define a sequence window that is large enough to hold both translation and transcription start sites, plus 5 kb of upstream sequence. If we lack this information, our observations on known transcription start sites indicate that most promoter regions should be captured in a sequence window of 10 kb size (Additional File 1). The size of a promoter region may be bounded by the size of the corresponding intergenic region. If an annotated gene happens to lie within the primary sequence window, the promoter region is shortened to exclude exonic sequence.
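The window rules above can be sketched as a small function. This is an illustration only: it assumes a gene on the forward strand, a single verified transcription start site at most, and a single upstream neighbour whose last exon ends at `neighbour_end`; all names and defaults other than the 5 kb and 10 kb values are hypothetical.

```python
def promoter_window(atg, tss=None, neighbour_end=None,
                    flank=5000, default=10000):
    """Return (start, end) of the upstream region for a forward-strand gene.

    atg           -- position of the translation start
    tss           -- verified transcription start site, if known
    neighbour_end -- end of the nearest upstream annotated exon, if any
    """
    if tss is not None:
        # Window holds both start sites plus 5 kb of upstream sequence.
        start = min(tss, atg) - flank
    else:
        # No verified start site: fall back to a fixed 10 kb window.
        start = atg - default
    if neighbour_end is not None:
        # Shorten the window so that it excludes upstream exonic sequence.
        start = max(start, neighbour_end)
    return max(start, 0), atg
```

A gene on the reverse strand would need the mirrored computation, which is omitted here to keep the sketch short.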
Detection of pairwise local sequence similarities
Significant local sequence similarities (phylogenetic footprints) in two sequences are computed with an implementation of the Waterman-Eggert algorithm. We have already given an account of the algorithm and statistics in [19, 20]. The underlying alignment scoring scheme is the general reversible model, whose off-diagonal rates have the form
$$Q_{ij} = \alpha_{ij}\,\pi_j \quad (i \neq j), \qquad \alpha_{ij} = \alpha_{ji},$$
where Q is the transition rate matrix. The elements on the diagonal are left out; they are constrained by the requirement that the sum of all elements in a row equals zero.
The $\pi_i$ are the stationary nucleotide frequencies; their sum is constrained to be one. Although the two genomes under consideration are in general not in their stationary state with respect to the substitutional process, we take the mean of the two observed nucleotide frequencies, $\pi_i = (\pi_i^{(1)} + \pi_i^{(2)})/2$, to be the best estimate of the stationary base composition.
From other studies we have further knowledge about the relative rates of transversions, the transition A:T→G:C, and the transition G:C→A:T, which occur roughly in the ratio 1:3:5 along vertebrate lineages. These ratios of rates would generate sequences with 40% GC in their stationary state. To accommodate the observed nucleotide frequencies $\pi_i$ we have to allow for deviation from those ratios. We do this by choosing, for example, $\alpha_{AT} \propto (R(A \to T)/\pi_T + R(T \to A)/\pi_A)/2$, where $R(i \to j)$ is either 1, 3, or 5 depending on the process under consideration. Finally, we scale the matrix Q such that the PAM distance of the substitution model equals the observed degree of divergence between the two species under comparison.
Since we were mainly interested in highly conserved regulatory elements, we demanded an average similarity level at least as high as the average exon conservation between the species under comparison.
The score for aligning two nucleotides i and j is then $s(i, j) = \log\bigl(P(i, j)/(\pi_i \pi_j)\bigr)$, where $P(i, j)$ is the probability of finding the pairing of i and j under the above substitution model.
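As an illustration of how such a scoring matrix could be derived, the sketch below builds a reversible rate matrix from the 1:3:5 ratios and the stationary frequencies, normalises it to one expected substitution per site and unit time, and converts it into log-odds scores. It is a sketch under stated assumptions, not the CORG implementation: the divergence is assumed to be given in substitutions per site, the matrix exponential is approximated by a truncated series with scaling and squaring, and all function names are hypothetical.

```python
import math

NUC = "ACGT"

def rel_rate(i, j):
    """Relative rate R(i -> j): 1 transversion, 3 A:T->G:C, 5 G:C->A:T."""
    if (i, j) in {("A", "G"), ("T", "C")}:
        return 3.0
    if (i, j) in {("G", "A"), ("C", "T")}:
        return 5.0
    return 1.0

def rate_matrix(pi):
    """Reversible Q with Q_ij = alpha_ij * pi_j, rows summing to zero."""
    Q = [[0.0] * 4 for _ in NUC]
    for a, i in enumerate(NUC):
        for b, j in enumerate(NUC):
            if a != b:
                alpha = 0.5 * (rel_rate(i, j) / pi[j] + rel_rate(j, i) / pi[i])
                Q[a][b] = alpha * pi[j]
        Q[a][a] = -sum(Q[a])
    # Normalise: one expected substitution per site per unit time.
    mu = -sum(pi[i] * Q[a][a] for a, i in enumerate(NUC))
    return [[q / mu for q in row] for row in Q]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def expm(M, squarings=10, terms=12):
    """Matrix exponential via a Taylor series with scaling and squaring."""
    A = [[m / 2.0 ** squarings for m in row] for row in M]
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def score_matrix(pi, divergence):
    """Log-odds scores s(i, j) = log(P(i, j) / (pi_i * pi_j))."""
    Q = rate_matrix(pi)
    T = expm([[q * divergence for q in row] for row in Q])
    scores = {}
    for a, i in enumerate(NUC):
        for b, j in enumerate(NUC):
            joint = pi[i] * T[a][b]  # P(i, j) under the model
            scores[i, j] = math.log(joint / (pi[i] * pi[j]))
    return scores
```

Because the exchangeabilities are symmetric, the resulting scores are symmetric as well, so a single value applies to a mismatch regardless of which sequence carries which nucleotide.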
Joining pairwise into multiple alignments
All CNBs from pairwise sequence alignments are split up into groups as defined by gene homology. For each group a graph O = (V, E) with vertices V and edges E is constructed, which represents the species-internal overlap of CNBs on the genomic coordinate level. Each vertex a ∈ V represents a footprint, which is a pairwise local alignment between two species. An undirected edge is placed between two vertices if the corresponding CNBs have only one species in common and show an overlap of at least 10 bp on the sequence level.
In our graph O, cliques of minimal size three are detected with an implementation of the Bron-Kerbosch algorithm. Only those cliques are selected whose species count is equal to their size. This restriction prevents multiple alignments from arising merely because several short CNBs are similar to a single long CNB. Multiple alignments are then computed for all cliques that meet the outlined criteria. We chose to employ the multiple alignment method of Lee et al., which applies partial order graphs (POG) to the multiple alignment problem.
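The graph construction and clique selection can be sketched as follows. This is a minimal illustration that takes the criteria literally: it enumerates maximal cliques with the basic Bron-Kerbosch recursion (no pivoting) and keeps those with at least three vertices whose species count equals their size. The `CNB` record, the half-open coordinates and the example names are assumptions made for this sketch.

```python
from collections import namedtuple

# A pairwise footprint: name plus {species: (start, end)} for its two species.
CNB = namedtuple("CNB", "name regions")

def linked(x, y, min_overlap=10):
    """Edge test: exactly one shared species and >= min_overlap bp overlap."""
    shared = set(x.regions) & set(y.regions)
    if len(shared) != 1:
        return False
    species = next(iter(shared))
    (s1, e1), (s2, e2) = x.regions[species], y.regions[species]
    return min(e1, e2) - max(s1, s2) >= min_overlap

def bron_kerbosch(R, P, X, adj, out):
    """Collect all maximal cliques of the graph given by adjacency sets."""
    if not P and not X:
        out.append(R)
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P = P - {v}
        X = X | {v}

def select_cliques(cnbs, min_size=3):
    adj = {i: set() for i in range(len(cnbs))}
    for i in range(len(cnbs)):
        for j in range(i + 1, len(cnbs)):
            if linked(cnbs[i], cnbs[j]):
                adj[i].add(j)
                adj[j].add(i)
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    selected = []
    for clique in cliques:
        species = set().union(*(cnbs[i].regions for i in clique))
        if len(clique) >= min_size and len(species) == len(clique):
            selected.append(sorted(cnbs[i].name for i in clique))
    return selected
```

Three mutually overlapping man-mouse, man-rat and mouse-rat footprints form a clique of size three covering three species and are therefore retained, whereas an isolated footprint is not.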
Partial order graphs belong to the class of directed acyclic graphs (DAGs). A DAG is a graph consisting of a set of nodes N and edges E, which are one-way edges and form no cycles.
The multiple alignment problem is then reduced to successive alignment steps of individual sequences to a growing multiple alignment graph. If the sequences to be aligned share substantial sequence similarity, the number of bifurcation points within the POG stays low and allows rapid computation of the multiple alignment.
Alignment results are subsequently trimmed to encompass the leftmost and rightmost ungapped block of at least 6 nucleotides.
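The trimming step can be illustrated with a short function. The sketch assumes that the alignment is given as equally long strings with `-` as the gap character and that an "ungapped block" means a run of consecutive columns without a gap in any row; the function name is hypothetical.

```python
def trim_alignment(rows, min_block=6, gap="-"):
    """Trim an alignment to span from the leftmost to the rightmost
    ungapped block of at least min_block columns (None if there is none)."""
    ncol = len(rows[0])
    ungapped = [all(row[c] != gap for row in rows) for c in range(ncol)]
    blocks, start = [], None
    # A trailing False closes a block that reaches the last column.
    for c, ok in enumerate(ungapped + [False]):
        if ok and start is None:
            start = c
        elif not ok and start is not None:
            if c - start >= min_block:
                blocks.append((start, c))
            start = None
    if not blocks:
        return None
    left, right = blocks[0][0], blocks[-1][1]
    return [row[left:right] for row in rows]
```

Gaps between the outermost qualifying blocks are kept; only the ragged ends of the alignment are removed.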
Annotation of promoter regions
Exon detection with assembled EST clusters
Promoter regions in CORG always extend upstream from the most downstream coding start (ATG). As a consequence, promoter regions may contain exons that are not translated. We detect such exons by a similarity search of man-mouse footprints against GENENEST, a database of assembled EST clusters. Database searches are carried out for human and mouse footprints with the BLASTN program. An E-value cut-off of $10^{-4}$ is applied.
Annotation with predicted binding sites
The TRANSFAC database  is a repository of experimentally verified binding site sequences and representations thereof. These representations are used for querying the collection of man-mouse CNBs for known binding site patterns.
Potential binding sites are detected with TRANSFAC weight matrices by the method of Rahmann et al. Here, the intuition is that there are two random models for a given sequence S: one is given by the signal profile F and the other one by the background model B. Under both models the distribution of weight matrix scores can conveniently be calculated by convolution, since the score is a sum of independent random variables. Probability mass distributions of P F (Score(S)) as well as P B (Score(S)) can be computed by dynamic programming if column scores are reasonably discretized. This allows a fine tuning of the proportions of false positives and negatives for each TRANSFAC weight matrix. Both error levels were set to be equal. All details are given in .
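The convolution idea can be sketched in a few lines. The code below is an illustration under simplifying assumptions rather than the published method: column scores are log-odds values rounded to a grid of width `step`, the background is position-independent, and the threshold is chosen among the attainable scores so that the false positive and false negative rates are as close as possible. All names are hypothetical.

```python
import math
from collections import defaultdict

def discretized_log_odds(profile, background, step=0.01):
    """Integer column scores: round(log(f / b) / step) for each base."""
    return [{b: round(math.log(col[b] / background[b]) / step) for b in col}
            for col in profile]

def score_distribution(score_cols, prob_cols):
    """Exact distribution of the summed score by column-wise convolution."""
    dist = {0: 1.0}
    for scores, probs in zip(score_cols, prob_cols):
        nxt = defaultdict(float)
        for total, p in dist.items():
            for base, s in scores.items():
                nxt[total + s] += p * probs[base]
        dist = dict(nxt)
    return dist

def balanced_threshold(profile, background, step=0.01):
    """Return (threshold, false_positive_rate, false_negative_rate)."""
    scores = discretized_log_odds(profile, background, step)
    p_signal = score_distribution(scores, profile)
    p_back = score_distribution(scores, [background] * len(profile))
    best = None
    for t in sorted(set(p_signal) | set(p_back)):
        fp = sum(p for s, p in p_back.items() if s >= t)    # background hits
        fn = sum(p for s, p in p_signal.items() if s < t)   # missed sites
        if best is None or abs(fp - fn) < best[0]:
            best = (abs(fp - fn), t * step, fp, fn)
    return best[1:]
```

With short matrices the attainable scores are coarse, so the two error rates can only be balanced approximately; longer or more informative matrices allow a closer match.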
Utility and discussion
We now present an overview of the web interface of the database and several example applications.
The CORG database is accessible via its home page http://corg.molgen.mpg.de and offers a redesigned web interface. From the search page one can quickly jump to gene loci via EnsEMBL or other standard identifiers (e.g. HUGO symbol, LocusLink identifier, ...). The search query is processed according to the chosen reference source and a list of all matching database entries is returned to the user. This list serves as a springboard to a summary page where the genomic context of the selected gene and its similarities to other upstream regions are visualized as in Figure 1.
Pairwise as well as multiple comparisons are displayed on demand at this stage with a JAVA applet that complies with the JDK 1.1 standard. Alternatively, upstream region sequence and corresponding annotation can be exported in EMBL format (sequence data also in FASTA format). The JAVA applet should run on all JAVA-compatible web browsers. Detailed information about the conserved non-coding block structure is shown simultaneously for multiple upstream regions of different species. If available, annotation information on putative binding sites of transcription factors and EST matches is displayed for the query sequence. The applet facilitates zooming into sequence and annotation. In addition, web links are assigned to sequence features that relate external data sources to the corresponding annotation.
CORG data may also be embedded into other viewers or programs via the distributed annotation system (DAS). DAS facilitates the display of distributed data sources in a common framework with respect to a reference sequence. Our DAS server http://tomcat.molgen.mpg.de:8080/das constitutes such an external data source. Position information on all conserved non-coding blocks and mapped promoters is accessible from this DAS server. Each DAS sequence feature provides a link to the corresponding CORG database entry. New DAS sources can easily be added to the EnsEMBL display. A small tutorial on installing external DAS data sources is available on our web page http://corg.molgen.mpg.de/DAS_tutorial.htm.
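For programmatic access, a DAS client typically issues a `features` request for a segment of the reference sequence and parses the returned DASGFF document. The sketch below builds such a request URL and extracts feature positions and links from a response. The data source name used in the usage note (`corg_cnb`) is a placeholder, since the actual source names offered by the CORG server are not given here, and no network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def das_features_url(base, source, segment, start, stop):
    """Build a DAS 'features' request for segment:start,stop."""
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{base.rstrip('/')}/{source}/features?{query}"

def parse_dasgff(xml_text):
    """Extract id, coordinates and link of each FEATURE in a DASGFF reply."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            link = feature.find("LINK")
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
                "link": link.get("href") if link is not None else None,
            })
    return features
```

A call such as `das_features_url("http://tomcat.molgen.mpg.de:8080/das", "corg_cnb", "14", 1000, 2000)` yields the request URL; the reply can then be passed to `parse_dasgff` to obtain the block coordinates and their links to database entries.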
Additionally, tools for on-site batch retrieval of CORG data will be added to the web portal in the near future.
Phylogenetic profiling of binding sites
One potential application of CORG is phylogenetic profiling of promoter regions. We define phylogenetic profiling in the context of gene regulation as comparative analysis of presence/absence patterns of binding sites in promoter regions. Here, we consider conserved predicted binding sites and contrast them with validated ones.
Serum Response Factor (SRF) promoter
SRF, a MADS-box transcription factor, regulates the expression of immediate-early genes, genes encoding several components of the actin cytoskeleton, and cell-type specific genes, e.g. smooth, cardiac and skeletal muscle or neuronal-specific genes [31, 32]. Mouse embryos lacking SRF die before gastrulation and do not form any detectable mesoderm [33, 34]. SRF mediates transcriptional activation by binding to CArG box sequences (Consensus pattern: CC(AT)6GG) in target gene promoters and by recruiting different co-factors. SRF regulates transcription downstream of MAPK signaling in association with ternary complex factors (TCFs) (for a review see ). TCFs bind to ets binding sites present adjacent to CArG boxes in many SRF target gene promoters.
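The CArG box consensus CC(AT)6GG translates directly into a regular expression. The short sketch below scans a sequence for all, possibly overlapping, occurrences; the function name is hypothetical. Since the consensus pattern is its own reverse complement, scanning one strand is sufficient for the consensus itself.

```python
import re

# Lookahead so that overlapping matches are reported as well.
CARG_BOX = re.compile(r"(?=(CC[AT]{6}GG))")

def find_carg_boxes(sequence):
    """Return (0-based position, matched sequence) for each CArG box."""
    return [(m.start(), m.group(1))
            for m in CARG_BOX.finditer(sequence.upper())]
```

Such an exact consensus search is stricter than a weight matrix scan and misses CArG-like sites that deviate from the consensus at one position.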
Figure 1 gives an overview of the genomic context of human SRF. As expected, the upstream region of SRF shows substantial conservation to its rodent orthologs. Additionally, significant alignments were found in comparisons with fish homologs (one from zebrafish and two from fugu). The same data are presented in the multiple alignment view of the JAVA applet in Figure 2. This view gives a better idea of the location of alignments in the corresponding source sequences. Note that the spacing between translation start and alignment is greater in fish than in mammals, which hints at a different extent of the promoter region in the two subgroups.
We get a better idea of the cause of sequence conservation by browsing the multiple alignment. Textual information can be obtained by clicking on the alignment boxes. The alignment then appears in a pop-up window and may be copied to another destination. In Figure 3, we used CLUSTAL X to render the conservation structure at the nucleotide level. Here, a striking observation is the conservation of the regulatory feedback loop of SRF to its own promoter in all species under consideration. So far, this feedback loop has been experimentally verified in the mouse system, but it could exist in all other species under comparison.
Non-coding RNA can be classified as transcribed regulatory elements. Non-coding RNAs are also accessible to the user via the CORG database. Since we were primarily interested in non-coding RNAs rather than small mRNA motifs, we restricted our search here to long CNBs. A BLAST search of our multiple alignments with length L ≥ 50 against the Rfam database and the microRNA Registry identifies 21 alignments that correspond to 7 distinct microRNAs and a single snoRNA (Table 2).
The snoRNA U93 is an unusual mammalian pseudouridylation guide RNA which accumulates in Cajal (coiled) bodies and is predicted to function in pseudouridylation of the U2 spliceosomal snRNA. It appears to be specific for mammals. The genomic copy of the human U93 RNA is located in an intron of a series of reported spliced expressed sequence tags (ESTs); furthermore, it has been verified experimentally that U93 is indeed spliced from an intron. It was detectable in the CORG footprint dataset because of its location upstream of a conserved putative gene C14orf87 with unknown function.
The known microRNAs belong to four different groups. The mir-10 and mir-196 precursors are located at specific positions in the Hox gene clusters [4–7]. The mir-196 family regulates Hox8 and Hox7 genes; the function of mir-10 is unknown.
Substitution pattern of non-coding RNAs
For a microRNA we expect a subsequence of about 20 nt that is almost perfectly conserved among vertebrates (the mature miRNA) and a well-conserved complementary sequence forming the other side of the stem from which the mature microRNA is excised. In contrast, the substitution rate should be much larger in the loop region of the hairpin. mir-10 is a good example of this typical substitution pattern, which gives rise to a hairpin structure. The pairwise correlation structure of nucleotides is depicted on top of the multiple alignment in Figure 4. A different pattern is observed for the Iron Responsive Element in the 5'UTR of SLCA1, a member of the sodium transporter family. This time the substitution pattern does not meet the minimal length of the microRNA definition above. Nevertheless, it is conserved across all vertebrate species as shown in Figure 5.
We have improved and extended our framework of comparative analysis and annotation of vertebrate promoter regions over previous releases (see ). The following features have been added to the CORG framework:
• Mapping of validated promoter regions and proper adjustment of the extent of upstream regions.
• Multiple alignments from significant local pairwise alignments.
• Novel approach to predict transcription factor binding sites.
• The web site now offers a genomic context view (as in Figure 1) and an option to export sequence and annotation data.
The CORG database is accessible via our web site. The user is guided step-by-step through the process of selecting and analyzing her promoter region of choice. CORG features an interactive viewer based on JAVA technology, which is tailored to detailed promoter analysis. Large-scale studies make direct use of our DAS service or the MySQL implementation of CORG in conjunction with an application interface (contact authors for details).
We presented selected application examples from the realm of vertebrate gene regulation. Conserved regulatory elements of different kinds (binding sites, microRNAs and UTR elements) are readily accessible to CORG users. New genomes and annotation will be continuously added to CORG.
Availability and requirements
The database is freely accessible through the website http://corg.molgen.mpg.de. Programs, scripts and MySQL database dumps are available from the authors upon request.
Hardison RC: Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 2000, 16 (9): 369-72. 10.1016/S0168-9525(00)02081-3.
Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S: Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res. 2004, 14 (3): 451-458. 10.1101/gr.1327604.
Bachellerie JP, Cavaillé J, Hüttenhofer A: The expanding snoRNA world. Biochimie. 2002, 775-790. 10.1016/S0300-9084(02)01402-5.
Lagos-Quintana M, Rauhut R, Lendeckel W, Tuschl T: Identification of novel genes coding for small expressed RNAs. Science. 2001, 294: 853-858. 10.1126/science.1064921.
Lagos-Quintana M, Rauhut R, Meyer J, Borkhardt A, Tuschl T: New microRNAs from mouse and human. RNA. 2003, 9: 175-179. 10.1261/rna.2146903.
Yekta S, Shih Ih, Bartel DP: MicroRNA-directed cleavage of HoxB8 mRNA. Science. 2004, 304: 594-596. 10.1126/science.1097434.
Tanzer A, Amemiya CT, Kim CB, Stadler PF: Evolution of MicroRNAs Located Within Hox Gene Clusters. J Exp Zool: Mol Dev Evol. 2004,
Pesole G, Liuni S, Grillo G, Licciulli F, Mignone F, Gissi C, Saccone C: UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs: Update 2002. Nucl Acids Res. 2002, 30: 335-340. 10.1093/nar/30.1.335.
Williams AS, Marzluff WF: The sequence of the stem and flanking sequences at the 3'end of histone mRNA are critical determinants for the binding of the stem-loop binding protein. Nucl Acids Res. 1995, 23: 654-662.
Hentze MW, Kuhn LC: Molecular control of vertebrate iron metabolism: mRNA based regulatory circuits operated by iron, nitric oxide, and oxidative stress. Proc Natl Acad Sci USA. 1996, 93: 8175-8182. 10.1073/pnas.93.16.8175.
Walczak R, Westhof E, P C, Krol A: A novel RNA structural motif in the selenocysteine insertion element of eukaryotic selenoprotein mRNAs. RNA. 1996, 2: 367-379.
Le SY, Maizel JV: A common RNA structural motif involved in the internal initiation of translation of cellular mRNAs. Nucl Acids Res. 1997, 25: 362-369. 10.1093/nar/25.2.362.
Duret L, Bucher P: Searching for regulatory elements in human noncoding sequences. Curr Opin Struct Biol. 1997, 7 (3): 399-406. 10.1016/S0959-440X(97)80058-9.
McCue LA, Thompson W, Carmack CS, Lawrence CE: Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res. 2002, 12 (10): 1523-32. 10.1101/gr.323602.
Mullins LJ, Mullins JJ: Insights from the rat genome sequence. Genome Biol. 2004, 5 (5): 221-10.1186/gb-2004-5-5-221.
Lim LP, Glasner ME, Yekta S, Burge CB, Bartel DP: Vertebrate microRNA genes. Science. 2003, 299 (5612): 1540-10.1126/science.1080372.
Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T, Down T, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz HR, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark KC, Cameron G, Durbin R, Cox A, Hubbard T, Clamp M: An overview of Ensembl. Genome Res. 2004, 14 (5): 925-928. 10.1101/gr.1860604.
Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res. 2001, 11 (10): 1725-1729. 10.1101/gr.194201.
Dieterich C, Cusack B, Wang H, Rateitschak K, Krause A, Vingron M: Annotating regulatory DNA based on man-mouse genomic comparison. Bioinformatics. 2002, 18 (Suppl 2): S84-90.
Dieterich C, Wang H, Rateitschak K, Luz H, Vingron M: CORG: a database for Comparative Regulatory Genomics. Nucleic Acids Res. 2003, 31: 55-57. 10.1093/nar/gkg007.
Lio P, Goldman N: Models of molecular evolution and phylogeny. Genome Res. 1998, 8 (12): 1233-44.
Arndt PF, Petrov DA, Hwa T: Distinct changes of genomic biases in nucleotide substitution at the time of Mammalian radiation. Mol Biol Evol. 2003, 20 (11): 1887-96. 10.1093/molbev/msg204.
States D, Gish W, Altschul S: Improved sensitivity of nucleic acid database searches using application- specific scoring matrices. Methods: A companion of Methods in Enzymology. 1991, 3: 66-70.
Bron C, Kerbosch J: Algorithm 457. Finding all cliques of an undirected graph. Commun ACM. 1973, 16: 575-10.1145/362342.362367.
Lee C, Grasso C, Sharlow MF: Multiple sequence alignment using partial order graphs. Bioinformatics. 2002, 18 (3): 452-64. 10.1093/bioinformatics/18.3.452.
Krause A, Haas SA, Coward E, Vingron M: SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein. Nucleic Acids Res. 2002, 30: 299-300. 10.1093/nar/30.1.299.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-402. 10.1093/nar/25.17.3389.
Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel A, Kel-Margoulis O: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research. 2003, 31: 374-378. 10.1093/nar/gkg108.
Rahmann S, Mueller T, Vingron M: On the Power of Profiles for Transcription Factor Binding Site Detection. Statistical Applications in Genetics and Molecular Biology. 2003, 2: 7-
Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L: The distributed annotation system. BMC Bioinformatics. 2001, 2: 7-10.1186/1471-2105-2-7.
Miano JM: Serum response factor: toggling between disparate programs of gene expression. J Mol Cell Cardiol. 2003, 35 (6): 577-93. 10.1016/S0022-2828(03)00110-X.
Treisman R: Journey to the surface of the cell: Fos regulation and the SRE. EMBO J. 1995, 14 (20): 4905-13.
Arsenian S, Weinhold B, Oelgeschlager M, Ruther U, Nordheim A: Serum response factor is essential for mesoderm formation during mouse embryogenesis. EMBO J. 1998, 17 (21): 6289-99. 10.1093/emboj/17.21.6289.
Weinhold B, Schratt G, Arsenian S, Berger J, Kamino K, Schwarz H, Ruther U, Nordheim A: Srf(-/-) ES cells display non-cell-autonomous impairment in mesodermal differentiation. EMBO J. 2000, 19 (21): 5835-44. 10.1093/emboj/19.21.5835.
Buchwalter G, Gross C, Wasylyk B: Ets ternary complex transcription factors. Gene. 2004, 324: 1-14. 10.1016/j.gene.2003.09.028.
Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 2003, 31 (13): 3497-500. 10.1093/nar/gkg500.
Spencer JA, Misra RP: Expression of the SRF gene occurs through a Ras/Sp/SRF-mediated-mechanism in response to serum growth signals. Oncogene. 1999, 18 (51): 7319-27. 10.1038/sj.onc.1203121.
Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy S: Rfam: an RNA family database. Nucl Acids Res. 2003, 31: 439-441. 10.1093/nar/gkg006.
Griffiths-Jones S: The microRNA Registry. Nucl Acids Res. 2004, 32: D109-D111. 10.1093/nar/gkh023. [Database Issue]
Kiss AM, Jády BE, Darzacq X, Verheggen C, Bertrand E, Kiss T: Cajal body-specific pseudouridylation guide RNA is composed of two box H/ACA snoRNA-like domains. Nucl Acids Res. 2002, 30: 4643-4649. 10.1093/nar/gkf592.
Lai EC, Tomancak P, Williams RW, Rubin GM: Computational identification of Drosophila microRNA genes. Genome Biol. 2003, 4: R42-10.1186/gb-2003-4-7-r42. (20 pages)
Spencer JA, Major ML, Misra RP: Basic fibroblast growth factor activates serum response factor gene expression by multiple distinct signaling mechanisms. Mol Cell Biol. 1999, 19 (6): 3977-88.
Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs. PNAS. 2005, 0409169102-[http://www.pnas.org/cgi/content/abstract/0409169102v1]
Schmid CD, Praz V, Delorenzi M, Perier R, Bucher P: The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Nucleic Acids Res. 2004, 32 (Database issue): D82-5. 10.1093/nar/gkh122.
Suzuki Y, Yamashita R, Sugano S, Nakai K: DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nucleic Acids Res. 2004, 32 (Database issue): D78-81. 10.1093/nar/gkh076.
Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M, Yura K, Miyazaki S, Ikeo K, Homma K, Kasprzyk A, Nishikawa T, Hirakawa M, Thierry-Mieg J, Thierry-Mieg D, Ashurst J, Jia L, Nakao M, Thomas MA, Mulder N, Karavidopoulou Y, Jin L, Kim S, Yasuda T, Lenhard B, Eveno E, Suzuki Y, Ya-masaki C, Takeda J, Gough C, Hilton P, Fujii Y, Sakai H, Tanaka S, Amid C, Bellgard M, Mde FBM, Bono H, Bromberg SK, Brookes AJ, Bruford E, Carninci P, Chelala C, Couillault C, Souza SJ, Debily MA, Devignes MD, Dubchak I, Endo T, Estreicher A, Eyras E, Fukami-Kobayashi K, Gopinath GR, Graudens E, Hahn Y, Han M, Han ZG, Hanada K, Hanaoka H, Harada E, Hashimoto K, Hinz U, Hirai M, Hishiki T, Hopkinson I, Imbeaud S, Inoko H, Kanapin A, Kaneko Y, Kasukawa T, Kelso J, Kersey P, Kikuno R, Kimura K, Korn B, Kuryshev V, Makalowska I, Makino T, Mano S, Mariage-Samson R, Mashima J, Matsuda H, Mewes HW, Minoshima S, Nagai K, Nagasaki H, Nagata N, Nigam R, Ogasawara O, Ohara O, Ohtsubo M, Okada N, Okido T, Oota S, Ota M, Ota T, Otsuki T, Piatier-Tonneau D, Poustka A, Ren SX, Saitou N, Sakai K, Sakamoto S, Sakate R, Schupp I, Servant F, Sherry S, Shiba R, Shimizu N, Shimoyama M, Simpson AJ, Soares B, Steward C, Suwa M, Suzuki M, Takahashi A, Tamiya G, Tanaka H, Taylor T, Terwilliger JD, Unneberg P, Veeramachaneni V, Watanabe S, Wilming L, Yasuda N, Yoo HS, Stodolsky M, Makalowski W, Go M, Nakai K, Takagi T, Kanehisa M, Sakaki Y, Quackenbush J, Okazaki Y, Hayashizaki Y, Hide W, Chakraborty R, Nishikawa K, Sugawara H, Tateno Y, Chen Z, Oishi M, Tonellato P, Apweiler R, Okubo K, Wagner L, Wiemann S, Strausberg RL, Isogai T, Auffray C, Nomura N, Gojobori T, Sugano S: Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2004, 2 (6): E162-10.1371/journal.pbio.0020162.
Bono H, Kasukawa T, Furuno M, Hayashizaki Y, Okazaki Y: FANTOM DB: database of Functional Annotation of RIKEN Mouse cDNA Clones. Nucleic Acids Res. 2002, 30: 116-118. 10.1093/nar/30.1.116.
Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 2001, 29: 137-40. 10.1093/nar/29.1.137.
We greatly acknowledge funding by the EU (BioSapiens Network of Excellence). Steffen Grossmann is supported by the Deutsche Forschungsgemeinschaft (DFG) as a member of the Sonderforschungsbereich (SFB) 618 – Theoretical Biology: Robustness, Modularity and Evolutionary Design of Living Systems.
Christoph Dieterich built the entire pipeline and some parts of the web interface. Steffen Grossmann annotated transcription factor binding sites and provided parts of the web interface. Andrea Tanzer analyzed known and novel RNA elements in the multiple alignments of the CORG database. Stefan Röpcke set up our database of binding site descriptions. Peter F. Arndt worked on an appropriate alignment scoring scheme. Peter F. Stadler and Martin Vingron initiated this work and provided all necessary infrastructure.
Electronic supplementary material
Additional File 1: Distribution of distance between start of transcription and translation. Histogram of observed genomic distances between start sites of transcription and translation in man for 1,700 entries from the EPD. The red and blue lines indicate the 90% and 95% quantiles, respectively. Distances greater than $10^6$ bp were excluded from the analysis as they mostly occur due to mismappings in the ENSEMBL database. (PDF 4 KB)
Cite this article
Dieterich, C., Grossmann, S., Tanzer, A. et al. Comparative promoter region analysis powered by CORG. BMC Genomics 6, 24 (2005) doi:10.1186/1471-2164-6-24
- Transcription Start Site
- Transcription Factor Binding Site
- Internal Ribosome Entry Site
- Serum Response Factor