Exploring the reasons for the large density of triplex-forming oligonucleotide target sequences in the human regulatory regions
© Goñi et al; licensee BioMed Central Ltd. 2006
Received: 21 December 2005
Accepted: 27 March 2006
Published: 27 March 2006
DNA duplex sequences that can be targets for triplex formation are highly over-represented in the human genome, especially in regulatory regions.
Here we used bioinformatics tools to study several properties of triplex target sequences in an attempt to determine which of them make these sequences so special in the genome.
Our results strongly suggest that the unique physical properties of these sequences make them particularly suitable as "separators" between protein-recognition sites in the promoter region.
The presence of a TFO in the major groove of the duplex leads to major distortions in the capacity of the target duplex to be recognized by specific proteins [2, 6, 7]. This produces major changes in the functionality of the target duplex, which have been exploited for biotechnological and biomedical purposes [2, 3, 8–10]. Thus, modified TFOs containing suitable chemical compounds have been used to develop artificial nucleases [11, 12], to induce recombination in mammalian cells, and to trigger mutations in target DNA [13–16]. In all these cases, the formation of the triplex guides the active chemical compound to the proper position in the target genome. Unmodified TFOs increase the rate of mutations at the triplex target sequence (TTS), which opens the possibility of knocking down genes [9, 16, 17]. Triplex formation inhibits mRNA synthesis [2, 8, 9, 18–23] when the TTS is located in a regulatory region. Furthermore, when the triplex is formed in the middle of a gene, mRNA elongation is stopped just before the TTS, which indicates that triplex binding is strong enough to displace the complex transcriptional machinery [24, 25]. These two findings open up the possibility of using TFOs as "anti-gene" drugs. These pharmacological agents would have the capacity to specifically arrest the transcription of pathological genes, thus leading to an intense and targeted therapeutic action [3, 8–10]. However, despite their promise, anti-gene therapies still face many technical problems [2, 3], and the density and location of TTSs in human genes are unclear.
In a recent paper, we explored the presence of TTSs (polypurine tracts which are expected to lead to stable triplexes in physiological conditions) in the human genome. Our analysis showed that these sequences are vastly over-represented compared with random expectation. Interestingly, the largest relative concentration of TTSs occurs in the upstream regulatory region (especially in the proximal promoter region: 100 nts upstream). Recent studies by our group (Goñi et al. Unpublished results) show that these trends are common to many other organisms, from mammals to prokaryotes, indicating that many genes may be targets for triplex formation. However, this interesting finding raises an intriguing question: why are TTSs so abundant in regions that are crucial for the control of genome function?
Here we present an extensive descriptive analysis of TTSs in the human genome in an attempt to elucidate why these sequences are so abundant in regulatory regions. Our results indicate that the unique physical properties of TTSs may explain this overpopulation.
Results and discussion
Are TTSs part of an ancient DNA auto-regulatory mechanism?
Are TTSs rich in transcription factor recognition sites?
As described in Methods, we mapped the TRANSFAC database onto the human promoter region (up to 200 nts upstream of the transcription origin) and computed the occurrence of nucleotides belonging to long TTSs (length equal to or greater than 10) in the promoter region around each transcription factor binding site (TFBS). Sequences recognized as targets of transcription factors were much less likely to lie in a TTS than neighbouring promoter sequences [see Figure 6]. Furthermore, generation of random sequences (TTS and non-TTS) showed that non-TTS random sequences are much more likely to be transcription factor binding sites than TTS random sequences. Overall, we must conclude that, even though small (4–6 nts) polypurine segments can in some cases be found in TFBSs, TTSs as defined here (length equal to or greater than 10) are very rarely present in TFBSs. The hypothesis that TTSs are over-represented in promoter regions because they contain TFBSs can therefore be ruled out.
Do TTSs have distinct intrinsic physical properties?
The previous results seem contradictory and somewhat difficult to rationalize. Although TTSs in promoters (TTSp) are over-represented, appear in key genes for the control of physiological processes and are very well conserved, they are not targets for transcription factors. Nor did we find evidence that TTSs acted as an ancient regulatory mechanism mimicking the functionality of interference RNAs. How can we reconcile all these findings? In our view, the only possibility is that TTSs have intrinsic physicochemical properties that are useful when present in the promoter region of certain genes.
In addition to the possible role of TTSs in the organization of DNA in nucleosomes, the unique physicochemical properties of these sequences have a large impact on the promoter region. Large flexibility is desirable for DNA that needs to bind to proteins and deform its structure accordingly, whereas rigidity is useful for defining spacing elements that confine protein-induced DNA deformation to specific regions of the duplex. Larger curvature is also desirable, since it can help in the relative positioning of transcription factors in 3-D space and thus in establishing physiologically critical protein-protein interactions. Thus, the presence of TTSs in promoter regions can provide the cell with specific mechanisms, in most cases probably unrelated to triplex formation, to enhance the activation or repression of genes that are crucial for the regulation of cellular processes.
The results presented in this paper show that, irrespective of whether the cell now uses or once used a triplex-mediated control mechanism, TTSs are very abundant in the promoter region and genes with TTSp are crucial for the control of cell life. TTSp do not bind to transcription factors; nevertheless, they have a conservation profile similar to that of non-TTS segments in promoter regions, including sequences recognized by transcription factors. The analysis reported here suggests that TTSp confer on the promoter region unique physical properties that can contribute to better functioning of regulatory proteins. All these results strongly support the notion that triplex-based anti-gene technology is widely applicable in the control of pathologies related to malfunctioning of the regulatory mechanisms of physiological processes.
Triplex-target sequences (TTS) are over-represented in the human genome. This over-representation is especially large when promoter regions 100 to 200 nts upstream are considered.
Genes with TTS in promoters are enriched in functions related to the regulation of physiological processes, and are very often characterized as transcription factors or related proteins.
TTS are not part of sequences which are directly targeted by transcription factors, but their (human-mouse) conservation profiles suggest that they are important for gene functionality.
TTS are significantly more curved and rigid than normal DNA, which suggests that (in addition to other possible functions) TTS act as spacing fragments which help in the correct positioning of transcription factors.
Sequence information of the human genome was taken from the UCSC database (version hg17; May 2004), developed by the International Human Genome Mapping Consortium. Annotation of the genes, introns, exons and coding regions considered in this study was obtained from the refGene (refSeq) collection. To avoid counting the same genomic locus multiple times, overlapping refSeq entries were ignored. The sets of upstream regions starting 5000, 2000, 1000, 500, 200 and 100 nts upstream of the transcription start of refSeq genes were extracted from the upstream5000 file of the same database. The 0–5000 and especially the 0–100 regions are expected to be largely enriched in promoter sequences. UCSC also provides the annotation of CpG islands (CpGisland file). CpG promoters are those with an overlapping CpG island feature in the 200-nt upstream region.
Definition of TTS
Possible triplex target sequences (TTSs) were defined as polypurine tracts of any size on either strand. No mismatches in the triplexes were allowed, implying that a strict triplex definition was used. The number of TTSs would increase substantially if 5% or 10% mismatching were allowed.
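The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perl library used in this work: the function name `find_tts` and the `min_len` parameter are hypothetical, and the default of 10 simply mirrors the "long TTS" threshold used elsewhere in the text. A polypurine tract on the reverse strand is detected as a polypyrimidine run on the given strand.

```python
import re

# A purine run on the given strand, or a pyrimidine run (which is a purine
# run on the opposite strand). No mismatches are tolerated.
_TRACT = re.compile(r"[AG]+|[CT]+")

def find_tts(seq, min_len=10):
    """Return (start, end, strand) for every mismatch-free polypurine tract.

    start/end are 0-based, end-exclusive coordinates on the given strand.
    strand is '+' if the purines lie on the given strand, '-' otherwise.
    min_len is a hypothetical filter; set it to 1 for "any size".
    """
    hits = []
    for m in _TRACT.finditer(seq.upper()):
        if m.end() - m.start() >= min_len:
            strand = "+" if m.group()[0] in "AG" else "-"
            hits.append((m.start(), m.end(), strand))
    return hits
```

Characters other than A, C, G and T (for example N) break a tract, which is consistent with the strict, mismatch-free definition.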
Background models of TTS distributions
In order to determine the significance of a given TTS distribution we need to create background distributions. We first used an analytical binomial model in which all base dimers have the same probability. To generate a more reliable background model we modified the method to account for the dimer distributions in the human genome and in human promoters (where Pur-Pyr, Pur-Pur and Pyr-Pur are not equally probable). We also considered a numerical background model (which at the dimer level fits the binomial model; [see Figure 2 in reference 26]) that allows us to introduce trimer bias in the promoter region as well. This numerical model was built by creating a 108-mer sequence that maintains the trimer (or dimer) population found in human promoters. Finally, a last random model (for the promoter region) was created by using promoter-specific shuffleseq models created with the EMBOSS package version 3.0. In the latter case we generated 10 sets of sequences by shuffling our promoter collections. A simple visual inspection shows that the real and background distributions are very different but, in any case, we confirmed this by running Clover, a tool for detection of functional DNA motifs via statistical over-representation. For this purpose we created a matrix (length 15 nt) in which, at all positions, A and G score 1 and T and C score 0 (Clover automatically scans both strands).
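The shuffling-based background can be illustrated with a minimal Python sketch. It is a stand-in for the idea rather than a reproduction of EMBOSS shuffleseq: it performs a plain composition-preserving shuffle, and the names `count_tts` and `shuffled_background`, the `min_len` threshold and the fixed `seed` are all assumptions introduced for this example.

```python
import random
import re

_TRACT = re.compile(r"[AG]+|[CT]+")

def count_tts(seq, min_len=10):
    """Count mismatch-free polypurine tracts (either strand) of at least min_len."""
    return sum(1 for m in _TRACT.finditer(seq) if len(m.group()) >= min_len)

def shuffled_background(seq, n_sets=10, min_len=10, seed=0):
    """TTS counts in n_sets shuffled copies of seq (base composition preserved)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = []
    for _ in range(n_sets):
        bases = list(seq)
        rng.shuffle(bases)
        counts.append(count_tts("".join(bases), min_len))
    return counts
```

Comparing `count_tts(promoter)` with the values returned by `shuffled_background(promoter)` gives a rough empirical view of over-representation; a dimer- or trimer-preserving shuffle would be needed to reproduce the more refined models described above.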
TTSs in promoters conserved in human and mouse
To evaluate the conservation of TTSs in the promoter region (TTSp), we took the upstream5000 file from the build 33 mouse assembly. In order to match human and mouse promoter regions, we translated gene codes to protein names using the loc2ref file from the NCBI database for both human and mouse genes. We then searched for correspondences in HomoloGene Build 39.1 from the NCBI database. This procedure generated a list of 5000 pairs of human-mouse genes.
We calculated human-mouse identity for a chosen 15-nt sequence (TTS or not) from a human region by aligning it across the corresponding mouse region. The alignment was done using a bit-vector alignment algorithm. For each entry, we recorded the greatest percentage of matching bases in the best alignment. We estimated the conservation by averaging this value over all entries of the region.
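The identity score described above can be approximated with the following Python sketch. It deliberately swaps the bit-vector approximate-matching algorithm for a simple ungapped sliding comparison, so insertions and deletions are not handled; the function names `best_identity` and `mean_conservation` are hypothetical.

```python
def best_identity(window, target):
    """Highest fraction of matching bases over all ungapped placements
    of window along target (a simplification: no gaps are allowed)."""
    k = len(window)
    if k == 0 or len(target) < k:
        return 0.0
    best = 0
    for i in range(len(target) - k + 1):
        matches = sum(a == b for a, b in zip(window, target[i:i + k]))
        best = max(best, matches)
    return best / k

def mean_conservation(human_windows, mouse_region):
    """Average best identity of several human windows against one mouse region."""
    scores = [best_identity(w, mouse_region) for w in human_windows]
    return sum(scores) / len(scores) if scores else 0.0
```

In this sketch each human 15-mer would be passed as `window` and the matched mouse promoter as `target`, and the per-region conservation is the mean of the resulting values.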
For the sake of completeness, the comparison was also performed using the human/mouse aligned sequences present in the UCSC multiz8way (8 vertebrates) multiple alignments. As before, conservation was computed from the identity in 15-mer sequences, averaging the data over all the 15-mer windows in the studied segment. Results obtained with this and the previous alignment protocol are very similar, which supports the robustness of our results. Using these alignments we also computed the conservation in promoter regions annotated as transcription factor binding sites [34, 35].
Functional annotation of groups of genes
To test whether a group of genes was significantly enriched in one or more functional terms (out of several thousand) with respect to the background (usually the rest of the genes), we used the FatiGO algorithm from the Babelomics suite for functional annotation of sets of genes. This algorithm uses known functional annotations for genes obtained from the Gene Ontology (GO) consortium databases. Both lists of genes (the group of interest and the background) were converted into two lists of GO terms using the corresponding gene-GO association table. For each GO term the data are represented as a 2 × 2 contingency table, with rows representing presence/absence of the GO term and columns representing the two lists. A Fisher's exact test for 2 × 2 contingency tables was used. Since thousands of GO terms are tested simultaneously without an a priori hypothesis on any particular term, p-values must be corrected for multiple testing. For this correction, we used the strict false discovery rate (FDR) method described by Benjamini and Yekutieli.
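The statistical core of this procedure (one Fisher's exact test per GO term, followed by Benjamini-Yekutieli adjustment) can be sketched with the Python standard library alone. This is an independent illustration, not FatiGO code; the function names are hypothetical and the two-sided convention used (summing all tables no more probable than the observed one) is an assumption.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more probable than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # step down from the largest p-value
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Here `a` and `b` would be the numbers of genes annotated with the GO term in the group of interest and in the background, and `c` and `d` the numbers of genes lacking the term in each list.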
Hypothetical human auto-regulated TTSs
Hypothetical auto-regulated TTSs are those that appear in the transcribed and promoter (200 nts upstream) regions of two distinct genes. The TFO able to recognize a TTSp is defined as the sequence that matches i) the complement of the TTS or ii) the reverse of the TTS. The first case maps potential endogenous TFOs for parallel triplexes, whereas the second maps potential antiparallel TFOs [see Figure 1]. Genes were divided into three sets: i) those containing a TTS in the promoter region, ii) those containing the corresponding TFO in the transcribed region and iii) other genes (used as background). We ran FatiGO [36, 37] to identify possible relationships between genes containing a TTS in the promoter and those containing the corresponding TFO in transcribed regions.
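The two matching rules above translate directly into string operations. The following Python sketch is illustrative only: the function names are hypothetical, and the transcribed region is assumed to be given in the DNA alphabet (T rather than U).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def candidate_tfos(tts):
    """Candidate endogenous TFO sequences for a TTS, following the two rules
    in the text: complement (parallel) and reverse (antiparallel)."""
    tts = tts.upper()
    return {"parallel": tts.translate(_COMPLEMENT), "antiparallel": tts[::-1]}

def find_endogenous_tfos(tts, transcript):
    """0-based start positions of each candidate TFO in a transcribed region."""
    transcript = transcript.upper()
    hits = {}
    for kind, tfo in candidate_tfos(tts).items():
        positions = []
        start = transcript.find(tfo)
        while start != -1:  # collect overlapping occurrences too
            positions.append(start)
            start = transcript.find(tfo, start + 1)
        hits[kind] = positions
    return hits
```

A gene pair would then be flagged as hypothetically auto-regulated when a TTS from the promoter of one gene yields at least one hit in the transcribed region of a different gene.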
TTSs in transcription factor binding sites
The 0–200 nt region (especially the 0–100 segment) is expected to be largely enriched in transcription factor binding sites. We located all the putative transcription factor binding sites in this region of the human genome by mapping the latest public version of the TRANSFAC database to the upstream5000 file in the UCSC Genome Database using the TFBS Perl module. We then computed the average percentage of nucleotides lying in a TTS of length 10 or greater as a function of the distance from the centre of the transcription factor recognition sequence.
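The distance profile described above can be sketched as follows. The sketch assumes that the TFBS centres are already known as 0-based positions on the promoter sequence (the TRANSFAC mapping itself is not reproduced), and the names `tts_mask`, `tts_profile` and the `flank` parameter are hypothetical.

```python
import re

_TRACT = re.compile(r"[AG]+|[CT]+")

def tts_mask(seq, min_len=10):
    """Boolean list marking positions that lie inside a TTS of >= min_len nt."""
    mask = [False] * len(seq)
    for m in _TRACT.finditer(seq.upper()):
        if m.end() - m.start() >= min_len:
            for i in range(m.start(), m.end()):
                mask[i] = True
    return mask

def tts_profile(seq, tfbs_centres, flank=20, min_len=10):
    """Percentage of positions inside a long TTS at each offset from the
    TFBS centres; offsets with no valid position map to None."""
    mask = tts_mask(seq, min_len)
    profile = {}
    for offset in range(-flank, flank + 1):
        inside = total = 0
        for centre in tfbs_centres:
            pos = centre + offset
            if 0 <= pos < len(seq):
                total += 1
                inside += mask[pos]
        profile[offset] = 100.0 * inside / total if total else None
    return profile
```

A dip of the profile around offset 0 relative to the flanks would correspond to the depletion of long TTSs inside binding sites reported in the Results.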
Physical descriptors of DNA
DNA curvature was calculated using the data and the algorithm developed by Bolshoy and co-workers. This algorithm calculates the three-dimensional path of a DNA molecule and estimates the curvature of the axis path. The scale is in arbitrary curvature units (c.u.), ranging from 0 (i.e. no curvature) to 1.0, which is the curvature of DNA when wrapped around the histone core of the nucleosome.
To predict the stability of the sequence, we used the base step data from SantaLucia et al. and the formula described in their study. DNA stacking energies were predicted using the accurate interaction energies published by Sponer et al. [42, 43] for nucleic acid base pairs.
The flexibility of a tract was measured by the configurational volume [see eq. 1 in reference 44] as a function of the Twist-twist, Shift-shift and Roll-roll force constants determined from MD simulations by Lankas et al.
where k is the Boltzmann constant, T is the temperature (taken as 298 K) and Ktwist, Ktilt and Kroll are harmonic force constants expressed in kcal/mol·deg².
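To illustrate how such a descriptor behaves, the sketch below assumes three uncoupled harmonic coordinates, for which the thermally accessible range along each coordinate scales as the square root of kT/K. This is an assumption made for illustration and is not necessarily the exact form of eq. 1 in reference 44; the function name and argument names are hypothetical.

```python
from math import prod, sqrt

K_BOLTZMANN = 0.0019872041  # kcal/(mol*K)

def configurational_volume(k_twist, k_tilt, k_roll, temperature=298.0):
    """Accessible volume (deg^3) for three uncoupled harmonic coordinates.

    Force constants are in kcal/(mol*deg^2). Assumed form: the product of
    sqrt(kT / K_i) over the three coordinates (illustrative only).
    """
    kt = K_BOLTZMANN * temperature
    return prod(sqrt(kt / k) for k in (k_twist, k_tilt, k_roll))
```

Under this assumption stiffer steps (larger force constants) give a smaller volume, so a rigid tract is one with a low value of the descriptor.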
The methods described are implemented in a Perl script library, which is available upon request.
We thank Drs. Roderic Guigó and Xavier de la Cruz for their helpful discussion. This work was supported by the Fundació La Caixa, Fundación BBVA and the Spanish Ministry of Science (BIO2003-06848).
- Felsenfeld G, Davies DR, Rich A: Formation of a three-stranded polynucleotide molecule. J Am Chem Soc. 1957, 79: 2023-2024. 10.1021/ja01565a074.
- Soyfer VN, Potaman VN: Triple-Helical Nucleic Acids. 1996, Springer-Verlag: New York.
- Scaria PV, Shafer RH: Calorimetric analysis of triple helices targeted to the d(G3A4G3)·d(C3T4C3) duplex. Biochemistry. 1996, 35: 10985-10994. 10.1021/bi960966g.
- Robles J, Grandas A, Pedroso E, Luque FJ, Eritja R, Orozco M: Nucleic acid triple helices: stability effects of nucleobase modifications. Curr Org Chem. 2002, 6: 1333-1368. 10.2174/1385272023373482.
- Chandler SP, Fox KR: Specificity of antiparallel DNA triple helix formation. Biochemistry. 1996, 35: 15038-15048. 10.1021/bi9609679.
- Shields GA, Laughton CA, Orozco M: Molecular dynamics simulations of the d(T•A•T) triple helix. J Am Chem Soc. 1997, 119: 7463. 10.1021/ja970601z.
- Jiménez E, Vaquero A, Espinás ML, Soliva R, Orozco M, Bernués J, Azorin F: The GAGA factor of Drosophila binds triple-stranded DNA. J Biol Chem. 1998, 273: 24640. 10.1074/jbc.273.38.24640.
- Giovannangeli C, Hélène C: Triplex technology takes off. Nature Biotechnology. 2000, 18: 1245. 10.1038/82348.
- Knauert MP, Glazer PM: Triplex forming oligonucleotides: sequence-specific tools for gene targeting. Human Molecular Genetics. 2001, 10: 2243.
- van Dongen MJP, Doreleijers JF, van der Marel GA, van Boom JH, Hilbers CW, Wijmenga SS: Structure and mechanism of formation of the H-y5 isomer of an intramolecular DNA triple helix. Nature Structural Biology. 1999, 6: 854. 10.1038/12313.
- Strobel SA, Dervan PB: Triple helix-mediated single-site enzymatic cleavage of megabase genomic DNA. Methods Enzymol. 1992, 216: 309.
- Zain R, Marchand C, Sun J, Nguyen CH, Bisagni E, Garestier T, Hélène C: Design of a triple-helix-specific cleaving reagent. Chem Biol. 1999, 6: 771. 10.1016/S1074-5521(99)80124-0.
- Luo Z, Macris MA, Faruqi AF, Glazer PM: High-frequency intrachromosomal gene conversion induced by triplex-forming oligonucleotides microinjected into mouse cells. Proc Natl Acad Sci USA. 2000, 97: 9003. 10.1073/pnas.160004997.
- Havre PA, Gunther EJ, Gasparro FP, Glazer PM: Targeted mutagenesis of DNA using triple helix-forming oligonucleotides linked to psoralen. Proc Natl Acad Sci USA. 1993, 90: 7879.
- Majumdar A, Khorlin A, Dyatkina N, Lin FL, Powell J, Liu J, Fei Z, Khripine Y, Watanabe KA, George J, Glazer PM, Seidman MM: Targeted gene knockout mediated by triple helix forming oligonucleotides. Nat Genet. 1998, 20: 212. 10.1038/2530.
- Barre FX, Ait-Si-Ali S, Giovannangeli C, Luis R, Robin P, Pritchard LL, Hélène C, Harel-Bellan: Unambiguous demonstration of triple-helix-directed gene modification. Proc Natl Acad Sci USA. 2000, 97: 3084. 10.1073/pnas.050368997.
- Wang G, Seidman MM, Glazer PM: Mutagenesis in mammalian cells induced by triple helix formation and transcription-coupled repair. Science. 1996, 271: 802.
- Vasquez KM, Narayanan L, Glazer PM: Specific mutations induced by triplex-forming oligonucleotides in mice. Science. 2000, 290: 530. 10.1126/science.290.5491.530.
- Duval-Valentin G, Thuong NT, Hélène C: Specific inhibition of transcription by triple helix-forming oligonucleotides. Proc Natl Acad Sci USA. 1992, 89: 504.
- Cooney M, Czernuszewicz G, Postel EH, Flint SJ, Hogan ME: Site-specific oligonucleotide binding represses transcription of the human c-myc gene in vitro. Science. 1988, 241: 456.
- Grigoriev M, Praseuth D, Robin P, Hemar A, Saison-Behmoaras T: A triple helix-forming oligonucleotide-intercalator conjugate acts as a transcriptional repressor via inhibition of NF kappa B binding to interleukin-2 receptor alpha-regulatory sequence. J Biol Chem. 1992, 267: 3389.
- Joseph J, Kandala JC, Veerapanane D, Weber KT, Guntaka RV: Antiparallel polypurine phosphorothioate oligonucleotides form stable triplexes with the rat alpha1(I) collagen gene promoter and inhibit transcription in cultured rat fibroblasts. Nucleic Acids Res. 1997, 25: 2182. 10.1093/nar/25.11.2182.
- Postel EH, Flint SJ, Kessler DJ, Hogan ME: Evidence that a triplex-forming oligodeoxyribonucleotide binds to the c-myc promoter in HeLa cells, thereby reducing c-myc mRNA levels. Proc Natl Acad Sci USA. 1991, 88: 8227.
- Young SL, Krawczyk SH, Matteucci MD, Toole JJ: Triple helix formation inhibits transcription elongation in vitro. Proc Natl Acad Sci USA. 1991, 88: 10023.
- Faria M, Wood CD, Perrouault L, Nelson JS, Winter A, White MR, Hélène C: Targeted inhibition of transcription elongation in cells mediated by triplex-forming oligonucleotides. Proc Natl Acad Sci USA. 2000, 97: 3862. 10.1073/pnas.97.8.3862.
- Goñi JR, de la Cruz X, Orozco M: Triplex-forming oligonucleotide target sequences in the human genome. Nucleic Acids Res. 2004, 32: 354-360. 10.1093/nar/gkh188.
- Anderson JD, Widom J: Poly(dA-dT) promoter elements increase the equilibrium accessibility of nucleosomal DNA target sites. Mol Cell Biol. 2001, 21: 3830-3839. 10.1128/MCB.21.11.3830-3839.2001.
- Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Kent WJ: The UCSC Genome Browser Database. Nucleic Acids Res. 2003, 31: 51. 10.1093/nar/gkg129.
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
- Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.
- Frith MC, Fu Y, Yu L, Chen JF, Hansen U, Weng Z: Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res. 2004, 32 (4): 1372-1381. 10.1093/nar/gkh299.
- The NCBI Database. [http://www.ncbi.nlm.nih.gov/]
- Myers G: A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming. JACM. 1999, 46: 395-415. 10.1145/316542.316550.
- Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Prüß M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 2000, 28: 316-319. 10.1093/nar/28.1.316.
- Lenhard B, Wasserman WW: TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics. 2002, 18: 1135-1136. 10.1093/bioinformatics/18.8.1135.
- Al-Shahrour F, Díaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics. 2004, 20: 578-580. 10.1093/bioinformatics/btg455.
- Al-Shahrour F, Minguez P, Vaquerizas J, Conde L, Dopazo J: Babelomics: a suite of web-tools for functional annotation and analysis of group of genes in high-throughput experiments. Nucleic Acids Res. 2005.
- Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, 32 (Database issue): D258-D261. 10.1093/nar/gkh036.
- Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Annals of Statistics. 2001, 29: 1165-1188. 10.1214/aos/1013699998.
- Shpigelman ES, Trifonov EN, Bolshoy A: CURVATURE: software for the analysis of curved DNA. Comput Appl Biosci. 1993, 9: 435.
- SantaLucia J: A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci USA. 1998, 95: 1460. 10.1073/pnas.95.4.1460.
- Sponer J, Gabb HA, Leszczynski J, Hobza P: Base-base and deoxyribose-base stacking interactions in B-DNA and Z-DNA: a quantum-chemical study. Biophys J. 1997, 73: 76-87.
- Sponer J, Jurecka P, Hobza P: Accurate interaction energies of hydrogen-bonded nucleic acid base pairs. J Am Chem Soc. 2004, 126: 10142-10151. 10.1021/ja048436s.
- Pérez A, Blas JR, Rueda M, López-Bes JM, de la Cruz X, Orozco M: Exploring the essential dynamics of DNA. J Chem Theor Comput. 2005, 1: 790-800. 10.1021/ct050051s.
- Lankas F, Sponer J, Langowski J, Cheatham T: DNA Basepair Step Deformability Inferred from Molecular Dynamics Simulations. Biophys J. 2003, 85: 2872.