STaRRRT: a table of short tandem repeats in regulatory regions of the human genome
- Katherine A Bolton†1, 2,
- Jason P Ross†3, 4,
- Desma M Grice1, 2, 3, 4,
- Nikola A Bowden1, 2,
- Elizabeth G Holliday1, 5,
- Kelly A Avery-Kiejda1, 2 and
- Rodney J Scott1, 2, 6, 7
© Bolton et al.; licensee BioMed Central Ltd. 2013
Received: 27 May 2013
Accepted: 5 November 2013
Published: 15 November 2013
Tandem repeats (TRs) are unstable regions commonly found within genomes that have consequences for evolution and disease. In humans, polymorphic TRs are known to cause neurodegenerative and neuromuscular disorders and are associated with complex diseases such as diabetes and cancer. If present in upstream regulatory regions, TRs can modify chromatin structure and affect transcription, resulting in altered gene expression and protein abundance. The most common TRs are short tandem repeats (STRs), or microsatellites. Promoter-located STRs are considerably more polymorphic than coding region STRs. As such, they may be a common driver of phenotypic variation. To study STRs located in regulatory regions, we have performed genome-wide analysis to identify all STRs present in a region that is 2 kilobases upstream and 1 kilobase downstream of the transcription start sites of genes.
The Short Tandem Repeats in Regulatory Regions Table, STaRRRT, contains the results of the genome-wide analysis, outlining the characteristics of 5,264 STRs present in the upstream regulatory region of 4,441 human genes. Gene set enrichment analysis has revealed significant enrichment for STRs in cellular, transcriptional and neurological system gene promoters and genes important in ion and calcium homeostasis. The set of enriched terms has broad similarity to that seen in coding regions, suggesting that regulatory region STRs are subject to similar evolutionary pressures as STRs in coding regions and may, like coding region STRs, have an important role in controlling gene expression.
STaRRRT is a readily-searchable resource for investigating potentially polymorphic STRs that could influence the expression of any gene of interest. The processes and genes enriched for regulatory region STRs provide potential novel targets for diagnosing and treating disease, and support a role for these STRs in the evolution of the human genome.
Keywords: Short tandem repeats; STR; Microsatellites; Simple sequence repeats; SSR; Promoter; Regulatory region; Neurological disease; Neural genes; Evolution
Tandem repeats (TRs) are stretches of DNA that contain nucleotide patterns repeated adjacent to one another and are common throughout the human genome. TRs are classified by repeat unit length into further categories including microsatellites, or short tandem repeats (STRs), which are repeats with a unit length of less than 10 nucleotides or base pairs (bp). TRs display a non-random distribution and a particular bias in location to genic and regulatory regions [2, 3]. In humans, approximately 17% of genes contain TRs within their coding regions. In yeast (Saccharomyces cerevisiae), approximately 25% of all gene promoters contain at least one TR, many of which consist of short, AT-rich sequences; the distribution of TRs in human gene promoters is similar [5, 6].
TRs have a propensity to mutate and become polymorphic by expansion or contraction in the number of repeat units. This may be due to slippage during DNA replication, through unequal crossing-over during recombination, or by imprecise repair of double-strand DNA breaks [7–9]. TRs exhibit mutation rates around 10- to 10^5-fold higher than average rates for non-repeated DNA in other parts of the genome [7, 10–12]. Such polymorphic TRs are often described as variable number of tandem repeats (VNTR). The frequency of TR mutations is dependent upon the length of the repeat unit (known as the “period”), the number of repeat units, and the percentage match to the consensus sequence or “purity” of the repeat tract [4, 13]. The number of repeat units and purity of the repeat tract are the most important predictors for repeat variability, with an increase in the number of repeats and/or purity resulting in a higher propensity to be polymorphic [13, 14]. Naslund et al. (2005) found that doubling the repeat unit number corresponded to a 15-fold increase in the likelihood of the repeat being polymorphic and that each 10% increase in repeat purity resulted in an 18-fold increase in the likelihood of polymorphism.
STRs are a common source of genetic variation in promoter regions and alleles can be highly variable in length. In humans, the rate of STR length polymorphism within 1 kb upstream of the transcription start site (TSS) is over 12-fold higher than in exonic regions, 1.5-fold higher than in untranslated regions (UTRs) and almost comparable to the rate in intragenic and intronic regions. Despite this hyper-variability, there is also evidence for promoter-localised STRs being evolutionarily conserved. The conservation rate of STRs is dependent upon the proximity to the TSS, with closer STRs more likely to be conserved.
Polymorphic TRs can affect transcription by a number of means. Length polymorphism has consequences for transcription, with TR-containing promoters showing significantly higher rates of transcriptional expression divergence. In yeast, it is known that nucleosome position is inversely correlated with tandem repeat positions, with nucleosome depletion being especially pronounced around AT-rich repeats. In addition, altering the length of TRs in promoter regions directly affects the local chromatin structure, resulting in altered transcriptional activity and gene expression [5, 17]. Further, potential sites of Z-DNA are enriched at the promoter and 5’-end of human genes, and Z-DNA, which expels bound nucleosomes, is more likely to form where the AC/GT dinucleotide repeat is present. Combined, the exceptionally high polymorphism rate, evolutionary conservation around the TSS and evidence for transcriptional regulation suggest that promoter STRs are functional and may be an important source of rapid evolutionary change. If so, STRs should also be associated with disease.
Polymorphic TRs are implicated in more than 40 neuromuscular and neurodegenerative diseases, such as spinobulbar muscular atrophy and Huntington’s disease; other complex disorders such as anxiety, mental retardation and diabetes [24, 25]; and several cancers, such as colorectal [26, 27] and prostate cancer [28–30]. In the regulatory region, polymorphic STRs in the FLI1, ECE-1c and CD30 gene promoters have been associated with lupus, Alzheimer’s disease and primary cutaneous lymphoproliferative disorders, respectively.
While there is mounting evidence that STRs are an important class of genetic variation with links to disease phenotypes and evolution of the human genome, their use in genetic studies has reduced with the advent of massively parallel single nucleotide polymorphism (SNP) analysis and genome-wide association studies (GWAS) [34, 35]. Compared with SNPs, STRs show extremely rapid evolution, indicative of increased variability between individual sub-populations. The observed enrichment of STRs in genic and regulatory regions  also suggests potentially larger phenotype effects than many common SNPs. Hypervariable STRs in regulatory regions may explain some of the missing heritability unaccounted for by GWAS of complex disease [13, 36, 37]. From a human genetics perspective, this untapped source of regulatory STR variation could be important and also complementary to GWAS studies. Increasing interest over the past decade in the noncoding regions of the human genome, which has been described as “the control architecture of the system” , further highlights the important role that variation in these regions plays. Considering the influential role of STRs in regulating gene expression, the importance of this source of genetic variation has been over-looked.
There is currently no catalogue or easy-to-use resource available for studying STRs in the regulatory regions of human genes. This study aimed to identify, characterise and compare STRs in the upstream regulatory region of human genes on a genome-wide scale and establish a resource to allow the interrogation of STRs in this region. By screening the entire human genome, using Tandem Repeat Finder, SQL code and the UCSC Genome Browser, for STRs present in a 3 kilobase region at the 5’-end of all human genes, we have identified 5,264 STRs across 4,441 genes. The information describing the location and characteristics of these STRs is presented in the Short Tandem Repeats in Regulatory Regions Table, or STaRRRT (available at http://www.newcastleinnovationhealth.com.au/STaRRRT). This resource is suitable for researchers with limited bioinformatics experience who are interested in specific STRs, genes or phenotypes. We have identified a unique signature of STR enrichment in the regulatory regions of human genes which is most pronounced within neural genes, and calcium signaling and neurological pathways. This paper presents the findings from investigations of the distribution and abundance of STRs in the 5’ regulatory region of human genes, highlighting the importance of STRs in neurological pathways and in recent evolution of the human genome.
STaRRRT is a comprehensive, user-friendly resource with wide application
To increase the utility of STaRRRT, the resource is restricted to short tandem repeats (STRs), due to their abundance, polymorphic nature and frequent use as genetic markers. To increase the chance that variable STRs are predominantly represented in STaRRRT, we have restricted the purity to greater than or equal to 90%. We define an STR, also known as a microsatellite, as a TR with a period of 1 to 9 bp. Tandem repeats were identified from the UCSC ‘simpleRepeats’ table, which contains output from the Tandem Repeat Finder (TRF) program. TRF uses distribution theory to detect TRs and also uses a minimum alignment score, with smaller period TRs requiring higher numbers of repeats to qualify. The ‘simpleRepeats’ table does not explicitly specify the TRF input parameters (minimum score, scoring weights and mismatch penalties), nor the matching probability (PM) or indel probability (PI). We determined some of these parameters empirically. Within the table the minimum reported score was found to be 50, and dividing this by the product of the period and the number of repeats shows that the scoring weight must be set to 2. This implies that the minimum reported STR size is 25 bp in length.
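The arithmetic behind the 25 bp minimum can be checked with a short sketch. It assumes, as inferred in the text, a weight of 2 per matching base and a perfect (mismatch-free) repeat tract; the function names are illustrative only and are not part of TRF or the UCSC table.

```python
import math

MIN_SCORE = 50     # minimum alignment score observed in the 'simpleRepeats' table
MATCH_WEIGHT = 2   # score per matching base, as inferred in the text (assumption)

def min_perfect_length(min_score=MIN_SCORE, match_weight=MATCH_WEIGHT):
    """Shortest perfect repeat tract (bp) whose score reaches min_score."""
    return math.ceil(min_score / match_weight)

def min_copies(period, min_score=MIN_SCORE, match_weight=MATCH_WEIGHT):
    """Fewest copies (possibly fractional) of a repeat unit of `period` bp."""
    return min_perfect_length(min_score, match_weight) / period

# Shorter repeat units need more copies to reach the same minimum score.
for period in range(1, 10):
    print(period, round(min_copies(period), 2))
```

Under these assumptions a mononucleotide repeat needs 25 copies while a pentanucleotide repeat needs 5, which is consistent with smaller period TRs requiring more repeats to qualify.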
Details provided in STaRRRT (description of each field):
- Chromosome number on which STR is located
- Start position on chromosome of the gene
- End position on chromosome of the gene
- Coding sequence start
- Coding sequence end
- Strand on which the gene occurs
- KnownGene database identifier
- RefSeq database identifier
- Ensembl database identifier
- GenBank transcript accession number
- HGNC gene symbol
- Affymetrix GeneChip array identifier
- Affymetrix GeneChip Plus2.0 array identifier
- Type of gene (coding or noncoding)
- Position in relation to the TSS
- Start position on chromosome for the STR
- End position on chromosome for the STR
- Length of the repeat unit in the STR
- Number of copies of the repeat unit
- Total length of the STR
- Number of bases in the consensus sequence
- % match of STR to consensus sequence; purity
- Percent insertions and/or deletions in the STR
- Alignment score (minimum = 50)
- Percent of A's (adenine) in the repeat unit
- Percent of C's (cytosine) in the repeat unit
- Percent of G's (guanine) in the repeat unit
- Percent of T's (thymine) in the repeat unit
- Consensus sequence of the repeat unit; motif
Sample of the resource STaRRRT
Downstream of the TSS, STaRRRT STRs may be located within the 5’-UTR or the coding region. We note 15,029 transcripts of the 41,007 (non-haplotype or unplaced contig) transcripts present in RefSeq (release 56 database) have 5’-UTR regions that will go beyond the 1 kb downstream limit of this resource (Figure 1); hence, STaRRRT is not comprehensive for all STRs in 5’-UTRs. Similarly, for the 25,978 transcripts with a 5’-UTR shorter than 1 kb, an STR (or STRs) presented in STaRRRT may be present in the coding region. The position of the STR within the upstream region, 5’-UTR or coding region can be calculated by comparing the srStart:srEnd coordinates with the chromStart:chromEnd (transcription start and end) and cdsStart:cdsEnd (coding sequence start and end) coordinates.
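The coordinate comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes UCSC-style 0-based, half-open coordinates, treats a transcript with cdsStart equal to cdsEnd as noncoding, and ignores introns, so the "5UTR" and "coding" labels refer to genomic spans only. The function name and the region labels are hypothetical.

```python
def classify_str(sr_start, sr_end, chrom_start, chrom_end,
                 cds_start, cds_end, strand):
    """Return the set of regions that an STR overlaps.

    Assumes 0-based, half-open coordinates; introns are not considered.
    """
    is_coding = cds_start < cds_end
    if strand == "+":
        upstream = (float("-inf"), chrom_start)   # TSS is chromStart
        utr5 = (chrom_start, cds_start)
    else:
        upstream = (chrom_end, float("inf"))      # TSS is chromEnd
        utr5 = (cds_end, chrom_end)

    if is_coding:
        spans = {"upstream": upstream, "5UTR": utr5,
                 "coding": (cds_start, cds_end)}
    else:
        spans = {"upstream": upstream,
                 "noncoding": (chrom_start, chrom_end)}

    hits = set()
    for name, (lo, hi) in spans.items():
        # Non-empty span that shares at least one base with the STR.
        if lo < hi and sr_start < hi and sr_end > lo:
            hits.add(name)
    return hits
```

An STR that straddles a boundary is reported under both labels, and an empty set indicates that the STR lies outside the spans considered here (for example, beyond the coding sequence end).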
General characteristics of STaRRRT STRs relative to genic or all STRs
Of the 41,007 (non-haplotype or unplaced contig) transcripts present in RefSeq (release 56 database), 4,448 gene transcripts (within 4,441 unique gene loci) were found to contain at least one STR with purity of at least 90% in the 3 kb regulatory region analysed (Figure 1), corresponding to 18.8% of all genes in the human genome.
We note the more than 2-fold increase in the frequency of STaRRRT STRs (relative to all STRs) with period of 3. This is likely due to the regulatory region encompassing the proximal promoter and including some exon regions downstream of the TSS. Compared to all categories other than exons, the number of period 3 STRs in proximal promoters is increased more than 4-fold. More broadly, the distribution of STRs in proximal promoters with a period that is a multiple of 3 (periods 3, 6 and 9) is very similar to that in exons (Figure 2). This increase is offset by the relative decrease in frequency of STRs with period 1 and 2.
Distribution of STaRRRT STRs shows distinct trends at the TSS and in the proximal and core promoters
When the repeats in Figure 3A are decomposed into subpopulations classified by repeat period, a number of trends emerge (Figure 3B). The most striking observation is the increased density of repeats with period of 3 (trinucleotides; shown in green) in the region approximately 300 bases upstream and downstream of the TSS and the predominance of repeats with period of 2 (dinucleotides; shown in blue) in the region +300 to +1000, downstream of the TSS. Upstream of the TSS, peaks and troughs in repeat density are present with some regularity; in particular, the density of STRs with periods of 2, 4 and 5. Using waves as an analogy, in the region −2000 bases to approximately −800 bases, relative to the TSS, the densities of STRs with periods of 2 and 4 are in phase before becoming anti-phased from −800 bases until approximately −200 bases, relative to the TSS. This change in phase coincides with an increase in the abundance of STRs with period of 5.
The base composition and repeat unit length of STRs in the regulatory region also have distinct patterns. For the most part, repeats are AT-rich; however, there is a profound change towards GC-rich repeats with fewer repeat units in the region surrounding the TSS (Figure 3C and D). This region of change correlates strongly with the large increase in period 3 (trinucleotide) repeats noted earlier (Figure 3B). These GC-rich, relatively low repeat unit trinucleotide repeats overlap with the proximal promoter, defined as 250 bp upstream to 250 bp downstream of the TSS (−250, +250), and more specifically with the core promoter, which we define here as 60 bp upstream to 40 bp downstream of the TSS (−60, +40). A further decomposition of the data in Figure 3 into 3,479 CpG island overlapping and 1,785 non-CpG island overlapping regulatory regions shows the TSS proximal GC-rich, trinucleotide repeats are situated particularly in CpG island containing regulatory regions (Additional file 1: Figure S1). Interestingly, the smaller set of regulatory regions without an overlapping CpG island seems to exclude STRs in the region just before the TSS until approximately 100 bp downstream (Additional file 1: Figure S2). They also exhibit a periodic and anti-phased increase and decrease in adenine and thymine base composition.
STaRRRT STRs are found in genes involved in metabolism, signal transduction and the neurological system
KEGG pathway results from HEAT analysis grouped by pathway class
Glycine, serine and threonine metabolism
Inositol phosphate metabolism
Glycan structures - biosynthesis 1
Glycan structures - degradation
Development/Cell growth and death
Dorso-ventral axis formation
Signal transduction/Environmental information processing/Cell communication/Cell motility
Calcium signaling pathway
Phosphatidylinositol signaling system
Wnt signaling pathway
VEGF signaling pathway
Jak-STAT signaling pathway
Regulation of actin cytoskeleton
Hematopoietic cell lineage
T cell receptor signaling pathway
B cell receptor signaling pathway
Leukocyte transendothelial migration
Insulin signaling pathway
Adipocytokine signaling pathway
Type II diabetes mellitus
Epithelial cell sig. in H. pylori infection
Tissue-specific expression results from HEAT analysis
Top Bio Functions
Diseases and disorders
1.27E-04 - 4.94E-02
1.81E-04 - 4.94E-02
9.19E-04 - 4.24E-02
1.38E-03 - 2.00E-02
2.25E-03 - 4.17E-02
Molecular and cellular functions
3.39E-04 - 4.81E-02
Cell death and survival
6.18E-04 - 4.83E-02
Cell-to-cell signaling and interaction
1.07E-03 - 4.81E-02
1.17E-03 - 4.37E-02
Cellular growth and proliferation
1.47E-03 - 4.81E-02
Physiological system development and functions
Cardiovascular system development and function
7.56E-06 - 4.70E-02
3.20E-05 - 4.37E-02
Humoral immune response
1.38E-03 - 4.81E-02
Reproductive system development and function
1.47E-03 - 4.17E-02
Hematological system development and function
1.74E-03 - 4.81E-02
Top 20 canonical pathways
Pyridoxal 5'-phosphate salvage pathway
Reelin signaling in neurons
Neuropathic pain signaling in dorsal horn neurons
Cellular effects of sildenafil (Viagra)
Factors promoting cardiogenesis in vertebrates
Synaptic long-term depression
B cell receptor signaling
Dopamine-DARPP32 feedback in cAMP signaling
D-myo-inositol (1,4,5)-triphosphate biosynthesis
NF-κB activation by viruses
Xenobiotic metabolism signaling
Antioxidant action of vitamin C
Maturity onset diabetes of young (MODY) signaling
Collectively, the GSEA results show that genes with STRs in the regulatory region or exons, or those genes with high intronic STR density, have enrichments for largely the same classes of gene pathways. These pathways are primarily associated with metabolism, signal transduction, environmental information processing, development, cell growth, death, motility and communication and immune, nervous and endocrine system function. There are some differences between the STaRRRT, exonic and high-density intronic gene sets in KEGG pathways. Broadly, STaRRRT genes have more numerous enrichments and are particularly enriched for calcium signaling.
By genome-wide analysis, this study has identified that 18.8% of all human genes contain at least one highly pure STR in their upstream regulatory region. This is consistent with the previous suggestion that TRs of all period lengths are present within promoter regions of 10 to 20% of human genes . The upstream promoter region appears to consist of predominantly short (mostly with repeat period of 1 and 2), AT-rich sequences, which is concordant with the findings of Vinces et al. in the yeast genome and Sawaya et al. in human promoters. We demonstrate that in humans, the proximal promoter (−250, +250) and in particular the region overlapping the typical core promoter region (−60, +40) have GC-rich STRs. As approximately 72% of human promoters have high GC-content [47, 48] with CpG island density reaching a maximum near the TSS , we reason this increase in STR GC-content reflects the underlying GC-rich promoter sequence.
Consistent with a previous genome-wide survey of all STRs , period 2 STRs (dinucleotides) are the most abundant STRs in the regulatory region across human genes. Likewise, the distribution of STaRRRT STRs across repeat periods is very similar to that reported by Gemayel et al. (2010) for the distribution of all TRs in noncoding regions across the human genome . However, similar to coding regions, we find a striking enrichment of trinucleotide repeats (period 3 STRs) in the proximal promoter region, both upstream and downstream of the TSS (Figure 3B). The similarity of this enrichment signature in regulatory regions to that observed in coding regions  is a significant and novel finding, and adds weight to the likely functional significance of these results.
STRs in coding regions almost exclusively have a repeat period which is a multiple of 3 bases; this is thought to be due to the nature of triplet codons and selection against frameshift mutations. While the region upstream of the TSS is not transcribed, the abundance of trinucleotide repeats suggests a selection pressure of similar magnitude to that observed in coding regions [3, 50]. Possible explanations include alternative translation start sites or other functional constraints, possibly related to chromatin structure, nucleosome positioning and/or transcription factor activity. We note that high abundance TSS proximal GC-rich repeats and trinucleotide repeats are only associated with regulatory regions overlapping CpG islands. Interestingly, the smaller non-CpG island overlapping group is composed of mostly dinucleotide repeats and, in the region approximately −500 to 500 bp around the TSS, the repeats have a regular wavelike increase and decrease in adenine and thymine abundance. We speculate this pattern may be associated with nucleosome positioning.
Broadly, we suggest that the distribution of STRs around the promoter has functional significance, as also proposed recently by Sawaya et al. following their discovery of a high density of STRs at the TSS and by Kozlowski et al., who found a non-random distribution of trinucleotide repeats in the exome. Altered TR length in or near core promoters can change local nucleosome positioning and is likely to hinder transcription factor binding, thereby affecting rates of transcription and hence gene expression [51, 52]. It has been shown that changes as small as 2 bp in nucleosome positioning can alter promoter activity. Moreover, it has been shown in yeast that nucleosome position is negatively correlated with the positioning of TRs. Hence, our findings of profound changes in STR period, repeat unit number and base composition around the TSS of human genes are interesting given the findings in yeast and indicate that similar mechanisms of regulating gene expression may be at play in the human genome. In this regard, a recent study has shown that a polymorphic GA-repeat in the human SOX5 gene promoter can affect gene expression, with the longer allele resulting in a 2.7-fold increase in activity. The authors report this as the first evidence of a functional STR in a human gene core promoter.
Controlled vocabulary gene set enrichment analysis of gene transcripts with STaRRRT STRs in the regulatory region found a number of significantly enriched KEGG pathways, GO terms and tissues enriched for expression of these genes. These findings have broad overlap with gene set enrichment of gene transcripts having STRs in the exons and those gene transcripts with a high density of STRs in the intronic regions. Regulatory region, exon and intron analyses all show enrichment for expression in neural tissue. Enrichment of neurological genes and pathways in the STaRRRT analysis is consistent with the known role of TRs in neurodegenerative and neurodevelopmental disorders . Several neurological diseases known to be caused by variable TRs also appeared in the STaRRRT IPA results, namely Huntington’s disease and neuromuscular disease, as well as major depression which has a known association with a variable TR . STaRRRT can be used to analyse the role STRs may play in the development of various diseases, such as neurological disorders and cancer in which they have already been implicated. This could potentially lead to the identification of targets for diagnosing and treating diseases.
While the STaRRRT, exonic and intronic gene set enrichment results show a very high degree of overlap, we also note some differences between the enrichment signatures. The calcium signaling pathway was the most enriched KEGG pathway for STaRRRT STRs but is only mildly enriched in the exonic and intronic gene sets. In particular, STRs were significantly enriched in the regulatory region of genes involved in the calcium signaling pathway (KEGG), calcium ion binding (GO Molecular Function) and ion transport and activity (GO Biological Process and Function, respectively, which includes calcium transporters). Intracellular calcium signaling regulates a plethora of cellular processes including apoptosis, gene transcription, proliferation, cell cycle progression and differentiation . Disruption is associated with a number of diseases such as Alzheimer’s disease, diabetes, skin disorders, cardiac disease and cancer . Previous studies have shown STRs can impact calcium signaling with the identification of an expansion in the CAG repeat in exon 1 of isoforms ‘a’ and ‘c’ of KCNN3 and the 5’-UTR of isoform ‘b’ of KCNN3, which encodes a calcium activated potassium channel [45, 57]. The expanded variant of KCNN3 has been reported to reduce channel conductance and is associated with better cognitive performance of individuals with schizophrenia . An enriched presence of STRs in the regulatory region of the calcium signaling machinery has not previously been reported and may have significant consequences for protein expression and function and consequently disease. Further, the second most enriched KEGG pathway, vascular endothelial growth factor (VEGF) signaling, is associated with vasculogenesis and angiogenesis. We note that only STaRRRT genes were enriched for expression in skeletal and cardiac muscle and in the IPA analysis, cardiovascular system development and function was listed as the most enriched physiological system (Table 5).
The GSEA findings are consistent with mechanisms of human evolution. Due to their inherent instability, the presence of variable STRs in regulatory regions may act as a flexible switch to allow ready adaptation through positive selection with implications for human evolution and disease. The enrichment of neural processes and pathways is concordant with the involvement of TRs in the evolution of cognition and behaviour , supporting the idea of Legendre et al. (2007) that repeats may play a role in the swift evolution of the primate brain. The over-representation of STaRRRT genes involved in transcriptional regulation (Additional file 1: Table S1) further supports a role for STRs in evolutionary mechanisms, given the suggested role for polymorphic TRs in modifying transcription and leading to rapid evolutionary changes [59, 60]. Haygood et al. (2007) surveyed base substitution rates in human genomic regions upstream of the TSS and compared these with neighbouring intronic sequence and also substitution rates in chimpanzees. High rates of base substitution (compared to intronic rates) in human, but not chimpanzee promoters, were observed in genes involved in neuronal function, development, glycolysis and carbohydrate metabolism, protein folding, vision, oncogenesis and anion transport . This list of enriched biological processes shows much resemblance with the current study. Therefore, we hypothesise that the set of enriched STaRRRT STRs is reflective of general positive selection in human promoter regions since our divergence from chimpanzees.
The importance of STRs has been recognised due to their abundance in the human genome, high mutation rates, and relevance to disease phenotypes and evolutionary processes. As technologies improve and analysis of repetitive sequences becomes simpler and more cost effective, resources such as STaRRRT will become more valuable and commonly utilised in biological studies. Further applications for the use of STRs include the study of how environmental factors (such as radiation or toxic compounds) affect genomic mutation rate , which would rely upon a thorough understanding of the baseline mutation rates and other characteristics of STRs in the human genome.
STaRRRT acts as a starting point for researchers interested in looking at the role of STRs in promoter regions throughout the human genome. It is publicly available and can be accessed at http://www.newcastleinnovationhealth.com.au/STaRRRT. This resource is suitable for researchers with limited bioinformatics experience who are interested in specific STRs, genes or phenotypes. Multiple database identifiers are available in STaRRRT, including Affymetrix array probeset identifiers, which allow legacy gene expression data to be easily mapped to this table.
This paper presents the findings from investigations of the distribution and abundance of STRs in the 5’ regulatory region of human genes. We have identified a unique signature of STR enrichment in this regulatory region which is most pronounced within neural genes, and calcium signaling and neurological pathways. This functional signature of STR enrichment in the regulatory regions of genes is similar to that previously identified in coding regions, suggesting that regulatory region STRs are subject to similar evolutionary pressures and may have an important role in gene expression. Hence, this study has identified STRs likely to be involved in the expression of genes associated with particular disease phenotypes and recent evolution of the human genome.
The STaRRRT resource was constructed in a series of nested table joins in a MySQL database (SQL commands provided in Additional file 2). The tables, in hg19 build coordinates, were downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/index.html). The genome-wide table of tandem repeats identified by the Tandem Repeat Finder program was reduced to the set of highly pure STRs by filtering for TRs with a period (repeat unit length) less than or equal to 9 bp and repeat purity of at least 90%. The analysis was then further restricted to those STRs proximal to the transcription start site (TSS) of genic loci with a RefSeq identifier. In instances where genic loci had more than one RefSeq transcript, the canonical transcript as defined by UCSC was used. For each canonical TSS, we restricted analyses to a span around the TSS rather than including all of the 5’-UTR. This is because approximately 11% of RefGene curated transcripts, in particular transcribed pseudogenes and noncoding genes, do not have a defined 5’-UTR. The 5’-UTR is also highly variable in size; while most genes have a short 5’-UTR (median length of 292 bp and mean of 9885 bp), some genes have particularly long 5’-UTRs, for example, the transcript (NM_002839) of PTPRD has a 5’-UTR length of around 1.88 Mb.
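The period and purity filter described above can be illustrated with a short sketch. The authors' actual implementation is the SQL in Additional file 2; the record type and field names below are hypothetical stand-ins for the corresponding tandem repeat table columns, and the example records are invented for illustration.

```python
from typing import NamedTuple

class TandemRepeat(NamedTuple):
    chrom: str
    start: int
    end: int
    period: int       # repeat unit length in bp
    per_match: float  # percent match to the consensus sequence ("purity")

def select_pure_strs(repeats, max_period=9, min_purity=90.0):
    """Keep tandem repeats that qualify as highly pure STRs."""
    return [r for r in repeats
            if 1 <= r.period <= max_period and r.per_match >= min_purity]

# Invented example records, not real genome annotations.
example = [
    TandemRepeat("chr1", 100, 140, 2, 95.0),   # kept
    TandemRepeat("chr1", 500, 560, 3, 85.0),   # purity too low
    TandemRepeat("chr2", 900, 990, 15, 99.0),  # period too long (not an STR)
]
print(select_pure_strs(example))
```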
For each STR in STaRRRT, containment within the regulatory region was defined as having start and end sites contained within the region 2 kb upstream of a TSS to 1 kb downstream (Figure 1). These STRs were given a relative coordinate with respect to the TSS (TxPos), defined as the number of nucleotides upstream or downstream from the STR start coordinate to the TSS. We also joined other identifiers (IDs) to this table, such as KnownGene and Ensembl database IDs, NCBI RefSeq and GenBank accession numbers, HGNC gene symbols and Affymetrix array probeset IDs, so legacy gene expression data can easily be mapped to this table. The final Short Tandem Repeats in Regulatory Regions table (STaRRRT) is a list of all the highly pure STRs present in the 3 kb regulatory region at the 5’-end of all human genes. This table was exported from MySQL into R (v2.15.0) and converted into an Excel spreadsheet. The SQL code used to construct the table is provided in Additional file 2.
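The containment test and the TSS-relative coordinate can be sketched as below. This is an illustration under stated assumptions rather than the authors' SQL: the sign convention (negative for upstream), the use of the STR start coordinate on both strands, and the exact boundary handling are assumptions, and the function names are hypothetical.

```python
UPSTREAM = 2000    # bp upstream of the TSS included in the regulatory region
DOWNSTREAM = 1000  # bp downstream of the TSS included in the regulatory region

def tss_of(chrom_start, chrom_end, strand):
    """TSS is the transcript start on '+' and the transcript end on '-'."""
    return chrom_start if strand == "+" else chrom_end

def tx_pos(sr_start, tss, strand):
    """Signed offset of the STR start from the TSS; negative means upstream."""
    offset = sr_start - tss
    return offset if strand == "+" else -offset

def in_regulatory_region(sr_start, sr_end, tss, strand):
    """True if the whole STR lies within 2 kb upstream to 1 kb downstream."""
    if strand == "+":
        lo, hi = tss - UPSTREAM, tss + DOWNSTREAM
    else:
        # On the minus strand, upstream lies at higher genomic coordinates.
        lo, hi = tss - DOWNSTREAM, tss + UPSTREAM
    return lo <= sr_start and sr_end <= hi
```

For example, an STR starting 500 bp before a plus-strand TSS and one starting 500 bp beyond a minus-strand TSS both receive a relative coordinate of −500 under this convention.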
Analysis of density of STRs and base composition in relation to the TSS
Using the functionality of the ‘GenomicRanges’ R library, we calculated from all STRs in the genome the subsets that are located within exons, introns or 5’-UTRs and those STRs located upstream (−2,000 to −1 bp), in the proximal promoter (−250 to +250 bp) or regulatory region (−2,000 to +1,000 bp), relative to the TSS. An STR qualified as being located within an entity if some portion of it overlapped.
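The any-overlap rule can be expressed compactly. The sketch below is a plain Python stand-in for the interval overlap queries that the ‘GenomicRanges’ library provides in R; it assumes half-open intervals on a single chromosome, uses hypothetical function names, and runs in quadratic time, so it is suitable only for illustration.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def strs_in_features(strs, features):
    """Return the STRs that overlap any feature (e.g. exons or introns).

    Both arguments are lists of (start, end) tuples on the same chromosome.
    """
    return [s for s in strs
            if any(overlaps(s[0], s[1], f[0], f[1]) for f in features)]
```

Because partial overlap is sufficient, a single STR that spans a boundary can be counted in more than one category.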
To calculate STR density, the start and end coordinates of each STR (relative to the TSS) were used to generate a sum of STRs at each base position across the regulatory region. The sums were used to form a density per base, and these densities were smoothed using LOWESS local regression. Similarly, the base composition and repeat unit lengths were calculated for each base position across the regulatory region and smoothed using local regression. For further detail, consult the R scripts or the HTML-based report in Additional file 2.
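The per-base counting step can be illustrated with a short Python sketch. It is not the R code in Additional file 2. Two assumptions are made: the density is obtained by dividing each per-base count by the total of all counts, and a centred moving average stands in for LOWESS purely to keep the example free of dependencies. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def per_base_counts(
    intervals: Sequence[Tuple[int, int]], low: int = -2000, high: int = 1000
) -> List[int]:
    """Number of STRs covering each position from low to high (inclusive)."""
    width = high - low + 1
    diff = [0] * (width + 1)            # difference array, one spare slot
    for start, end in intervals:
        s, e = max(start, low), min(end, high)
        if s > e:                       # STR lies entirely outside the window
            continue
        diff[s - low] += 1
        diff[e - low + 1] -= 1
    counts, running = [], 0
    for delta in diff[:width]:
        running += delta
        counts.append(running)
    return counts


def to_density(counts: Sequence[int]) -> List[float]:
    """Scale counts so that they sum to 1 (assumed normalisation)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def moving_average(values: Sequence[float], window: int = 5) -> List[float]:
    """Centred moving average; a simple stand-in for LOWESS smoothing."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        segment = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed
```

The difference array keeps the counting step linear in the number of STRs plus the window width, which is convenient when the same routine is reused for other per-base summaries.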
Gene set enrichment analysis
Two gene set enrichment analysis approaches, the H-InvDB Enrichment Analysis Tool (HEAT; http://h-invitational.jp/HEAT/search.do) and Ingenuity Pathway Analysis (IPA; Ingenuity® Systems; http://www.ingenuity.com), were used to functionally characterise the list of genes from STaRRRT. For the HEAT analysis, KnownGene IDs within STaRRRT were mapped to HIT IDs (identifiers of an RNA transcript from the H-InvDB database) using the UCSC ‘knownToHInv’ table. Additional STR tables were prepared by filtering all STRs in the genome to those within exons and those within introns. Given the high number of transcripts with at least one STR in an intron, this set had to be reduced for GSEA. We created two sets: transcripts with a high density of STRs in introns, and randomly sampled transcripts with STRs in introns. For the high-density set, we retained only STRs with a purity ≥ 90% and only genes in the highest quartile of STR density within the intronic region (one high purity STR per 7.32 kb of intron). The density was calculated by summing the total intron width per gene and dividing this by the total number of STRs present in the introns of that gene. For the random sampling approach, ten HIT ID sets, each the same size as the STaRRRT set (3,258), were sampled from the 9,299 HIT IDs in the complete high purity intron set.
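The two reductions of the intron set can be sketched in Python as below. The sketch is illustrative only. It assumes that the "highest quartile of STR density" corresponds to the quarter of genes with the fewest intronic base pairs per STR, and that each random set is drawn without replacement and independently of the others. The function names and the seed are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def bp_per_str(total_intron_bp: int, n_strs: int) -> float:
    """Intronic base pairs per STR, as described in the text (lower is denser)."""
    return total_intron_bp / n_strs


def densest_quartile(genes: Dict[str, Tuple[int, int]]) -> List[str]:
    """Genes in the densest quarter, ranked by intron bp per STR.

    genes maps a gene name to (total intron width in bp, number of intronic STRs);
    every gene is assumed to have at least one intronic STR.
    """
    ranked = sorted(genes, key=lambda g: bp_per_str(*genes[g]))
    return ranked[: max(1, len(ranked) // 4)]


def sample_id_sets(
    ids: Sequence[str], set_size: int, n_sets: int = 10, seed: int = 0
) -> List[List[str]]:
    """Draw n_sets random ID sets, each sampled without replacement."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    return [rng.sample(list(ids), set_size) for _ in range(n_sets)]
```

With the figures given in the text, the sampling step would correspond to a call such as `sample_id_sets(hit_ids, 3258)` on the list of 9,299 HIT IDs, where `hit_ids` is a placeholder name.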
All sets were subjected to HEAT analysis. The returned tables were imported into R and processed, and the p-values were corrected for multiple testing using a false discovery rate (FDR) correction from the Bioconductor ‘multtest’ library, based upon the number of tests performed. An R script in Additional file 2 documents all the processing steps.
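As an illustration of an FDR correction, the following Python sketch implements the Benjamini-Hochberg step-up adjustment. The text does not state which FDR procedure from ‘multtest’ was applied, so the choice of Benjamini-Hochberg here is an assumption, and the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each adjusted value is the raw p-value scaled by the number of tests divided by its rank, capped at 1 and made monotone so that a smaller raw p-value never receives a larger adjusted value.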
For the IPA analysis, the list of 4,448 RefSeq gene transcript IDs was uploaded and, when compared against the reference set Ingenuity Knowledge Base (Genes Only), a list of 4,377 “analysis-ready molecules across observations” was created. A Core Analysis was run and the output included enrichment in the categories “Top Bio Functions” (including Diseases and Disorders, Molecular and Cellular Functions, and Physiological System Development and Function) and “Top Canonical Pathways”.
Abbreviations
FDR: False discovery rate
GSEA: Gene set enrichment analysis
GWAS: Genome-wide association study
HEAT: H-InvDB Enrichment Analysis Tool
IPA: Ingenuity Pathway Analysis
KCNN3: Potassium intermediate/small conductance calcium-activated channel, subfamily N, member 3 gene
KEGG: Kyoto Encyclopedia of Genes and Genomes
RefSeq: NCBI Reference Sequence
SNP: Single nucleotide polymorphism
SQL: Structured Query Language
SSR: Simple sequence repeat
STR: Short tandem repeat
STaRRRT: Short Tandem Repeats in Regulatory Regions Table
TSS: Transcription start site
UCSC: University of California, Santa Cruz
VEGF: Vascular endothelial growth factor
VNTR: Variable number of tandem repeats
This work was supported by an Australian Postgraduate Award (KAB), the CSIRO Preventative Health National Research Flagship and CSIRO Transformational Biology Capability Platform (JPR; DMG), a Hunter Translational Cancer Research Unit Fellowship from the Cancer Institute of NSW (KAAK), and NHMRC Training (post-doctoral) Fellowships (EGH; NAB). We thank Peter Molloy and Nicholas Archer for critically reviewing this manuscript.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409 (6822): 860-921. 10.1038/35057062.View ArticlePubMedGoogle Scholar
- Sawaya S, Bagshaw A, Buschiazzo E, Kumar P, Chowdhury S, Black MA, Gemmell N: Microsatellite tandem repeats are abundant in human promoters and are associated with regulatory elements. PLoS One. 2013, 8 (2): e54710-10.1371/journal.pone.0054710.PubMed CentralView ArticlePubMedGoogle Scholar
- Kozlowski P, de Mezer M, Krzyzosiak WJ: Trinucleotide repeats in human genome and exome. Nucleic Acids Res. 2010, 38 (12): 4027-4039. 10.1093/nar/gkq127.PubMed CentralView ArticlePubMedGoogle Scholar
- Gemayel R, Vinces MD, Legendre M, Verstrepen KJ: Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annu Rev Genet. 2010, 44: 445-477. 10.1146/annurev-genet-072610-155046.View ArticlePubMedGoogle Scholar
- Vinces MD, Legendre M, Caldara M, Hagihara M, Verstrepen KJ: Unstable tandem repeats in promoters confer transcriptional evolvability. Science. 2009, 324: 1213-1216. 10.1126/science.1170097.PubMed CentralView ArticlePubMedGoogle Scholar
- Ohadi M, Mohammadparast S, Darvish H: Evolutionary trend of exceptionally long human core promoter short tandem repeats. Gene. 2012, 507 (1): 61-67. 10.1016/j.gene.2012.07.001.View ArticlePubMedGoogle Scholar
- Ellegren H: Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 2004, 5 (6): 435-445.View ArticlePubMedGoogle Scholar
- Wells RD, Dere R, Hebert ML, Napierala M, Son LS: Advances in mechanisms of genetic instability related to hereditary neurological diseases. Nucleic Acids Res. 2005, 33 (12): 3785-3798. 10.1093/nar/gki697.PubMed CentralView ArticlePubMedGoogle Scholar
- Debrauwere H, Buard J, Tessier J, Aubert D, Vergnaud G, Nicolas A: Meiotic instability of human minisatellite CEB1 in yeast requires DNA double-strand breaks. Nat Genet. 1999, 23 (3): 367-371. 10.1038/15557.View ArticlePubMedGoogle Scholar
- Brinkmann B, Klintschar M, Neuhuber F, Huhne J, Rolf B: Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am J Hum Genet. 1998, 62 (6): 1408-1415. 10.1086/301869.PubMed CentralView ArticlePubMedGoogle Scholar
- Weber JL, Wong C: Mutation of human short tandem repeats. Hum Mol Genet. 1993, 2 (8): 1123-1128. 10.1093/hmg/2.8.1123.View ArticlePubMedGoogle Scholar
- Verstrepen KJ, Jansen A, Lewitter F, Fink GR: Intragenic tandem repeats generate functional variability. Nat Genet. 2005, 37 (9): 986-990. 10.1038/ng1618.PubMed CentralView ArticlePubMedGoogle Scholar
- Legendre M, Pochet N, Pak T, Verstrepen KJ: Sequence-based estimation of minisatellite and microsatellite repeat variability. Genome Res. 2007, 17 (12): 1787-1796. 10.1101/gr.6554007.PubMed CentralView ArticlePubMedGoogle Scholar
- Naslund K, Saetre P, von Salome J, Bergstrom TF, Jareborg N, Jazin E: Genome-wide prediction of human VNTRs. Genomics. 2005, 85 (1): 24-35. 10.1016/j.ygeno.2004.10.009.View ArticlePubMedGoogle Scholar
- Payseur BA, Jing P, Haasl RJ: A genomic portrait of human microsatellite variation. Mol Biol Evol. 2011, 28 (1): 303-312. 10.1093/molbev/msq198.PubMed CentralView ArticlePubMedGoogle Scholar
- Sawaya SM, Lennon D, Buschiazzo E, Gemmell N, Minin VN: Measuring microsatellite conservation in mammalian evolution with a phylogenetic birth-death model. Genome Biol Evol. 2012, 4 (6): 636-647.View ArticlePubMedGoogle Scholar
- Jansen A, Verstrepen KJ: Nucleosome positioning in Saccharomyces cerevisiae. Microbiol Mol Biol Rev. 2011, 75 (2): 301-320. 10.1128/MMBR.00046-10.PubMed CentralView ArticlePubMedGoogle Scholar
- Schroth GP, Chou PJ, Ho PS: Mapping Z-DNA in the human genome. Computer-aided mapping reveals a nonrandom distribution of potential Z-DNA-forming sequences in human genes. J Biol Chem. 1992, 267 (17): 11846-11855.PubMedGoogle Scholar
- Sawaya SM, Bagshaw AT, Buschiazzo E, Gemmel NJ: Promoter Microsatellites as Modulators of Human Gene Expression. Tandem Repeat Polymorphisms: Genetic Plasticity, Neural Diversity and Disease. Edited by: Hannan AJ. 2012, Austin, Texas, USA: Landes BioscienceGoogle Scholar
- La Spada AR, Wilson EM, Lubahn DB, Harding AE, Fischbeck KH: Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy. Nature. 1991, 352 (6330): 77-79. 10.1038/352077a0.View ArticlePubMedGoogle Scholar
- Huntington's Disease Collaborative Research Group T: A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell. 1993, 72 (6): 971-983. 10.1016/0092-8674(93)90585-E.View ArticleGoogle Scholar
- Lesch KP, Bengel D, Heils A, Sabol SZ, Greenberg BD, Petri S, Benjamin J, Muller CR, Hamer DH, Murphy DL: Association of anxiety-related traits with a polymorphism in the serotonin transporter gene regulatory region. Science. 1996, 274 (5292): 1527-1531. 10.1126/science.274.5292.1527.View ArticlePubMedGoogle Scholar
- Verkerk AJ, Pieretti M, Sutcliffe JS, Fu YH, Kuhl DP, Pizzuti A, Reiner O, Richards S, Victoria MF, Zhang FP, et al: Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell. 1991, 65 (5): 905-914. 10.1016/0092-8674(91)90397-H.View ArticlePubMedGoogle Scholar
- Chen YH, Lin SJ, Lin MW, Tsai HL, Kuo SS, Chen JW, Charng MJ, Wu TC, Chen LC, Ding YA, et al: Microsatellite polymorphism in promoter of heme oxygenase-1 gene is associated with susceptibility to coronary artery disease in type 2 diabetic patients. Hum Genet. 2002, 111 (1): 1-8. 10.1007/s00439-002-0769-4.View ArticlePubMedGoogle Scholar
- Song F, Li X, Zhang M, Yao P, Yang N, Sun X, Hu FB, Liu L: Association between heme oxygenase-1 gene promoter polymorphisms and type 2 diabetes in a Chinese population. Am J Epidemiol. 2009, 170 (6): 747-756. 10.1093/aje/kwp196.View ArticlePubMedGoogle Scholar
- Zecevic M, Amos CI, Gu X, Campos IM, Jones JS, Lynch PM, Rodriguez-Bigas MA, Frazier ML: IGF1 gene polymorphism and risk for hereditary nonpolyposis colorectal cancer. J Natl Cancer Inst. 2006, 98 (2): 139-143. 10.1093/jnci/djj016.View ArticlePubMedGoogle Scholar
- Reeves SG, Rich D, Meldrum CJ, Colyvas K, Kurzawski G, Suchy J, Lubinski J, Scott R: IGF1 is a modifier of disease risk in hereditary non-polyposis colorectal cancer. Int J Cancer. 2008, 123: 1339-1343. 10.1002/ijc.23668.View ArticlePubMedGoogle Scholar
- Stanford JL, Just JJ, Gibbs M, Wicklund KG, Neal CL, Blumenstein BA, Ostrander EA: Polymorphic repeats in the androgen receptor gene: molecular markers of prostate cancer risk. Cancer Res. 1997, 57 (6): 1194-1198.PubMedGoogle Scholar
- Ingles SA, Ross RK, Yu MC, Irvine RA, La Pera G, Haile RW, Coetzee GA: Association of prostate cancer risk with genetic polymorphisms in vitamin D receptor and androgen receptor. J Natl Cancer Inst. 1997, 89 (2): 166-170. 10.1093/jnci/89.2.166.View ArticlePubMedGoogle Scholar
- Giovannucci E, Stampfer MJ, Krithivas K, Brown M, Dahl D, Brufsky A, Talcott J, Hennekens CH, Kantoff PW: The CAG repeat within the androgen receptor gene and its relationship to prostate cancer. Proc Natl Acad Sci USA. 1997, 94 (7): 3320-3323. 10.1073/pnas.94.7.3320.PubMed CentralView ArticlePubMedGoogle Scholar
- Antoniou AC, Wang X, Fredericksen ZS, McGuffog L, Tarrell R, Sinilnikova OM, Healey S, Morrison J, Kartsonaki C, Lesnick T, et al: A locus on 19p13 modifies risk of breast cancer in BRCA1 mutation carriers and is associated with hormone receptor-negative breast cancer in the general population. Nat Genet. 2010, 42 (10): 885-892. 10.1038/ng.669.PubMed CentralView ArticlePubMedGoogle Scholar
- Gymrek M, Golan D, Rosset S, Erlich Y: LobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 2012, 22 (6): 1154-1162. 10.1101/gr.135780.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Franchina M, Kadin ME, Abraham LJ: Polymorphism of the CD30 promoter microsatellite repressive element is associated with development of primary cutaneous lymphoproliferative disorders. Cancer Epidemiol Biomarkers Prev. 2005, 14 (5): 1322-1325. 10.1158/1055-9965.EPI-04-0826.View ArticlePubMedGoogle Scholar
- Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D: Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 2013, 41 (1): e32-10.1093/nar/gks981.PubMed CentralView ArticlePubMedGoogle Scholar
- Gulcher J: Microsatellite markers for linkage and association studies. Cold Spring Harb Protoc. 2012, 2012 (4): 425-432.View ArticlePubMedGoogle Scholar
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al: Finding the missing heritability of complex diseases. Nature. 2009, 461 (7265): 747-753. 10.1038/nature08494.PubMed CentralView ArticlePubMedGoogle Scholar
- Hannan AJ: Tandem repeat polymorphisms: modulators of disease susceptibility and candidates for 'missing heritability'. Trends Genet. 2010, 26 (2): 59-65. 10.1016/j.tig.2009.11.008.View ArticlePubMedGoogle Scholar
- Mattick JS: The human genome and the future of medicine. Med J Aust. 2003, 179 (4): 212-216.PubMedGoogle Scholar
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999, 27 (2): 573-580. 10.1093/nar/27.2.573.PubMed CentralView ArticlePubMedGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006.PubMed CentralView ArticlePubMedGoogle Scholar
- Butler JE, Kadonaga JT: The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002, 16 (20): 2583-2592. 10.1101/gad.1026202.View ArticlePubMedGoogle Scholar
- Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM: Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 2006, 16 (1): 1-10.PubMed CentralView ArticlePubMedGoogle Scholar
- Lawson MJ, Zhang L: Housekeeping and tissue-specific genes differ in simple sequence repeats in the 5'-UTR region. Gene. 2008, 407 (1–2): 54-62.View ArticlePubMedGoogle Scholar
- Araujo PR, Yoon K, Ko D, Smith AD, Qiao M, Suresh U, Burns SC, Penalva LO: Before It gets started: regulating translation at the 5' UTR. Comp Funct Genomics. 2012, 2012: 475731-PubMed CentralView ArticlePubMedGoogle Scholar
- Pruitt KD, Tatusova T, Brown GR, Maglott DR: NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012, 40 (Database issue): D130-135.PubMed CentralView ArticlePubMedGoogle Scholar
- Yamasaki C, Murakami K, Takeda J, Sato Y, Noda A, Sakate R, Habara T, Nakaoka H, Todokoro F, Matsuya A, et al: H-InvDB in 2009: extended database and data mining resources for human genes and transcripts. Nucleic Acids Res. 2010, 38 (Database issue): D626-632.PubMed CentralView ArticlePubMedGoogle Scholar
- Yamashita R, Suzuki Y, Sugano S, Nakai K: Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity. Gene. 2005, 350 (2): 129-136. 10.1016/j.gene.2005.01.012.View ArticlePubMedGoogle Scholar
- Saxonov S, Berg P, Brutlag DL: A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci USA. 2006, 103 (5): 1412-1417. 10.1073/pnas.0510310103.PubMed CentralView ArticlePubMedGoogle Scholar
- Metzgar D, Bytof J, Wills C: Selection against frameshift mutations limits microsatellite expansion in coding DNA. Genome Res. 2000, 10 (1): 72-80.PubMed CentralPubMedGoogle Scholar
- Li YC, Korol AB, Fahima T, Nevo E: Microsatellites within genes: structure, function, and evolution. Mol Biol Evol. 2004, 21 (6): 991-1007. 10.1093/molbev/msh073.View ArticlePubMedGoogle Scholar
- Albert I, Mavrich TN, Tomsho LP, Qi J, Zanton SJ, Schuster SC, Pugh BF: Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome. Nature. 2007, 446 (7135): 572-576. 10.1038/nature05632.View ArticlePubMedGoogle Scholar
- Martinez-Campa C, Politis P, Moreau JL, Kent N, Goodall J, Mellor J, Goding CR: Precise nucleosome positioning and the TATA box dictate requirements for the histone H4 tail and the bromodomain factor Bdf1. Mol Cell. 2004, 15 (1): 69-81. 10.1016/j.molcel.2004.05.022.View ArticlePubMedGoogle Scholar
- Heidari A, Nariman Saleh Fam Z, Esmaeilzadeh Gharehdaghi E, Banan M, Hosseinkhani S, Mohammadparast S, Oladnabi M, Ebrahimpour MR, Soosanabadi M, Farokhashtiani T, et al: Core promoter STRs: novel mechanism for inter-individual variation in gene expression in humans. Gene. 2012, 492 (1): 195-198. 10.1016/j.gene.2011.10.028.View ArticlePubMedGoogle Scholar
- Ogilvie AD, Battersby S, Bubb VJ, Fink G, Harmar AJ, Goodwim GM, Smith CA: Polymorphism in serotonin transporter gene associated with susceptibility to major depression. Lancet. 1996, 347 (9003): 731-733. 10.1016/S0140-6736(96)90079-3.View ArticlePubMedGoogle Scholar
- Berridge MJ, Lipp P, Bootman MD: The versatility and universality of calcium signalling. Nat Rev Mol Cell Biol. 2000, 1 (1): 11-21.View ArticlePubMedGoogle Scholar
- Missiaen L, Robberecht W, van den Bosch L, Callewaert G, Parys JB, Wuytack F, Raeymaekers L, Nilius B, Eggermont J, De Smedt H: Abnormal intracellular ca(2+)homeostasis and disease. Cell Calcium. 2000, 28 (1): 1-21. 10.1054/ceca.2000.0131.View ArticlePubMedGoogle Scholar
- Grube S, Gerchen MF, Adamcio B, Pardo LA, Martin S, Malzahn D, Papiol S, Begemann M, Ribbe K, Friedrichs H, et al: A CAG repeat polymorphism of KCNN3 predicts SK3 channel function and cognitive performance in schizophrenia. EMBO Mol Med. 2011, 3 (6): 309-319. 10.1002/emmm.201100135.PubMed CentralView ArticlePubMedGoogle Scholar
- Fondon JW, Hammock EA, Hannan AJ, King DG: Simple sequence repeats: genetic modulators of brain function and behavior. Trends Neurosci. 2008, 31 (7): 328-334. 10.1016/j.tins.2008.03.006.View ArticlePubMedGoogle Scholar
- Fondon JW, Garner HR: Molecular origins of rapid and continuous morphological evolution. Proc Natl Acad Sci USA. 2004, 101 (52): 18058-18063. 10.1073/pnas.0408118101.PubMed CentralView ArticlePubMedGoogle Scholar
- Caburet S, Cocquet J, Vaiman D, Veitia RA: Coding repeats and evolutionary "agility". Bioessays. 2005, 27 (6): 581-587. 10.1002/bies.20248.View ArticlePubMedGoogle Scholar
- Haygood R, Fedrigo O, Hanson B, Yokoyama KD, Wray GA: Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat Genet. 2007, 39 (9): 1140-1144. 10.1038/ng2104.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.