Reference genome sequences
The human reference genome [24] (hg18, NCBI build 36) and the mouse reference genome [25] (mm8, NCBI build 36) were used in the analyses. Base pairs assigned 'N' (i.e. gaps) in the reference sequences were omitted from the analysis, and the pruned genomes are referred to as the "entire genomes" (human genome length: 2,858,013,089 bp; mouse genome length: 2,550,169,439 bp).
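As an illustration of the pruning step, the following minimal sketch counts the bases that remain once positions marked 'N' are dropped. The function names and the toy sequences are assumptions made for this example; treating lower-case (soft-masked) 'n' as a gap is also an assumption.

```python
def effective_length(sequence: str) -> int:
    """Number of bases that are not 'N' (case-insensitive; an assumption)."""
    upper = sequence.upper()
    return len(upper) - upper.count("N")


def effective_genome_length(chromosomes: dict) -> int:
    """Sum of non-gap bases over all chromosomes (name -> sequence)."""
    return sum(effective_length(seq) for seq in chromosomes.values())


# Toy input, not real genome data.
toy_genome = {"chr1": "ACGTNNNNacgt", "chr2": "NNTTGGCCnn"}
toy_total = effective_genome_length(toy_genome)
```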
Short tandem repeat identification
We identified STRs by scanning the genomes for DNA segments that fulfil four criteria. A sequence is defined as an STR with period p if (1) the length of the sequence is at least 9 bp, (2) a motif (e.g., AT in ATATATATAT) of length p (≥1) is repeated at least three times, (3) at most one bp fails to match a perfect repetition of the motif in sliding windows of max(12, 3·p) bp, and (4) the two flanking bp of the sequence must match the motif. Known polymorphic single nucleotide substitutions are used to allow mismatches in the reference genome; consequently, all possible alleles are analyzed (Figure 1). We used all polymorphic single nucleotide substitutions from ENSEMBL 46 (containing dbSNP build 127 [26] for human and dbSNP build 126 [26] for mouse). The data were downloaded as the "ENSEMBL 46 VARIATION" track from the BioMart Browser [27]. If an STR sequence was assigned more than one period, we used the smallest.
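The scan can be illustrated with a deliberately simplified sketch that handles perfect repeats only. It applies the length criterion (1), the copy-number criterion (2), and the smallest-period rule. It does not implement the one-mismatch sliding window of criterion (3) or the SNP-aware allele handling. This is not the script used in the study; all function names, the half-open coordinates, and the toy sequence are assumptions.

```python
from typing import List, Optional, Tuple


def smallest_period(segment: str, max_period: int) -> Optional[int]:
    """Smallest p <= max_period such that segment[k] == segment[k + p] throughout."""
    for p in range(1, max_period + 1):
        if len(segment) > p and all(
            segment[k] == segment[k + p] for k in range(len(segment) - p)
        ):
            return p
    return None


def find_perfect_strs(
    seq: str, max_period: int = 9, min_length: int = 9, min_copies: int = 3
) -> List[Tuple[int, int, int]]:
    """Return (start, end, period) for perfect tandem repeats, end exclusive.

    Simplification (assumption): no mismatches are tolerated.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for p in range(1, max_period + 1):
        i = 0
        while i + p < n:
            if seq[i] != seq[i + p] or seq[i] == "N":
                i += 1
                continue
            # Extend the run of positions k with seq[k] == seq[k + p].
            j = i
            while j + p < n and seq[j] == seq[j + p] and seq[j] != "N":
                j += 1
            start, end = i, j + p
            long_enough = end - start >= max(min_length, min_copies * p)
            # Keep the segment only under its smallest period.
            if long_enough and smallest_period(seq[start:end], p) == p:
                hits.append((start, end, p))
            i = j
    return sorted(hits)


toy_sequence = "GGC" + "AT" * 5 + "CCG" + "A" * 10 + "T"
toy_hits = find_perfect_strs(toy_sequence)
```

In the toy sequence, the AT run is reported with period 2 and the poly-A run with period 1. The poly-A run also fits periods 2 and 3, but only the smallest is retained.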
Only 0.8% of all STRs with periods 1–25 have period > 9 (Additional file 1: Figure S3), hence we only used periods <10 bp when analyzing the entire genomes. The entire human genome has 114,996,351 bp tagged as STRs (4.02% of the entire genome), and the entire mouse genome contains 137,927,765 bp tagged as STRs (5.41% of the entire genome).
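The reported STR proportions follow directly from the base-pair counts given above, as the short check below shows. Variable and function names are chosen for this example only.

```python
# Counts as stated in the text.
human_str_bp, human_genome_bp = 114_996_351, 2_858_013_089
mouse_str_bp, mouse_genome_bp = 137_927_765, 2_550_169_439


def str_percentage(str_bp: int, genome_bp: int) -> float:
    """Percentage of the genome tagged as STR, rounded to two decimals."""
    return round(100 * str_bp / genome_bp, 2)


human_pct = str_percentage(human_str_bp, human_genome_bp)
mouse_pct = str_percentage(mouse_str_bp, mouse_genome_bp)
```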
Indels
We used all insertions and deletions (indels) from ENSEMBL 46 (containing dbSNP build 127 [26]). The data were downloaded as the "ENSEMBL 46 VARIATION" track from the BioMart Browser [27]. To obtain validated indels only, the data were filtered to contain only observations with validation "freq" and/or "doublehit" (the minor allele is seen at least twice) and Mapweight 1 (the highest quality alignments), resulting in 4,351 validated insertions and 16,899 validated deletions. To differentiate between insertions and deletions, we used the state given by dbSNP, which is defined according to the reference sequence.
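A minimal sketch of the validation filter is given below. The record layout (field names, the representation of validation status as a set of strings) is hypothetical and does not reproduce the BioMart export format; the toy records are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndelRecord:
    # Hypothetical fields, not the actual BioMart column names.
    name: str
    kind: str              # "insertion" or "deletion", relative to the reference
    validation: frozenset  # e.g. frozenset({"freq", "doublehit"})
    map_weight: int


ACCEPTED_VALIDATION = {"freq", "doublehit"}


def is_validated(record: IndelRecord) -> bool:
    """Keep records validated by 'freq' and/or 'doublehit' with map weight 1."""
    return bool(record.validation & ACCEPTED_VALIDATION) and record.map_weight == 1


toy_records = [
    IndelRecord("toy1", "insertion", frozenset({"freq"}), 1),
    IndelRecord("toy2", "deletion", frozenset({"cluster"}), 1),
    IndelRecord("toy3", "deletion", frozenset({"doublehit", "freq"}), 1),
    IndelRecord("toy4", "insertion", frozenset({"doublehit"}), 2),
]
validated = [r for r in toy_records if is_validated(r)]
validated_insertions = [r for r in validated if r.kind == "insertion"]
validated_deletions = [r for r in validated if r.kind == "deletion"]
```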
Disease-related gene sets
Human and mouse genes were downloaded using BioMart (ENSEMBL 46) [27] only including "KNOWN" genes with "KNOWN" transcripts. This resulted in 21,658 human genes with 39,684 transcripts and 21,946 mouse genes with 28,576 transcripts. If a gene had multiple transcripts we clustered all exons from all transcripts into one super-transcript.
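The clustering of exons into a super-transcript can be sketched as an interval merge. The sketch assumes half-open (start, end) coordinates on one strand and assumes that abutting exons are merged as well as overlapping ones; neither detail is stated above, and the function name is illustrative.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive (an assumption)


def merge_exons(exons: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or abutting exons from all transcripts of one gene."""
    merged: List[Interval] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy exons pooled from several transcripts of a single gene.
toy_exons = [(400, 500), (100, 200), (150, 250), (450, 480), (600, 700)]
super_transcript = merge_exons(toy_exons)
super_transcript_length = sum(end - start for start, end in super_transcript)
```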
The OMIM Morbid Map (August 30, 2007), which contains the cytogenetic map locations of all disease genes described in the OMIM database [28], was used to assign disease status to human genes. We created four sets of human disease genes. The general set (all diseases, 2,095 genes) consists of all Morbid Map genes, except genes annotated with terms related to homosexuality or protection against disease. Three subsets were defined using disease terms: a leukaemia set (70 genes, term: 'leukaemia'), a cancer set excluding leukaemia (151 genes, terms: 'carcinom', 'cancer', 'tumour', 'burkitt lymphoma', 'malignant melanoma', 'multiple endocrine neoplasia', 'neurofibromatosis', 'polycystic kidney disease', 'harvey ras oncogene', 'retinoblastoma', 'tuberous sclerosis' and 'von hippel-lindau syndrome') and an immune system disease set excluding cancer and leukaemia (52 genes, terms: 'asthma', 'ataxia telangiectasia', 'autoimmune', 'digeorge syndrome' and 'immunodeficiency').
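The term-based subsets can be illustrated as follows. The sketch assumes case-insensitive substring matching against a free-text disease description and resolves the stated exclusions by precedence (leukaemia, then cancer, then immune). The matching procedure itself is not specified above, so both points are assumptions, as are the function names and example descriptions.

```python
from typing import Optional

LEUKAEMIA_TERMS = ("leukaemia",)
CANCER_TERMS = (
    "carcinom", "cancer", "tumour", "burkitt lymphoma", "malignant melanoma",
    "multiple endocrine neoplasia", "neurofibromatosis",
    "polycystic kidney disease", "harvey ras oncogene", "retinoblastoma",
    "tuberous sclerosis", "von hippel-lindau syndrome",
)
IMMUNE_TERMS = (
    "asthma", "ataxia telangiectasia", "autoimmune", "digeorge syndrome",
    "immunodeficiency",
)


def matches(description: str, terms) -> bool:
    """Case-insensitive substring match (an assumption)."""
    text = description.lower()
    return any(term in text for term in terms)


def assign_subset(description: str) -> Optional[str]:
    """Assign one subset; earlier sets take precedence over later ones."""
    if matches(description, LEUKAEMIA_TERMS):
        return "leukaemia"
    if matches(description, CANCER_TERMS):
        return "cancer"
    if matches(description, IMMUNE_TERMS):
        return "immune"
    return None


# Invented descriptions for illustration only.
example_labels = [
    assign_subset("Leukaemia, acute myeloid; cancer predisposition"),
    assign_subset("Hepatocellular carcinoma"),
    assign_subset("Severe combined immunodeficiency"),
    assign_subset("Colour blindness"),
]
```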
We defined two non-overlapping sets of mouse disease genes. The first set of 294 mouse cancer genes is the result of querying the Mouse Genome Database (MGD) [29] for "increased tumour incidence" in the mammalian phenotype ontology [29]. The second set consists of 764 mouse genes associated with "postnatal lethality" after removal of genes overlapping the cancer set.
Reference gene sets
The reference set of "non-disease-related" human genes was defined as the 11,210 known genes not found in the OMIM database, whereas the mouse reference set was defined as the 17,171 known mouse genes not mapped to the mammalian phenotype ontology [29].
Known tandem repeats
The "Simple Repeats" track in the UCSC Genome Browser [30] acts as a de facto definition of tandem repeats (possibly imperfect), identified by Tandem Repeats Finder (TRF) [31]. The track was created using the following parameter settings for TRF: match = 2, mismatch = 7, indels = 7, matching probability = 0.80, indel probability = 0.10, maximum period = 50, and minimum alignment score = 2000.
STRs outside known tandem repeats
STRs outside known tandem repeats are defined by applying the following two filters: (A) STRs inside known tandem repeats are omitted from the analysis; (B) all contiguous segments of STRs are clustered, and all such clusters which are more than 25 bp long are omitted from the analysis.
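Filters (A) and (B) can be sketched as shown below. The sketch assumes half-open (start, end) coordinates on a single chromosome, reads "inside" as fully contained in a known tandem repeat, treats overlapping or abutting STRs as contiguous, and applies (A) before (B). None of these details is fixed by the definition above, and the function name and toy intervals are illustrative.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), end exclusive (an assumption)


def filter_strs(
    strs: List[Interval], known_repeats: List[Interval], max_cluster_len: int = 25
) -> List[Interval]:
    # Filter (A): drop STRs fully contained in a known tandem repeat.
    outside = [
        (s, e)
        for s, e in strs
        if not any(k_start <= s and e <= k_end for k_start, k_end in known_repeats)
    ]
    # Filter (B): cluster contiguous STRs, then drop clusters longer than 25 bp.
    clusters = []  # each entry: [cluster_start, cluster_end, member_list]
    for seg in sorted(outside):
        if clusters and seg[0] <= clusters[-1][1]:
            clusters[-1][1] = max(clusters[-1][1], seg[1])
            clusters[-1][2].append(seg)
        else:
            clusters.append([seg[0], seg[1], [seg]])
    kept: List[Interval] = []
    for c_start, c_end, members in clusters:
        if c_end - c_start <= max_cluster_len:
            kept.extend(members)
    return kept


toy_strs = [(10, 20), (20, 30), (30, 40), (100, 112), (200, 215), (300, 310)]
toy_known = [(190, 260)]
toy_kept = filter_strs(toy_strs, toy_known)
```

In the toy data, the STR at 200-215 is removed by filter (A), and the three abutting STRs at 10-40 form a 30 bp cluster that is removed by filter (B).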
Statistical methods
To test for an excess of insertions/deletions in STRs, we used a binomial test. The observed number of insertions/deletions inside STRs was compared to the binomial distribution b(n, p), where n is the total number of validated insertions/deletions and p = 0.0402 is the proportion of STRs in the human genome. We define an indel to be inside an STR segment if the midpoint of the indel is within the segment. The midpoint is defined as (s+e)/2, where s is the start coordinate and e is the end coordinate of the indel.
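The midpoint rule and the binomial test can be sketched as follows. The sketch assumes a one-sided (upper-tail) test and inclusive segment coordinates, neither of which is stated above, and evaluates the tail in log space to avoid underflow for large n. Function names are illustrative. Under the null hypothesis, the expected counts inside STRs are n·p ≈ 175 for the 4,351 insertions and ≈ 679 for the 16,899 deletions.

```python
import math
from typing import List, Tuple


def indel_in_str(s: int, e: int, str_segments: List[Tuple[int, int]]) -> bool:
    """True if the indel midpoint (s + e) / 2 lies within any STR segment.

    Segment bounds are treated as inclusive (an assumption).
    """
    midpoint = (s + e) / 2
    return any(start <= midpoint <= end for start, end in str_segments)


def binom_log_pmf(k: int, n: int, p: float) -> float:
    """log P(X = k) for X ~ b(n, p), with 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """One-sided p-value P(X >= k) for X ~ b(n, p)."""
    if k <= 0:
        return 1.0
    total = sum(math.exp(binom_log_pmf(i, n, p)) for i in range(k, n + 1))
    return min(1.0, total)


# Small check against a value that can be computed by hand: 56 / 1024.
toy_p_value = binom_upper_tail(8, 10, 0.5)
```

For the data described above, the call would take the form binom_upper_tail(k_observed, 4351, 0.0402), where k_observed is the number of validated insertions whose midpoint falls inside an STR.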
The distribution of relative STR amount is non-Gaussian (Additional file 1: Figure S4), so a standard t-test cannot be applied. Instead, we used the Wilcoxon rank-sum test [32] to compare the relative STR amount in disease-related genes to the relative amount in reference genes, because the test does not require assumptions about the underlying distribution of relative STR amount. The STR overrepresentation for each disease-related gene set is found by comparing the estimated median relative STR content in the disease-related gene set to the estimated median relative STR content in the reference gene set. Confidence intervals of the estimated overrepresentation are obtained by Gaussian approximation of the distribution of rank sums from the Wilcoxon rank-sum test.
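A minimal, standard-library sketch of the rank-sum test with a Gaussian approximation is given below. It uses mid-ranks with a tie correction and no continuity correction, so results may differ slightly from those of R's wilcox.test, which applies a continuity correction by default. The function name and the toy values are assumptions, and the sketch does not cover the median-based overrepresentation estimate or its confidence interval.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (W, z, two-sided p), where W is the sum of ranks of x."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = mid_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w - mean_w) / math.sqrt(var_w)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p_value


# Toy relative STR amounts (invented): disease-related genes vs. reference genes.
toy_w, toy_z, toy_p = rank_sum_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```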
All data were analyzed using Python (http://www.python.org) and R (http://www.R-project.org) [33]. All scripts are available upon request.