In this study we examined the repeat array size and distribution of eight MSRs in three populations of the HapMap panel, a resource widely used to identify and catalog genetic similarities and differences among humans. Until now, MSRs have been considerably understudied in the HapMap panel because, despite the rapid development of advanced genome research technologies, technical and bioinformatic limitations have precluded detailed analysis of their sequence and composition. MSR array size was analyzed in 210 unrelated individuals for the autosomal repeats RS447, MSR5p, FLJ40296, RNU2, D4Z4 (4q) and D4Z4 (10q) and the X chromosomal repeats DXZ4 and CT47.
The MSRs studied are highly polymorphic: some show size variations of hundreds of kilobases (e.g. MSR5p), while for others the variation is much more limited (e.g. CT47). For DXZ4 and MSR5p a normal distribution is not rejected in the African and Caucasian population, respectively, while the other MSRs are significantly skewed and/or have significant (excess) kurtosis. MSRs do not behave consistently within or between populations, but overall we observe the least genetic variation among Asians and the highest among Africans. This is consistent with the African origin of modern humans, with most human genetic diversity found in Africans, as indicated previously by studies of microsatellites and SNPs.
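The skewness and excess kurtosis referred to here can be illustrated with a minimal sketch based on sample moments. The function name and the example array sizes are hypothetical and are not data from this study; the significance tests applied to these statistics are not reproduced.

```python
def skewness_and_excess_kurtosis(sizes):
    """Return (skewness, excess kurtosis) from the central sample moments.

    Positive skewness means a longer tail towards large array sizes;
    excess kurtosis is 0 for a normal distribution.
    """
    n = len(sizes)
    mean = sum(sizes) / n
    # Central moments of order 2, 3 and 4 (population form, divisor n).
    m2 = sum((x - mean) ** 2 for x in sizes) / n
    m3 = sum((x - mean) ** 3 for x in sizes) / n
    m4 = sum((x - mean) ** 4 for x in sizes) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return skew, excess_kurt


# Hypothetical array sizes (in repeat units) with one long array in the tail.
example_sizes = [8, 9, 9, 10, 10, 11, 30]
skew, kurt = skewness_and_excess_kurtosis(example_sizes)
print(f"skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

A distribution dominated by a few long arrays, as in this made-up example, gives positive skewness, which is the pattern described for most of the MSRs.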
Repeat array instability can result in either contractions or expansions of the array over time. However, none of the MSRs showed a distribution skewed towards shorter repeat sizes, while skewness towards longer repeat sizes was often observed, indicating that repeat expansion is more common than repeat contraction. A minimum of two repeat units is required to form an array, but for most of the studied MSRs we observe a higher minimum unit number, e.g. the shortest array detected for RS447 and MSR5p is eight units. This is reminiscent of observations in repeat structures composed of much smaller units, in which a repeat only becomes variable in length above a certain minimum threshold, despite the different mechanisms by which copy number changes are generated in these structures. Microsatellite copy number variation is mainly created by replication slippage of the DNA polymerase, while previous studies showed that the D4Z4 MSR preferentially contracts or expands by sister chromatid exchange. The lack of MSR arrays below a certain threshold might suggest either that these sizes are less favored by this rearrangement mechanism or that they create an unfavorable chromatin structure and related transcriptional activity, perhaps associated with disease, as is seen for D4Z4 in the context of FSHD. Furthermore, it is possible that individuals with shorter repeat arrays were missed in this analysis because of the limited sample size. However, for D4Z4 (10q) it is known that approximately 16% of Caucasian individuals carry repeat arrays <11 units. In this study, 17% of the Caucasian individuals and 16% of all individuals showed D4Z4 (10q) array sizes of ≤10 units, suggesting that for D4Z4 (10q) the HapMap panels are representative of a larger population. Therefore it is unlikely that shorter arrays were missed because of a limited sample size.
Moreover, D4Z4 (10q) arrays <11 units were observed in 20% of the Asian and 9% of the African individuals, indicating a shift in the prevalence of repeat array sizes where longer repeat sizes are least prevalent in Asians, more in Caucasians and most in Africans.
We did not observe the meiotic mutation rate of 8.3% that was found previously for DXZ4 and RS447, probably because we studied only a limited number of meioses. However, a striking observation is the high mitotic instability found in all eight MSRs. A mitotic recombination rate of approximately 3% has been reported earlier for the D4Z4 (4q) and D4Z4 (10q) repeat arrays in the European population, where 1% of the individuals were mosaic for D4Z4 (4q) and 1.5% for D4Z4 (10q). Our findings indicate a comparable mitotic recombination rate of 0.4-2.2% in the other MSRs, which is four to ten times higher than the recombination rate of 0.1-0.2% most often described for microsatellite repeats [27, 28]. This difference in recombination rate is probably explained by the different mechanisms that generate copy number variation in these structures [20, 24]. Thus, MSRs are prone to frequent (mitotic) rearrangements and their polymorphic nature is likely a reflection of these high recombination rates.
Besides meiotic and mitotic instability, for five of the eight MSRs we also observed individuals showing additional bands on the Southern blots that could not be explained by mosaicism or culture-induced instability. These complex repeats suggest either the presence of an additional restriction site within the repeat array for the enzyme used to digest the flanking regions, or the presence of a homologous or duplicated repeat array elsewhere in the genome. In the latter case the duplicated array is probably located at the same locus, because the two fragments always co-segregated when inherited by the offspring. Rearrangements between homologous chromosomes can also occur, as described for D4Z4 (4q) and D4Z4 (10q), where exchanges between the repeat arrays occurred during evolution, leading to the formation of hybrid alleles consisting of a combination of units derived from chromosomes 4 and 10.
Size variation in MSRs has been implicated in epigenetic control of the human genome, affecting the expression of transcripts within and adjacent to the MSR. Therefore, their size regulation should be under strict control to avoid detrimental epigenomic consequences of copy number variation. Although MSRs often undergo rearrangements, our multimodality analysis indicates that these rearrangements do not occur randomly. Previously we proposed that the D4Z4 array size distribution shows multimodality with three equidistant peaks at intervals of ~65 kb. Given the preferred mechanism of rearrangement, this multimodality can be based on a founder effect, where two ancestral alleles or different chromosomal backgrounds (as has been shown for D4Z4 (4q) and D4Z4 (10q)) give rise to size variation over time and show little inter-chromosomal interaction. Since MSRs are very polymorphic, it is more likely that this multimodality is based on other factors such as chromatin restrictions, where certain chromatin states are more favorable than others. In this study we observed evidence for multimodality for seven of the eight MSRs, with CT47 being the exception, showing a unimodal distribution. This unimodality may be due to the small size range observed for CT47, where further repeat expansion could give rise to an additional mode. Multimodality in the MSR sizes can also arise from the presence of so-called recombination hotspots within the MSR array. As was found for segmental duplications, MSRs may also be unstable DNA structures and therefore contain certain sequences that can function as hotspots of structural genomic rearrangements.
Of the five MSRs showing very strong evidence for multimodality, MSR5p, RNU2 and D4Z4 (10q) are also likely to show equidistant intervals between the modes. The location of the modes and the distance between them vary depending on the population and the MSR. The unequal location of the modes can still indicate the remnants of a founder effect, where the newly rearranged alleles originate from the ancestral ones and form a distribution around the size of the ancestral allele. During this process, arrays with energetically more favorable sizes are produced more frequently, since a higher order chromatin structure is imposed upon the MSR, limiting its variation in array length. Thus, it seems that each MSR is organized in its own specific way, constrained by its own minimal array size. However, seven MSRs show multimodality and at least three of those also show equidistant intervals between the modes, suggesting a more universal organization of MSRs in our genome in which they are arranged into higher order chromatin structures.
By applying Bayesian statistical methods in our study, we were able to perform a powerful analysis of multimodality in the MSR size distributions. Since the Bayesian analysis of flexible mixtures of (shifted) Poisson distributions allowed us to estimate the posterior probability without the necessity of making assumptions beforehand, it is a promising method for future studies on frequencies of CNVs or gene expression.
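The mixture model underlying this analysis can be illustrated with a minimal sketch. It assumes that each component is a Poisson distribution displaced by an integer shift, interpreted here as a minimum array size in repeat units. The function names, parameter values and example sizes are hypothetical and are not taken from the study. The sketch only evaluates the mixture likelihood for fixed parameters; it does not reproduce the Bayesian estimation of the number of modes.

```python
import math


def shifted_poisson_pmf(k, rate, shift):
    """P(X = k) for X = shift + Poisson(rate), rate > 0; zero below the shift."""
    j = k - shift
    if j < 0:
        return 0.0
    # Evaluate on the log scale to avoid overflow in the factorial.
    return math.exp(-rate + j * math.log(rate) - math.lgamma(j + 1))


def mixture_pmf(k, weights, rates, shifts):
    """Probability of size k under a weighted mixture of shifted Poissons."""
    return sum(
        w * shifted_poisson_pmf(k, r, s)
        for w, r, s in zip(weights, rates, shifts)
    )


def log_likelihood(sizes, weights, rates, shifts):
    """Log-likelihood of the observed sizes for fixed mixture parameters."""
    return sum(math.log(mixture_pmf(k, weights, rates, shifts)) for k in sizes)


# Hypothetical array sizes (repeat units) clustered around two modes.
sizes = [10, 11, 12, 13, 14, 33, 34, 35, 36]

# One component versus two components, with hand-picked (assumed) parameters.
ll_one = log_likelihood(sizes, [1.0], [14.0], [8])
ll_two = log_likelihood(sizes, [0.5, 0.5], [4.0, 4.5], [8, 30])
print(f"one component: {ll_one:.1f}, two components: {ll_two:.1f}")
```

In a Bayesian treatment, this likelihood would be combined with priors on the weights, rates, shifts and number of components to obtain posterior probabilities for unimodal versus multimodal distributions.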