Analysis of the largest tandemly repeated DNA families in the human genome

Background Tandemly Repeated DNA represents a large portion of the human genome, and accounts for a significant amount of copy number variation. Here we present a genome wide analysis of the largest tandem repeats found in the human genome sequence. Results Using Tandem Repeats Finder (TRF), tandem repeat arrays greater than 10 kb in total size were identified, and classified into simple sequence e.g. GAATG, classical satellites e.g. alpha satellite DNA, and locus specific VNTR arrays. Analysis of these large sequenced regions revealed that several "simple sequence" arrays actually showed complex domain and/or higher order repeat organization. Using additional methods, we further identified a total of 96 additional arrays with tandem repeat units greater than 2 kb (the detection limit of TRF), 53 of which contained genes or repeated exons. The overall size of an array of tandem 12 kb repeats which spanned a gap on chromosome 8 was found to be 600 kb to 1.7 Mbp in size, representing one of the largest non-centromeric arrays characterized. Several novel megasatellite tandem DNA families were observed that are characterized by repeating patterns of interspersed transposable elements that have expanded presumably by unequal crossing over. One of these families is found on 11 different chromosomes in >25 arrays, and represents one of the largest most widespread megasatellite DNA families. Conclusion This study represents the most comprehensive genome wide analysis of large tandem repeats in the human genome, and will serve as an important resource towards understanding the organization and copy number variation of these complex DNA families.


Background
Tandemly repeated DNA makes up a significant portion of the human genome. Although historically relegated as "junk DNA", tandem repeats have taken on a new importance with the realization that their tandem organization provides potentially unique functional characteristics.
Tandemly repeated DNA is organized as multiple copies of a homologous DNA sequence of a certain size (repeat unit) that are arranged in a head to tail pattern to form tandem arrays, and thus represents a distinct type of sequence organization shared by all sequenced genomes. Centromeres from fission yeast to humans contain tandem repeats that are critically important for establishing heterochromatin formation and proper chromosome segregation (reviewed in [1]). Furthermore, tandem repeats have been shown to play a role in paramutation in maize [2,3] and FWA gene regulation in Arabidopsis [4]. Overall, many of these functions appear to involve RNA interference-mediated chromatin modifications [3,5].
Tandem DNA in the human genome shows a wide range of repeat sizes and organization, ranging from microsatellites of a few base pairs to megasatellites of up to several kb [6,7]. Microsatellites and variable number of tandem repeats (VNTRs) can be highly polymorphic and are important for use as genetic markers. Contraction of a 3.3 kb polymorphic tandem repeat array on chromosome band 4q35 is associated with facioscapulohumeral muscular dystrophy (FSHD) [8][9][10]. Higher copy number of the salivary amylase (AMY1) gene correlates with protein level in populations with a high starch diet [11]. The Duf1220 protein domain, which is highly expressed in brain, was observed to be amplified specifically in the human lineage and to be undergoing positive selection [12]. And the large macrosatellite array DXZ4 appears to have a unique function in the process of X chromosome inactivation [13]. Thus, tandem repeats play important functional and evolutionary roles in genome biology.
Centromeres of human chromosomes contain the largest tandem DNA family in the human genome, called alpha satellite DNA, which has been extensively studied and has emerged as a paradigm for understanding the genomic organization of tandem DNA [14][15][16]. Its fundamental repeat unit consists of 171 bp monomers, which are found in large highly homologous arrays of up to several million bp at the centromeres of all human chromosomes. These tandem arrays are composed either of diverged monomers with no detectable higher-order structure, or of chromosome-specific higher order repeat units (HORs) characterized by distinct repeating linear arrangements of an integral set of 171 bp monomers [17]. This HOR structure correlates with centromere function [16].
The assembled DNA sequence of most human chromosomes ends abruptly in large gaps in centromeres and heterochromatic regions, often in arrays of diverged monomeric alpha satellite or other tandemly repeated "satellite" DNA. To date, only chromosome 8 and the X chromosome end in the higher-order repeat units known to be at these centromeres [17,18]. Furthermore, the large regions of classical heterochromatin, such as those on the long arms of chromosomes 1, 9 and 16 and the short arms of the acrocentric chromosomes 13, 14, 15, 21 and 22, are poorly covered by assembled sequence. These regions are rich in tandemly repeated DNA families, which makes their sequencing and assembly difficult. Similarly, the human Y chromosome is rich in repetitive DNA sequences, and required special efforts to obtain a detailed sequence [19].
The following report represents the first overall genomewide assessment of the number, position and organization of tandem repeats in the sequenced human genome, and therefore represents an important resource for further characterization and overall understanding of genomic organization. An overall genome-wide view of the representation of major tandem repeat families found in the current version of the sequenced human genome (hg18) is important for the study of copy number variation (CNV), because many of these arrays will be highly variable in copy number between individuals. Therefore, bioinformatic analysis was performed in this report using both existing database software and additional searches to extend the databases. The largest most prominent arrays ≥ 10 kb in size currently found in the assembled genome are examined, including large arrays of "simple satellite" sequences such as GAATGn and VNTRs, which reveal unexpected higher-order organization. 96 additional large arrays that are beyond the detection limit of current databases have been identified, which include multicopy gene families and large novel satellite DNA families which display distinct tandem arrangements.

Classical satellite repetitive DNA in the human genome
We performed a bioinformatics analysis of the tandemly repeated DNA using the output from Tandem Repeats Finder (TRF) run against hg18 [6] (http://tandem.bu.edu/cgi-bin/trdb/trdb.exe), which reports 947,696 arrays containing tandem repeats ranging in size from 2 to 2000 bp [7]. Figure 1 shows those arrays >600 bp plotted by repeat unit size vs. array size (note log scales). In order to examine the largest most prominent tandem repeats in the human genome assembly, the 503 arrays larger than 10 kb were further classified into repeat classes (Figure 1). 373 (74%) of these large tandem arrays represented pericentromeric alpha satellite DNA, which showed repeat unit sizes of ~171 bp and multiples (~342 bp, ~512 bp,... 1866 bp), found in arrays as large as 188 kb. Despite this prominence in the TRF dataset, the majority of alpha satellite DNA remains unassembled and is represented by megabase-sized centromeric gaps [17,18]. The TRF output (Figure 1) includes some redundancies where the same array is reported more than once as multiples of a basic monomeric repeat unit, e.g. 5 bp repeat units can also be reported as arrays of 10, 15, 20 bp repeat units, etc. We therefore compiled all tandem arrays >10 kb found in the human genome sequence (except alpha satellite DNA arrays) (Table 1 and 2), with all redundancies from TRF removed. We cross referenced these arrays with the "simple, low complexity, and satellite" Repeat Masker tracks [20] from the UCSC genome browser and their associated entries in the Repbase database of repetitive elements [21]. Table 1 also includes several large arrays derived by combining multiple smaller (<10 kb) overlapping arrays found by TRF but not highlighted in Figure 1. Table 1 represents the most prominent arrays of satellite sequences currently found in the human genome sequence.

Table notes. Assembly: Gdis, gap at distal end of array; Gprox, gap at proximal end of array; Gint, gap internal to array. Inv, inversion within array. SD, array is associated with a surrounding segmental duplication (not including segmental duplications due merely to repeat unit homology within array). Intra, intrachromosomal duplication of related arrays on the same chromosome; Inter, interchromosomal duplication of related arrays on different chromosomes. (1)-(9): related arrays are grouped by number, either intra or interchromosomal, or both, as indicated.
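The redundancy described above, where one array is reported again at integer multiples of its basic repeat unit, can be illustrated with a minimal sketch. The text does not state how redundancies were actually removed, so the `Array` record, the 90% overlap threshold and the 0.1 tolerance on the period ratio are all assumptions made here for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Array:
    """Hypothetical record for one TRF hit (field names are assumptions)."""
    chrom: str
    start: int
    end: int
    period: int  # repeat unit size in bp

def overlap_fraction(a, b):
    """Fraction of the shorter array covered by the overlap of a and b."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    shorter = min(a.end - a.start, b.end - b.start)
    return max(overlap, 0) / shorter if shorter > 0 else 0.0

def is_multiple(small, large, tolerance=0.1):
    """True if large/small lies within `tolerance` of an integer >= 2."""
    ratio = large / small
    nearest = round(ratio)
    return nearest >= 2 and abs(ratio - nearest) <= tolerance

def collapse_redundant(arrays, min_overlap=0.9):
    """Keep the smallest-period report for each set of overlapping hits."""
    kept = []
    for arr in sorted(arrays, key=lambda a: a.period):
        redundant = any(
            overlap_fraction(arr, k) >= min_overlap
            and is_multiple(k.period, arr.period)
            for k in kept
        )
        if not redundant:
            kept.append(arr)
    return kept

# Made-up coordinates: a 5 bp array also reported at 10 and 15 bp,
# and a 34 bp VNTR also reported at 101 bp (roughly 3 x 34).
hits = [
    Array("chrY", 1_000, 101_000, 5),
    Array("chrY", 1_000, 101_000, 10),
    Array("chrY", 1_000, 101_000, 15),
    Array("chr13", 5_000, 25_000, 34),
    Array("chr13", 5_000, 25_000, 101),
]
print([a.period for a in collapse_redundant(hits)])
```

The approximate-multiple test is deliberately tolerant because diverged arrays can be reported at periods that are only roughly integer multiples of the basic unit.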
The most abundant tandem repeats after alpha satellite DNA are satellites II and III (Table 1). Satellite III is composed primarily of the pentameric sequence GAATGn (or CATTCn), identified by Repeat Masker as a simple satellite. It forms a prominent family in the TRF output (Figure 1), with many large arrays at 5 bp and multiples thereof, including some repeat units as large as 70 bp, forming arrays of up to ~100 kb. Satellite II is based on highly diverged arrays of GAATG, for which TRF finds 23 bp or 26 bp repeat units and approximate multiples; these are identified by Repeat Masker as HsatII, based on comparison to a 59 bp consensus sequence listed in Repbase [22]. As expected, prominent arrays of these "classical" satellites are found in the pericentromeric regions of many chromosomes (Table 1). These repeats have been cytologically located to the heterochromatic blocks on chromosomes 1, 9 and 16 [23]. However, in the current assembly these poorly sequenced and assembled regions are riddled with gaps and do not reflect the large amounts of satellite DNA found there. An ~70 kb array of HsatII is seen on the proximal end of the long arm of chromosome 16, directly abutting the heterochromatic gap, and several arrays of HsatII are found in the chromosome 9 random fragments. On chromosome 1, only a small ~7.5 kb array of satellite DNA containing both GAATG and GAGTG is present flanking the heterochromatin, consistent with satellite III group II which has been previously described there [24]. An additional prominent ~100 kb array of GAATG is found at the distal edge of the Y long arm and is presumably representative of the large heterochromatic block on the Y chromosome.
Additional simple sequence repeats found in arrays larger than 10 kb include 17, 20 and 27 bp repeats identified by TRF consisting of diverged CCTTGn repeats as identified

Figure 1. Analysis of tandem repeats from the human genome. Output from Tandem Repeats Finder (TRF), plotted showing the repeat unit size on the X axis (log scale) and the array length on the Y axis (log scale). 24,358 arrays between 600 bp and 10,000 bp in length were found (grey squares). 503 arrays ≥ 10 kb found by TRF are shown classified into different types of repeats (see legend at top). Prominent "simple sequence" satellites are shown as color coded triangles. Classical satellites are shown as color coded circles. Single locus VNTR repeats are indicated by color coded diamonds. 373 arrays found at multiples of 171 bp repeat unit size represent alpha satellite DNA (purple circles). Arrays greater than 2 kb not found by TRF are also shown, and listed in Table 2. Some arrays containing repeat units greater than ~1.5 kb are also listed in Table 2 because they contain more complex repeat units than those listed in Table 1. Both the 1.5 kb NBPF repeats (square) and the 1.9 kb "mer5A1" repeats (square) were found by TRF but are listed in Table 2. Multiple LTR arrays (Table 3) are shown as red circles at a repeat unit size of 3.5 kb.

[25], are found at the centromere of chromosome 8 (somewhat interspersed with alpha satellite DNA). GSATX, listed in Repbase as a 1205 bp repeat [26], is found at Xp11 and 8q11. GSATII, listed in Repbase as a 216 bp repeat, is found in a >100 kb array in 12p11, sometimes mixed with GSATX repeats. Of these gamma satellites, only the GSATII arrays are found by TRF in arrays >10 kb.
An additional satellite repeat is hsat4, based on a 35 bp repeat [16] found in prominent arrays abutting the centromere at chromosome Xp, 16p and 19q, mixed with alpha satellite DNA, and in a 25 kb array just proximal to the 5sRNA repeats in chromosome 1q42.13.
Also apparent from the TRF output are 16 large arrays that consist of locus specific tandem repeats, ranging in repeat size from 20 bp up to 1823 bp. The largest array is the 246.4 kb array (reported by TRF at 218 kb) consisting of DYZ19, a tandem 125 bp repeat first reported by [19] in the euchromatic portion of the Y chromosome (Yq11.222). The inclusion of an array of this size in the finished human genome sequence likely reflects the heroic efforts made in sequencing the Y chromosome. The remaining arrays, which represent classical Variable Number of Tandem repeat (VNTR) arrays, are as large as 50 kb, and do not match known repeat classes as described by Repeat Masker. In several cases, TRF shows redundancy in these VNTR repeats, e.g. the 34 bp VNTR from chromosome band 13q34 is also found as a 101 bp repeat at the same location ( Figure 1).

Higher-order structure in large arrays of satellite DNA
Self-similarity dot plot analysis was performed for all large >10 kb tandem arrays (Table 1) in order to reveal any additional long range repeat unit structure (Figure 2). At low stringency (defined as a 30 bp window with a match of at least 60%), all tandem arrays showed a very dense pattern indicative of the homology between repeat units across the arrays, which was not seen for standard complex DNA at the boundaries of the tandem arrays (e.g. Figure 2A). This analysis revealed the presence of large scale inversions of the orientation of the repeat units in some arrays (Figure 2A, B). As "stringency" is increased (increased window size and % match), this dot plot analysis reveals distinct domains of repeat units with higher similarity within the larger array, shown for the large GSATII array on chromosome 12 (Figure 2C). Remarkably, some "simple sequence" arrays are characterized by highly homologous higher-order repeats (HORs) encompassing part or all of the arrays, as indicated by vertical or horizontal lines that are strongly visible on high stringency dot plots (Figure 2D-G). The first ~50 kb of the large HsatII array on chromosome 16 shows HORs of ~5.8 kb, with a reversal of orientation (Figure 2D). The distal ~60 kb of the large array of GAATG satellite in the pericentromeric region of the Y chromosome also contains 3360 bp HORs (Figure 2E). The ~100 kb array of GAATG seen in Yq21 also shows a distinct 3600 bp HOR across the entire sequenced array (Figure 2F). This 3600 bp HOR is consistent with the previously reported periodicity of restriction enzyme digests of human DNA [27]. Several of the VNTR arrays also demonstrated HOR structure, such as the 18 kb array of a 61 bp repeat in the pseudoautosomal Xp22.33, which shows irregular HORs of 1.6 and 2.5 kb in a complex arrangement (Figure 2G).
These distinct sequence domains and higher-order repeat structures in these "simple sequence" satellite DNA families were only revealed by examination of the large amount of contiguous sequence now available from the human genome.
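The windowed comparison behind a self-similarity dot plot can be sketched as follows. This is a naive illustration, not the software used for Figure 2: the function names are hypothetical, the defaults simply mirror the low stringency setting quoted above (30 bp window, at least 60% match), and the quadratic loop is only practical for short test sequences.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def dot_plot(seq, window=30, min_identity=0.6, step=1, inverted=False):
    """Return (i, j) window starts whose windows match at >= min_identity.

    With inverted=True, window i is compared with the reverse complement
    of window j, which picks up repeat units in the opposite orientation.
    """
    seq = seq.upper()
    starts = range(0, len(seq) - window + 1, step)
    hits = []
    for i in starts:
        a = seq[i:i + window]
        for j in starts:
            b = seq[j:j + window]
            if inverted:
                b = reverse_complement(b)
            matches = sum(x == y for x, y in zip(a, b))
            if matches / window >= min_identity:
                hits.append((i, j))
    return hits

# A perfect GAATG array: at 100% identity every hit is offset by a
# multiple of the 5 bp repeat unit.
tandem = "GAATG" * 8
hits = dot_plot(tandem, window=10, min_identity=1.0)
print(all((i - j) % 5 == 0 for i, j in hits))
```

Raising `window` and `min_identity` corresponds to the increase in stringency described in the text, under which only the more similar domains or higher-order repeats remain visible.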

Large tandem arrays in the human genome
The TRF analysis is limited to repeats less than 2 kb, and so alternative methods were used to identify tandem repeats larger than 2 kb from the human genome (see materials and methods). This revealed a set of 96 tandem repeat arrays (Figure 1, Table 2). These represented unique regions which were usually highly visible on the UCSC genome browser as repeating patterns in the Repeat Masker track and in dot plot analyses as characteristic dense patterns of horizontal lines (Figure 2H-L). The vast majority of these were described as regions that showed copy number variation (CNV) according to the databases included in the UCSC genome browser (Table 2), although these regions were not distinguished from other CNVs as containing tandem arrays. Some repeats that were less than 2 kb were included in Table 2 and shown in Figure 1 because they represented more complex repeats than those listed in Table 1. These included three arrays of the NBPF genes organized in ~1.5 kb repeats (Table 2B), and the largest exon of the hornerin (HRNR) gene characterized by distinct ~1.4 kb repeating units (Table 2B). Arrays of 5sRNA genes from chromosomes 1, 16 and X are included (Table 2A). Several arrays were also included because each repeat contains interspersed transposable elements, including a ~1.9 kb repeat characterized by a Mer5A1 element (Table 2D). Ten repeat families were found in multiple arrays, either located on the same chromosome (intrachromosomal) or on different chromosomes (interchromosomal), or both (Table 2). Chromosome band 8p23.1 has the highest concentration of tandem arrays (Table 2), and additional regions with multiple tandem repeats include chromosome 19q and the Y chromosome, both of which are known to be enriched in tandem repeats [19,28,29].
Several of these arrays appear in poorly assembled regions, characterized by sequence gaps within the array, e.g. the Gor1 repeats in 8q21.2 (Table 2A), or directly abutting them on the proximal or distal side, e.g. the NBPF gene family found in several arrays in 1q21.1 (Table 2B) [30]. Furthermore, several arrays may have assembly errors, suggested by the fact that overlap of assembled BAC clones is found precisely at the end of the array or at a change in orientation of the repeat units, e.g. the array at 5p15 (Table 2D). These gaps and poor assembly likely indicate significant variation in these regions. In order to further investigate this variation in a repeat array, the Gor1 repeats in 8q21.2 were examined, which are represented in the genome by ~6 copies of a 12 kb repeat found on either side of an 87 kb gap (Figure 3A). Pulsed-field gel (PFG) analysis was performed to assess the size and copy number variation in this repeat, using the PmeI restriction enzyme, which does not cut within the repeat unit and thus releases the whole array in a single large fragment (Figure 3B). A total of 9 arrays in two pedigrees were analyzed. These arrays were polymorphic and ranged in size from 600 kb to 1.7 Mbp, representing ~50-150 repeat units, but were stably transmitted through up to three generations. To our knowledge, this is the largest array of tandemly repeated DNA in the genome after the centromeric alpha satellite arrays.
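The conversion from fragment size to copy number quoted above is a simple division, sketched here for the 12 kb repeat. The function name and the optional `flank_bp` correction for non-repeat sequence between the restriction sites and the array edges are assumptions; the text only says that PmeI cuts close to the edge of the array, so the flank is taken as zero.

```python
def estimate_copies(fragment_bp, repeat_unit_bp, flank_bp=0):
    """Approximate number of repeat units in a restriction fragment.

    flank_bp is a hypothetical allowance for non-repeat DNA included in
    the fragment; it defaults to zero.
    """
    return (fragment_bp - flank_bp) / repeat_unit_bp

REPEAT_UNIT_BP = 12_000  # 12 kb repeat unit from the text

for fragment_bp in (600_000, 1_700_000):
    copies = estimate_copies(fragment_bp, REPEAT_UNIT_BP)
    print(f"{fragment_bp / 1000:.0f} kb fragment: about {copies:.0f} copies")
```

Under these assumptions the smallest and largest fragments correspond to roughly 50 and 142 repeat units, in line with the ~50-150 range given in the text.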
The large tandem repeat arrays in Table 2 fell into several categories depending on the organization of genes, if any, within them. Twenty tandem repeats contain a gene in most or every repeat unit, such that variation in the copy number of the tandem array would change the gene copy number (Table 2A), e.g. the CT47 gene found in 11 precise tandem copies (Figure 2H) [31]. The DUB3 gene is represented in the genome by 9 precise tandem copies (ending at a distal gap), and was previously found in meiotically unstable arrays of 20 to 103 copies [32]. Additional repeat arrays are wholly contained within a gene, such that each repeat unit represents an exon or protein domain (Table 2B), such as the DMBT gene [33] (Figure 2I), the NBPF genes (which contain the Duf1220 domain) [12,30], and the LPA gene [34] (Figure 2J). Some additional arrays may contain one or a few genes (or mRNAs or spliced ESTs), and several are contained within an intron of a gene (Table 2C).
Other arrays represent large classes of megasatellite DNA which do not contain genes, but are characterized by repeating patterns of interspersed transposable elements (TEs) (Table 2D). Figure 2K shows an example of a 5.4 kb repeat which contains a Mer33 DNA transposon, several LTR retrotransposons, and many SINE and LINE elements, found in two large arrays on chromosome 19q13. Figure 2L shows ~6 kb repeat units from band 4p11 that contain portions of LINEs, LTRs and DNA transposons, as well as ~1.6 kb blocks of the 147 bp Acro satellite DNA identified by Repeat Masker. The presence of horizontal and vertical lines on the dot plot indicates that there is an inversion within this array. FISH probes made from the non-transposon regions between the Acro satellite blocks (Figure 2L) hybridize strongly to the short arms of all acrocentric chromosomes as well as chromosome 4p and 3p (Figure 4B), indicating that it is this entire 6 kb repeat that is amplified on the acrocentric chromosomes. Other large arrays include the SST family of 2.5 kb repeats on chromosomes 4q28.3 and 19q13 (Table 2D) [35], which are listed in Repbase as 1563 bp repeats, with additional diverged arrays on chromosomes 7p11.2, 17p11.2 and 20p11.1. A 636 bp FISH probe made to the region between the 1563 bp "SST" repeats from chromosome 4 hybridizes specifically to the arrays on chromosomes 4 and 19 (Figure 4A). A large polymorphic 3 kb repeat (DXZ4) has been described on the X chromosome in an array of 50-100 copies, which does not contain any Repeat Masker identified TEs [36]. Another striking array at band 2q37.1 consists of 28 kb of diverged 89 bp repeats that are identified by Repeat Masker as a dense block of highly diverged fragments of the DNA transposon Mer20 (consensus 128-219 bp).

Figure 2. Dot plot analysis of tandem arrays reveals higher-order structure. For each dot plot shown, the type of repeat, chromosomal location and stringency (window size and % homology) are indicated. Black dots and horizontal lines represent tandem orientation, whereas blue dots and vertical lines represent inverted orientation. The Repeat Masker tracks for each region are shown below. A-G) Arrays are listed in Table 1

Figure 3. Analysis of large tandem array in 8q21.2. A) Information from the UCSC genome browser (hg18) showing the region containing the 12 kb tandem repeat from 8q21.2. This repeat array contains an "87 kb" gap with ~5 repeat units on the proximal side and ~1.5 repeats on the distal side. The repeats can be seen in the repeating patterns of the Repeat Masker tracks. The AF495523 (Gor1) gene is found once in each repeat unit. Copy number variation was detected at this repeat array using both BAC microarrays and fosmids. The restriction enzyme PmeI does not cut in the 12 kb repeats, but cuts close to the edge of the array in the genomic DNA sequence. The positions of the PCR amplified probes used on the Southern blot are indicated. B) Pulsed Field Gel analysis of the array size in two pedigrees (lanes 1-4, and lanes 5-10).

A large tandem DNA family consists of amplified LTR retrotransposons
One unique megasatellite family was observed which consists entirely of the MaLR class of LTR retrotransposons. This family is organized into ~3.5 kb monomeric repeat units containing fragments of both MSTA and THE-1 LTRs and internal open reading frames (MSTA-ints) in both orientations (Figure 5, Table 3), which have undergone expansion into a tandem array. These "LTR arrays" are found on 9 different chromosomes in 25 arrays, covering 460 kb of DNA (Table 3), and account for a significant portion of the MSTA LTR transposons in the genome. These arrays appear as solid blocks of LTR retrotransposons in the UCSC genome browser (Figure 5A), with 10 distinct arrays on chromosome 9, and the three largest arrays in the pericentromeric regions of chromosomes 13q, 18p and 21q (Table 3). FISH analysis using probes designed to conserved regions of these LTR array repeats reveals the 25 arrays, including small arrays seen in the genome browser with less than 2 repeat units (not included in Table 3) on chromosomes 9q22.1, 3q29, and 2q21 as well as additional arrays on chromosomes 14, 15, 20, and 22 (Figure 4C). Thus, this family is the largest most widespread megasatellite family yet described, and represents a novel arrangement and expansion of the abundant MaLR class of LTR retrotransposons. FISH analysis of 8 unrelated individuals showed no variation in the presence of LTR arrays on non-acrocentric chromosomes, but revealed significant variation in the presence and number of LTR arrays on the acrocentric chromosomes (Figure 4C). A centromeric array was seen on all 16 chromosome 13s examined, 7 of which also had an array on the short arm. A centromeric array was also seen on all 16 chromosome 21s examined, 4 of which also had a short arm array, 1 of which had a centromeric and a slightly more distal second array, and 3 of which had all three arrays. Seven chromosome 14s had only a short arm array, 3 had only a centromeric array, while 5 had both and 1 had no detectable array. Six of the chromosome 15s examined had only a short arm array, and 10 had no detectable array. Thus, the acrocentric chromosomes show up to three arrays resolvable by FISH (short arm, centromeric, and more distal) and a large amount of variability in the population. The presence of LTR arrays at the centromeres of chromosomes 13 and 21 in the genomic sequence is consistent with these results.

Figure 4. In situ hybridization of megasatellite DNA families. A) FISH using a 636 bp probe to the SST repeats from chromosome 4, which hybridizes to chromosome 19 (two arrays, Table 2D) and chromosome 4. B) FISH using a probe from the acro repeats from chromosome 4p11 (Table 2D), which hybridizes to pericentromeric regions of chromosomes 3 and 4, and the acrocentric chromosomes. C) FISH using a probe to the 3.5 kb repeats from the LTR arrays. Right: Additional acrocentric chromosomes from different individuals showing the variation in hybridization patterns.
These LTR arrays are in general embedded in large interrelated segmental duplications, which on chromosomes 13, 18 and 21 are complex highly related mosaics several hundred kb in size (data not shown). The large LTR arrays from chromosomes 13, 18 and 21 are each characterized by higher order repeat (HOR) structures consisting of six 3.5 kb monomeric repeats (Figure 5C), with distinct patterns of MSTA fragments repeated once per HOR (Figure 5D). In particular, each HOR on chromosomes 13, 18 and 21 is marked by insertion of another type of LTR transposon, a full length LTR6A (Figure 5C, D). These HORs are readily visible as bold solid horizontal lines on dot plots (Figure 5E). The dispersal of these HORs likely occurred due to the interchromosomal segmental duplications between chromosomes 13, 18 and 21. However, the HORs from each chromosomal array show unique rearrangements (Figure 5E). On chromosome 13, the third HOR is missing a 3.5 kb monomeric repeat unit, presumably due to unequal crossing over between HORs (as shown in Figure 5D). The array on chromosome 18 has a similar crossover event, but the monomeric repeat was deleted from the second HOR (Figure 5E). The array from chromosome 21 has a 3 monomer deletion in the third HOR. Thus, these unique HOR rearrangements on each array suggest that unequal crossing over events have been ongoing since the dispersal of the arrays to each chromosome.
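How unequal crossing over between misaligned HORs can remove a single monomer is easiest to see on a toy array. In this sketch each letter stands for one 3.5 kb monomer and "ABCDEF" for one six-monomer HOR; the labels, the breakpoints and the function name are hypothetical and are not taken from the sequence data.

```python
def unequal_crossover(a, b, i, j):
    """Exchange two arrays at breakpoint i in `a` and breakpoint j in `b`.

    When i != j the exchange is unequal: one product loses |i - j|
    monomers and the reciprocal product gains them.
    """
    return a[:i] + b[j:], b[:j] + a[i:]

# Three copies of a six-monomer HOR on each of two homologous arrays.
array = "ABCDEF" * 3

# Misaligned by one monomer within the second HOR.
deleted, duplicated = unequal_crossover(array, array, 8, 9)
print(deleted)     # second HOR has lost monomer C
print(duplicated)  # reciprocal product carries an extra C
```

A one-monomer offset gives a product in which a single HOR is one monomer short, the kind of pattern described above for chromosomes 13 and 18; a three-monomer offset would give the larger deletion described for chromosome 21.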

Discussion and conclusion
Tandemly repeated DNA represents a unique class of DNA in the human genome with unusual sequence organization in regular repeat units. We have presented here genomic analysis of the largest tandem arrays in the human genome (Figure 1). For example, analysis of the largest arrays of simple sequence "satellite" DNA in the genome (Table 1) revealed unexpected higher-order structure, including inversions, domains of homology, and extensive HOR structures (Figure 2A-G). HORs in "simple" sequence have been suggested by restriction enzyme periodicities seen as large bands on Southern blots, e.g. [27], but the size and extent of these HORs and the diverse sizes seen here for several different classes of satellites have not been previously described. Elucidation of such structures was only possible because of the large (up to several hundred kb) contiguous sequences available from the human genome sequence. The HOR structure of human centromeric alpha satellite has been implicated as important in centromere function [16]. By analogy, it is possible that the simple satellites that display HOR and domain structures (Table 1) may represent a subclass involved in chromosome function.

Figure 5. Analysis of the repeat unit structure of the LTR arrays from chromosomes 13, 18 and 21. Detail of the composition of monomers C and D indicating the MaLR LTR fragments that make up the repeat units, taken from the Repeat Masker output and numbered relative to the consensus for each element. The insertion of a full length LTR6A into monomer C can be seen. E) Self-similarity dot plots of LTR arrays from chromosomes 13q11, 18p11 and 21q11 at 50 bp windows and 90% homology. The HOR organization is revealed as bold solid horizontal lines, and is shown schematically by arrows below. Putative unequal crossing over events unique to each LTR array are revealed by the gaps and shifts of these lines, with deleted monomeric repeat units indicated below.
We have also examined arrays that contain larger repeats, generally greater than 2 kb, with some monomeric repeat units as large as 40 kb (Table 2). Tandem DNA represents an important source of copy number variation because the number of repeat units can vary by several fold at individual arrays, which for these large arrays can represent Mbps of DNA sequence. We have shown, for example, that the size of a tandem array consisting of 12 kb repeats at chromosome 8q21.2 can vary from 600 kb to 1.7 Mbp, even though it is represented in the genome as 5-6 repeats spanning an "87 kb" gap. Many of the repeats listed in Table 2 have been observed in much larger polymorphic arrays than are represented in the genome sequence, such as the Dux4 genes which are associated with FSHD [8,9]. Moreover, the majority of the tandem repeats are found in regions with detectable copy number variation, usually detected by aCGH with BAC arrays or sequence analysis of fosmids [37,38] and recorded in databases for human CNVs. However, detection methods such as aCGH do not clearly distinguish these large tandem arrays, which may vary by hundreds of copies, from other types of CNVs that are found in fewer copies, such as insertions and deletions. Ideally, the array size and copy number would be analyzed at each individual allele, in order to distinguish between, for example, two medium size arrays versus one large and one small array, which may have significant phenotypic differences. Currently, pulsed-field gel (PFG) analysis, as shown in Figure 3, has been used to distinguish individual alleles of large tandem DNA arrays.
Chromosome band 8p23 contains the highest concentration of large tandem repeats, which occur in the two repetitive proximal and distal regions (RepP and RepD, respectively) that flank a large 4.7 Mbp polymorphic inversion [39,40]. A 7 kb repeat found in 4 different arrays of up to 8 repeat units was seen in opposite orientations in RepD, and homologous single copies of this repeat are seen in RepP (data not shown). Two additional arrays consisting of 4.7 kb repeats are seen in both RepP and RepD, and share homology with the Dub3 array on 4p16 [32,41]. Whether these tandem repeats contribute to, or are a consequence of, the common inversions and duplications seen in 8p23 is not known.
The large tandem repeats listed in Table 2 have been categorized according to whether they contain genes or are contained within genes. Twenty arrays contain repeated genes with one gene per repeat unit (Table 2A). In these cases, variation in copy number of the tandem repeat would lead to different numbers of genes, which may affect dosage of the protein product and lead to phenotypic changes, such as observed for the AMY1 gene [11]. Twenty-three repeat arrays were wholly contained within genes (Table 2B), such that each repeat unit contained one or more exons representing protein domains [12]. In these cases, a change in copy number may represent a change in the number of these repeated domains present in a protein product, which may have a strong effect on function and phenotype. Eight other arrays were more difficult to categorize (Table 2C). Some appeared to contain single genes (or spliced ESTs or mRNAs), including small ncRNA sequences. In others the repeat array was contained entirely within an intron, which may represent a source of microRNAs involved in gene regulation, e.g. see FLJ43987 (Table 2C).
17 additional tandem arrays that did not contain genes were categorized as megasatellites because they are made up of repeat units that contain interspersed TEs. These presumably represent regions of the genome that accumulated TEs in a normal fashion over time, and were subsequently amplified into large tandem repeats. Each is distinguished by striking repetitive patterns in the REPEAT MASKER tracks of the UCSC genome browser and by distinct dot plots containing dense patterns of horizontal and/or vertical lines (Figure 2H-L). Several arrays appear in the REPEAT MASKER tracks as large dense blocks of LTR or DNA transposons. For example, the large 246 kb array of 125 bp repeats in Yq11.222 [19] appears in the LTR track as a solid block of 49 copies of highly diverged fragments of LTR12B. The 667 bp consensus of LTR12B listed in Repbase contains 3 tandem copies of an ~125 bp repeat that is ~85% similar to the 125 bp repeat units seen in this array, and thus it seems possible that this 125 bp repeat is derived from a hugely expanded portion of an LTR12B sequence. Another large array on chromosome 2q37 consists of 89 bp repeats with ~80% homology to positions 128-219 of the DNA transposon mer20. The beginning of this array contains a full length mer20 (positions 1-219) in the same orientation, and thus this array likely represents tandem expansion of the latter 89 bp of this DNA transposon into the large array.
The LTR arrays (Figure 4C, Figure 5) also illustrate novel composition, organization, and evolutionary processes in a tandem repeat. These arrays consist entirely of the MaLR class of LTR transposons, including both LTRs and internal open reading frames of the MSTA elements, organized into 3.5 kb monomeric repeat units. These have been dispersed to over 25 arrays on ~12 chromosomes, including the acrocentric chromosomes, with significant variation for the presence of arrays in both the short arm and pericentromeric regions (Figure 4C). On chromosomes 13, 18 and 21, the available DNA sequence reveals distinct HOR structure consisting of 6 monomeric repeat units, one of which contains an insertion of an LTR6A element (Figure 5C). The dispersal to multiple chromosomes may be related to the large segmental duplications in which these arrays are embedded. However, unequal crossing over events have occurred within these HORs since this dispersal to multiple chromosomes. Thus, we can putatively order a series of events that shaped these arrays: 1) the transposition of the MaLR elements and amplification into the 3.5 kb monomeric repeat units, 2) insertion of the LTR6A element into one of the monomers, 3) expansion into HORs containing 6 monomers, 4) dispersal to chromosomes 13, 18 and 21, and 5) unequal crossing over in the HORs on each chromosome.
In summary, we have compiled a collection of the largest tandem arrays from the sequenced human genome. We have demonstrated unexpected secondary higher-order structure in "simple sequence" DNA. We have added to current databases the list of tandem arrays with repeat units greater than 2 kb, including novel megasatellites that contain amplified patterns of interspersed transposable elements. The tandem repeats described in this paper represent an important resource for understanding the organization of the human genome.

Analysis of tandem repeats in the human genome
Tandem Repeats Finder was run on the human genome (hg18) at high sensitivity, and the primary results are available in the Tandem Repeats Database (TRDB), https://tandem.bu.edu [6]. Large repeats were examined in the UCSC genome browser and classified according to their assignment in the simple, low complexity, or satellite tracks of REPEAT MASKER [20] and their match in RepBase [21]. For larger repeats, a script was run that identified >250 regions where at least three highly similar interspersed repeats from the Repeat Masker SINE, LINE, LTR or DNA tracks were found at a similar distance from each other, with the distance set at a limit of ≥ 1.5 kb. These regions were further examined using the UCSC genome browser or by dot plots, and those that contained tandem repeats were identified and included for further study (Figure 1, Table 2). Self-similarity dot plots were generated using the MacVector Pustell DNA matrix.
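The periodicity search described above can be sketched in Python as follows. This is a minimal sketch and not the script used in this study. It assumes that "highly similar" can be approximated by interspersed repeats sharing the same name and strand, and that "a similar distance" means start-to-start spacing within a fractional tolerance. The `Hit` record, the `find_periodic_runs` function, the 5% tolerance, and the example coordinates are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One interspersed-repeat annotation (hypothetical record layout)."""
    chrom: str
    start: int
    end: int
    name: str
    strand: str


def find_periodic_runs(hits, min_spacing=1500, tolerance=0.05, min_copies=3):
    """Return (period, hits) for runs of at least `min_copies` same-name,
    same-strand hits whose start-to-start spacing is >= `min_spacing`
    and stays within `tolerance` of the first spacing in the run."""
    groups = defaultdict(list)
    for hit in hits:
        groups[(hit.chrom, hit.name, hit.strand)].append(hit)

    runs = []
    for members in groups.values():
        members.sort(key=lambda h: h.start)
        run, period = [members[0]], None
        for prev, cur in zip(members, members[1:]):
            gap = cur.start - prev.start
            fits = gap >= min_spacing and (
                period is None or abs(gap - period) <= tolerance * period
            )
            if fits:
                period = gap if period is None else period
                run.append(cur)
                continue
            if len(run) >= min_copies:
                runs.append((period, run))
            # Restart; the pair (prev, cur) may itself begin a new run.
            if gap >= min_spacing:
                run, period = [prev, cur], gap
            else:
                run, period = [cur], None
        if len(run) >= min_copies:
            runs.append((period, run))
    return runs


# Hypothetical annotations: four MSTA fragments spaced 3.5 kb apart,
# plus one unrelated element that should be ignored.
example = [
    Hit("chr13", 1000, 1400, "MSTA", "+"),
    Hit("chr13", 2000, 2300, "AluY", "+"),
    Hit("chr13", 4500, 4900, "MSTA", "+"),
    Hit("chr13", 8000, 8400, "MSTA", "+"),
    Hit("chr13", 11500, 11900, "MSTA", "+"),
]
runs = find_periodic_runs(example)
for period, run in runs:
    print(run[0].chrom, run[0].name, period, len(run))
```

Candidate regions found this way would still need inspection in the genome browser or by dot plot, since regularly spaced interspersed repeats do not by themselves establish a tandem array.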

Pulsed Field Gel Electrophoresis
PFG gels, Southern blotting, and hybridization were performed using standard protocols [42]. Briefly, cultured human EBV-transformed lymphoblastoid cells were embedded in agarose plugs and digested with proteinase K in 1% lauryl sarcosine, 0.5 M EDTA, pH 9.5. Restriction enzyme digestion was performed by equilibration of agarose plugs in restriction enzyme buffer at 4 °C and addition of restriction enzyme (PmeI). PFG gels (Figure 3) were run on a Biorad CHEF DRII using ramped pulse times from 120-320 seconds at 150 V for 46 hours, in a 1% agarose gel in 0.5× TBE buffer, and Southern blotted. Probes were designed from within the 12 kb repeat units using PCR primers from conserved regions, which amplified a 2 kb fragment (Gor1AL: TGGATGATCTGTGCCAGGTA and Gor1AR: AAGGAGAAGTCCCACCCAGT) and a 1.6 kb fragment (Gor1BL: GGGTAGAGGAGGGTGAAAGG and Gor1BR: TTGGAACCATGCGAGTGATA). These probes were designed in "unique" sequence and avoided any Repeat Masked sequences.

In situ hybridization
PCR primers were designed to conserved regions of repeat units using Primer3 (http://primer3.sourceforge.net/). PCR products to be used as FISH probes were designed to include only "unique" regions, which did not include any interspersed repetitive elements as identified by REPEAT MASKER. Because the primers target conserved regions, PCR amplification of the tandem DNA arrays amplifies multiple repeat units. Purified PCR products were labeled by nick translation incorporating biotin-dUTP. Fluorescent In Situ Hybridization (FISH) was performed using standard protocols [43]. For FISH analysis of LTR arrays, at least 5 metaphases were analyzed per person from primary lymphocyte cultures and were karyotyped using DAPI banding. PCR primers used to amplify FISH probes for LTR arrays were designed to two conserved regions of the 3.5 kb repeat units, and complementary PCR primers were designed that amplified the entire repeat unit in two ~1.7 kb fragments.
PCR primers were Pair 1: 1A: CTGTACCTGTGCATCTTTC and 2B: GGGAGTAGCCTGCTGCAGAGG, and the complementary pair 2A: GAAAGATGCACAGGTACAG and 1B: CCTCTGCAGCAGGCTACTCCC. To PCR amplify FISH probes for analysis of the repeat on 4p11 (Figure 4B), a primer pair that encompasses 1.1 kb from the right side of the acro repeat to the left side of the mer21 repeat (4p11-bL: GCTGGGTGATGGCAGTAAGA and 4p11-bR: ATCTGGAGCCACACCTTGAT) was used.
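The relationship between the two LTR-array primer pairs can be checked with a short Python sketch. It confirms that primer 2A is the reverse complement of 1A and that 1B is the reverse complement of 2B, which is consistent with the two pairs together covering the whole 3.5 kb repeat unit. The sequences are entered as continuous strings, and the helper name `reverse_complement` is an assumption for illustration.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


# Primer sequences for the LTR arrays, as listed in the text.
primers = {
    "1A": "CTGTACCTGTGCATCTTTC",
    "2B": "GGGAGTAGCCTGCTGCAGAGG",
    "2A": "GAAAGATGCACAGGTACAG",
    "1B": "CCTCTGCAGCAGGCTACTCCC",
}

# Each primer of the second pair sits on the same site as a primer of
# the first pair, but on the opposite strand.
print(reverse_complement(primers["1A"]) == primers["2A"])
print(reverse_complement(primers["2B"]) == primers["1B"])
```

Both comparisons evaluate to true for the sequences listed, so the second pair primes outward from the same two conserved sites used by the first pair.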