STRP Screening Sets for the human genome at 5 cM density

Background: Short tandem repeat polymorphisms (STRPs) are powerful tools for gene mapping and other applications. An STRP genome scan at 10 cM density is usually adequate for mapping single gene disorders. However, mapping studies involving genetically complex disorders, and especially association (linkage disequilibrium) studies, often require higher STRP density.
Results: We report the development of two separate 10 cM human STRP Screening Sets (Sets 12 and 52) which span all chromosomes. When combined, the two Sets contain a total of 782 STRPs, with average STRP spacing of 4.8 cM, average heterozygosity of 0.72, and total sex-average coverage of 3535 cM. The current Sets consist almost entirely of STRPs based on tri- and tetranucleotide repeats. We also report correction of primer sequences for many STRPs used in previous Screening Sets. Detailed information for the new Screening Sets is available from our web site.
Conclusion: Our new human STRP Screening Sets will improve the quality and cost effectiveness of genotyping for gene mapping and other applications.


Background
Since their discovery in 1988, multiallelic short tandem repeat polymorphisms (STRPs) (also called microsatellites or simple sequence length polymorphisms (SSLPs)) have been the polymorphisms of choice for linkage mapping and many other genetic studies.
Although there are hundreds of thousands of reasonably informative STRPs in the human genome [1,2], only a small fraction are optimal for genotyping and genome scans. Optimal properties of an STRP include: high heterozygosity, strong and specific PCR amplification, capability to be amplified simultaneously with other STRPs (multiplexed), sharp bands on gels, easy and accurate scoring of allele sizes, relatively low mutation rate, and appropriate position along the genetic map.
We have performed human genome polymorphism scans in our lab since 1989 [3]. Our first human Screening Set of STRPs developed in 1992 had an average STRP spacing of ~20 cM, no sex chromosome STRPs, and consisted almost entirely of dinucleotide repeat STRPs identified at Marshfield. Each subsequent Screening Set from our lab improved on the previous version by adding STRPs, by using more accurate genetic maps to make STRP spacing more uniform and to eliminate large gaps, and especially by replacing relatively low quality STRPs with superior ones. Typing better STRPs leads to higher data quality through fewer missing genotypes and fewer incorrect allele calls. Typing optimal STRPs also leads to lower genotyping costs by providing more information, by reducing the need for duplicate genotyping, by permitting the use of shorter gels (with lower resolving power but shorter run times), and by increasing the efficiency of allele calling.
We have replaced nearly all of the dinucleotide repeat STRPs in our Screening Sets with tri- and tetranucleotide STRPs. Although dinucleotide STRPs are abundant and meet many of the criteria for optimal STRPs, in our hands they are also more difficult to score accurately because of substantial strand slippage during PCR [4]. We also find that dinucleotide STRPs are more difficult to PCR multiplex than tri- or tetranucleotide STRPs.
Similarly, we have eliminated nearly all of the STRPs with frequent (> 2%) "non-integer" alleles. Non-integer alleles are defined as alleles whose lengths differ from those of the most frequent alleles by amounts other than integer multiples of the repeat length. For example, an allele of 221 bp (PCR product length) would be a non-integer allele for a tetranucleotide STRP with frequent alleles of 230, 226, 222, and 214 bp. Non-integer alleles are not typing artifacts, as they have been observed in many labs and have been confirmed by sequencing of individual alleles [5] (http://www.cstl.nist.gov/biotech/strbase). Non-integer alleles probably exist somewhere in the human population for all or nearly all STRPs, but a significant fraction of STRPs do not have frequent non-integer alleles.
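The definition of a non-integer allele above can be illustrated with a short sketch. This is only an illustration of the stated definition, not the scoring software used in the lab; the function name and the choice of the first frequent allele as the reference are assumptions made for the example.

```python
def is_non_integer_allele(allele_bp, reference_bp, repeat_len):
    """Return True if the allele differs from the reference allele by a
    length that is not a whole number of repeat units."""
    return (allele_bp - reference_bp) % repeat_len != 0


# Worked example from the text: a tetranucleotide STRP (repeat length 4 bp)
# with frequent alleles of 230, 226, 222, and 214 bp.
frequent_alleles = [230, 226, 222, 214]
reference = frequent_alleles[0]  # assumption: any frequent allele can serve as reference

print(is_non_integer_allele(221, reference, 4))  # True
print([is_non_integer_allele(a, reference, 4) for a in frequent_alleles])
# [False, False, False, False]
```

Because all frequent alleles in the example differ from one another by multiples of 4 bp, the result does not depend on which of them is chosen as the reference.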
We have also excluded or repaired STRPs with weak or null alleles. In most, if not all, cases, weak and null alleles appear to be due to substitution polymorphisms within the primer annealing sites [6]. They can be repaired by sliding the offending PCR primer to a nearby position along the chromosome.
For most applications of genome polymorphism scans, higher STRP densities are preferable. This is particularly important for gene mapping by association. While analysts have predicted that very high polymorphism densities will be required for association mapping in mixed or outbred populations [see for example reference [7]], promising results have been obtained using genome scans of 600-1200 STRPs in isolated populations where levels of linkage disequilibrium are particularly high [8][9][10]. In this manuscript we describe the development of two new 10 cM human STRP Screening Sets (Sets 12 and 52) which when combined provide average STRP spacing of 4.8 cM.

Building new human Screening Sets
Over roughly the last decade we have produced at Marshfield twelve separate but related 10 cM Screening Sets of STRPs for the human genome (see Table 1 and http://research.marshfieldclinic.org/genetics). For each of these Sets, the lowest quality STRPs in the previous Set were replaced with superior ones. Of the most recent collections, Sets 6, 7, 10, and 11 were major overhauls, with 21-52% of the STRPs replaced (Table 1). Sets 5 and 8 were described in the literature [11,12]. Beginning particularly with Set 6, many of the dinucleotide STRPs were replaced with tri- and tetranucleotide STRPs from the Cooperative Human Linkage Center (CHLC) [13]. CHLC STRPs still comprise 81% and 55% of our current Sets (Sets 12 and 52, respectively). Starting in about 2001, the availability of the human genomic draft sequence greatly expanded the number of STRPs from which to choose. Sets 12 and 52 contain 15% and 44%, respectively, newly derived STRPs from the genomic sequence.  [15] for the presence of variable tri- and tetranucleotide STRs. We focused efforts on AAT and AGAT repeats because these sequences are known to be abundant and to yield useful polymorphisms [13,16].
New PCR primers selected from the sequences flanking the tandem repeats were tested by amplification with ten individual DNA samples and one DNA pool using incorporation of a nucleotide tagged with a fluorescent dye (see Methods). PCR primers labelled with a fluorescent dye at the 5' end were then synthesized for those STRPs which displayed ≥ 4 alleles in the first screen. These were combined with existing CHLC and Utah STRPs, and were used to screen 12 individuals and one pool. All donors of DNA samples used in these first two screens had Northern European ancestry. STRPs which passed these first two hurdles were then used in genome scans within the Mammalian Genotyping Service (see Marshfield web site) using hundreds of DNA samples from various geographical locations.
Only 11% of the 2262 genomic STR sequences that were screened were included within Sets 12 and 52. The great majority of excluded STRPs were rejected because of limited numbers of alleles (low informativeness). About 9% were rejected because of the presence of frequent non-integer alleles. We found that use of candidate genomic sequences with larger numbers of uninterrupted tandem repeats and use of overlapping BAC sequences with alleles which differed by two or more repeats led to higher rates of STRP inclusion into the Screening Sets. Information on all of the STRPs that we found to be polymorphic can be obtained from the comprehensive list of indel polymorphisms on the Marshfield web site.
We also improved the amplification efficiency of Screening Set STRPs. Most of the human STRPs developed in the early and mid-1990s were based on relatively crude, single-pass sequencing of genomic DNA subclones. Comparison of the PCR primer sequences for Set 10 STRPs with the new public genomic sequences revealed that a surprisingly high 25% of the STRPs had mismatches in at least one of the primers (an example is shown in Figure 1). Nearly all of the mismatches were near the middle or 5' ends of the primers. New primers designed using the public genomic sequences were then tested side by side with old primers. At 55°C annealing temperature and no PCR multiplexing, few differences were observed between the old and new primer pairs, but under more stringent conditions (60°C annealing temperature), 79 STRPs were found to amplify better with the new primer pairs (two examples are shown in Figure 2).
STRP alleles are usually identified and labelled by the length of the PCR product as measured on denaturing polyacrylamide gels. Only in a handful of cases has the full spectrum of STRP alleles been sequenced. Therefore, STRP alleles are referenced to allele sizes for standard DNA templates (we use the parents of CEPH family 1331 available from the NIGMS Human Genetic Cell Repository). Allele sizes will also, of course, often change if the PCR primer sequences for a polymorphism are altered. To avoid null and weak alleles, to prevent the formation of doublet bands during PCR [17,18], and to achieve optimal PCR product length, we have modified original primer sequences for a substantial fraction of our Screening Set STRPs. We have used several different letters following the STRP name to indicate changes in PCR primers (see Marshfield web site). As two examples for STRPs on chromosome 1 in Set 12: GATA26G09N indicates that one of the original primers for GATA26G09 was changed to correct a sequencing error without change in allele sizes, and GGAA3A07Z indicates that one of the primers for GGAA3A07 was shifted along the chromosome, resulting in different allele sizes. Current PCR primer sequences for all Screening Set STRPs are listed on the Marshfield web site along with allele sizes for individuals 133101 and 133102.

Genetic Map Positions
Initially, new STRPs were selected and incorporated into our Screening Sets based on physical distances obtained from the December 2000 UC-Santa Cruz draft sequence assembly. However, we soon found that the draft assembly contained many errors [see for example reference [19]], which resulted in many STRPs being placed in the wrong map positions. To correct these mistakes, we utilized the most recent (June 2002) sequence assembly in addition to linkage analysis using three large sets of families. In all cases except one (4ptel04), the linkage results matched the June 2002 assembly in terms of STRP order (we assumed the linkage results were correct for 4ptel04). Our confidence in STRP order is therefore high.
With one exception on chromosome 6p (see below), genetic map positions for the Screening Set STRPs were taken from the most recent Marshfield map [20] or obtained by interpolation using the Marshfield map and the genetic and physical map positions described in the previous paragraph. Although the new Iceland genetic map [19] is higher resolution than the Marshfield map, a large fraction (62%) of the Screening Set 12 and 52 STRPs were not typed in the Iceland families. We did, however, check STRP order for all STRPs that were typed in the Iceland families and found no disagreements with the Marshfield map, except for two close (~1 Mb apart), adjacent STRPs on chromosome 6p, ATA50C05 and ATC4D09, where the linkage results, the Iceland map and the June 2002 sequence assembly all disagreed with the Marshfield map.

Figure 1
Correction of PCR Primer Sequences using Genomic Sequence Assemblies. The original single pass sequence for GATA87E02 is aligned with the sequences from several BACs containing overlapping genomic DNA. The original reverse PCR primer mismatched the BAC sequences near its 3' end. Note that because the great majority of the public human genomic sequence was generated from BAC libraries prepared from just a few donors, it is possible that two or even all three of the BAC sequences shown in the figure came from the same chromosome.

Characterization of Sets 12 and 52
Numbers of STRPs, heterozygosity values, and sex-average genetic map properties for Screening Sets 12, 52, and 12 plus 52 combined, broken down by chromosome, are displayed in Table 2. Of the 39 total X chromosome STRPs in the combined Sets, 3 (GATA2A12, GGAT3F08, and GATA42G01) are in the pter pseudoautosomal region, and 1 (SDF1) is in the qter pseudoautosomal region. The 9 Y chromosome STRPs are all male-specific. Also, two small, tightly-spaced clusters of STRPs are included in Set 12 (six STRPs near the centromere of chromosome 11 and three STRPs on the short arm of chromosome 1) for the purpose of gauging linkage disequilibrium. Genetic distances for the autosomes are sex-average, and for the X chromosome are female (except for pseudoautosomal regions).
Set 12 STRPs with overall average heterozygosity of 76% are more informative than Set 52 STRPs with overall average heterozygosity of 67%. At least part of this difference may simply be a reflection of the populations used to deduce these values (see Methods). As shown in Table 2, X and especially Y chromosome STRPs had lower average informativeness than autosomal STRPs.
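For readers unfamiliar with the measure, heterozygosity for a single STRP is commonly estimated from observed allele frequencies. The sketch below uses the standard expected heterozygosity, one minus the sum of squared allele frequencies; that this is the estimator behind the values reported here is an assumption, and the allele sizes are invented for illustration.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity, 1 - sum(p_i^2), from a list of observed
    allele sizes (one entry per typed chromosome)."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical sample of 10 chromosomes at one tetranucleotide STRP.
sample = [222] * 5 + [226] * 3 + [230] * 2
print(round(expected_heterozygosity(sample), 2))  # 0.62
```

A Set-level average such as those quoted above would then be the mean of this quantity over all STRPs in the Set.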
The average, sex-average STRP spacing of the combined Sets was 4.8 cM. The maximum gaps are 18.4, 37.5, and 15.5 cM for Set 12, Set 52 and Sets 12 and 52 combined, respectively. There were 13 gaps ≥ 15 cM in Set 12, 51 such gaps in Set 52, and 29 gaps ≥ 10 cM in Sets 12 and 52 combined. Set 12 STRPs were generally closer to telomeres than Set 52 STRPs, resulting in greater total chromosomal coverage.
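The spacing statistics above, and the benefit of combining two interleaved Sets, can be sketched for a single chromosome as follows. The map positions are invented for illustration, and the summary (mean of adjacent intervals, largest interval, count of intervals at or above a threshold) is an assumed simplification of how such figures are tabulated.

```python
def spacing_summary(positions_cm, gap_threshold_cm=10.0):
    """Summarize spacing between adjacent markers on one chromosome."""
    ordered = sorted(positions_cm)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        "mean_spacing": sum(gaps) / len(gaps),
        "max_gap": max(gaps),
        "n_large_gaps": sum(g >= gap_threshold_cm for g in gaps),
    }


# Hypothetical sex-average map positions (cM) for two ~10 cM Sets.
set_a = [0.0, 11.0, 19.0, 32.0]
set_b = [5.0, 14.0, 26.0, 36.0]

alone = spacing_summary(set_a)
combined = spacing_summary(set_a + set_b)
print(alone["max_gap"], alone["n_large_gaps"])        # 13.0 2
print(round(combined["mean_spacing"], 2), combined["max_gap"])  # 5.14 7.0
```

In this toy example, interleaving the two Sets roughly halves the mean spacing and removes every gap of 10 cM or more.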
A summary of repeat length in the Screening Set STRPs is presented in Table 3. Only 3 dinucleotide STRPs remain in Set 12. Fourteen pentanucleotide STRPs were also included in the combined Sets.
Breakdown of the Screening Set STRPs by repeat type is shown in Table 4. STRPs with AGAT and AAT repeats together accounted for 83% of the STRPs in the combined Sets. Note that because of permutation and the complementary strand there are several names for each repeat type. As just one example, AGAT repeats can also be presented as GATA, ATAG, TAGA, ATCT, TATC, CTAT, and TCTA repeats. Following the suggestion of Jin et al. [21] we have chosen the alphabetically minimal name.
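The naming rule just described (take every cyclic permutation of the repeat unit on both strands and keep the alphabetically first) can be written as a short sketch. The function name is hypothetical; the logic follows the rule as stated in the text.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_repeat(unit):
    """Return the alphabetically minimal name among all rotations of the
    repeat unit and of its reverse complement."""
    unit = unit.upper()
    reverse_complement = unit.translate(_COMPLEMENT)[::-1]
    rotations = [
        strand[i:] + strand[:i]
        for strand in (unit, reverse_complement)
        for i in range(len(strand))
    ]
    return min(rotations)


# The eight equivalent names listed in the text all reduce to AGAT.
names = ["AGAT", "GATA", "ATAG", "TAGA", "ATCT", "TATC", "CTAT", "TCTA"]
print({canonical_repeat(n) for n in names})  # {'AGAT'}
print(canonical_repeat("GGAA"))              # AAGG
```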
We found that AAT repeats in particular, have a relatively low level of non-integer alleles. For example, within Set 10, 11.1% of GGAA and 10.2% of AGAT STRPs had frequent non-integer alleles, compared to only 1.8% of AAT STRPs. Because of high rates of non-integer alleles, STRPs with purines on one strand and pyrimidines on the other (eg AAGG) were avoided even though they are reasonably abundant and often especially informative [13].
Association of Screening Set STRPs with interspersed repeat elements (IREs) is shown in Table 5. STRPs were considered to be associated with IREs if the IRE fell in the 50 bp flanking the STR on either side (total of 100 bp of flanking sequence). Although total numbers for some of the STR types are relatively small, it appears that each type of STR has its own particular signature of IRE association. For example, AAAT STRs are very often (86%) associated with Alu elements, consistent with the hypothesis that most of these repeats evolved from the polyA tail of Alus [22]. An unexpectedly large fraction of AGAT STRs (16%) were found to be associated with LTRs. The results in Table 5 may generally provide clues about the evolution of STRs.
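The association criterion can be sketched as an interval-overlap test on the two 50 bp flanking windows. This is a minimal illustration: the coordinates and repeat annotations are invented, half-open coordinates are assumed, and treating any overlap with a flanking window as "association" is an assumed reading of the criterion rather than a description of the actual Repeat Masker post-processing.

```python
def ire_classes_near_str(str_start, str_end, ire_hits, flank=50):
    """Return the set of interspersed-repeat classes that overlap either
    flanking window of a tandem repeat.

    str_start, str_end: half-open coordinates of the tandem repeat.
    ire_hits: iterable of (start, end, repeat_class), half-open coordinates.
    """
    windows = [
        (max(0, str_start - flank), str_start),  # left flank
        (str_end, str_end + flank),              # right flank
    ]
    found = set()
    for start, end, repeat_class in ire_hits:
        if any(start < w_end and end > w_start for w_start, w_end in windows):
            found.add(repeat_class)
    return found


# Hypothetical annotations around a tandem repeat at positions 300-348.
hits = [(10, 290, "Alu"), (400, 700, "L1"), (390, 500, "LTR")]
print(sorted(ire_classes_near_str(300, 348, hits)))  # ['Alu', 'LTR']
```

A sequence can return more than one class, which is why per-class counts in a tabulation of this kind need not sum to the total number of STRPs.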

Discussion
Development of human STRP Screening Sets has paralleled advances in construction of genetic and physical maps. Except in regions with long inversion polymorphisms [23], it should soon be possible to specify STRP order within Screening Sets with near certainty. However, because of individual and even possibly population differences in recombination rates [24][25][26], it may never be possible to specify genetic distances between STRPs with high precision.  Similarly, it would also be helpful to carry out extensive sequencing of at least the frequent alleles for each Screening Set STRP. This would eliminate the need to approximate allele sizes. However, this would also be a large and expensive project, and may have to wait until sequencing costs drop so that many human genomes from around the world can be sequenced.
Although nearly all Screening Set STRPs are at least modestly polymorphic in all human populations examined to date, this does not guarantee that they will be free of frequent non-integer alleles or weak or null alleles in some populations. For example, we have observed apparent null alleles for some STRPs in Chinese that were not present in Europeans (eg GATA29A01 on chromosome 6 and GGAA20G04 on chromosome 2). We have also observed non-integer alleles in Sub-Saharan Africans that have not been seen at appreciable frequency in other populations (eg GATA104 on chromosome 7 and GATA11A06 on chromosome 18).
Despite having much higher mutation rates than diallelic polymorphisms, there is abundant evidence that highly informative STRPs of the type found within Screening Sets are generally powerful markers for detection of linkage disequilibrium [eg [31,32]]. It is unclear, however, whether dinucleotide or tetranucleotide STRPs are superior in this regard. Experimental evidence seems to favour higher average mutation rates for tetranucleotide STRPs [33], while theoretical results favour higher average rates for dinucleotide STRPs [34]. Analysis of STRPs typed in CEPH reference families for construction of human genetic maps revealed that the fraction of dinucleotide/dinucleotide STRP pairs < 200 kb apart with linkage disequilibrium at p < 0.01 was 18.7%, whereas the fraction for dinucleotide/tetranucleotide pairs was 7.9% and for dinucleotide/trinucleotide pairs was 22.1% (Broman K, Weber J unpublished results).
Many of the Set 12 and 52 STRPs are superior to the thirteen STRPs used routinely in forensic DNA testing in the U.S. http://www.cstl.nist.gov/biotech/strbase/fbicore.htm. Several of the thirteen forensic STRPs have frequent non-integer alleles. Several are not especially informative. Five of the thirteen forensic STRPs are currently included within Set 12. This occurred by chance rather than design. If genome polymorphism scans for either research or clinical purposes become widespread, then overlap between our Screening Sets and forensic Sets will have to be carefully considered.
Although our newest Screening Sets are substantial improvements over previous versions, they are still not perfect. Some STRPs have lower informativeness than desired, and some large gaps in coverage remain. The Set 12 STRPs are generally superior to Set 52 polymorphisms because Set 52 is new. There has not yet been a chance to make many replacements.
We will continue to make improvements in our human STRP Screening Sets and to post upgrades on the Marshfield web site. But are there limits to the quality of STRP Sets? The answer is undoubtedly yes. There are only approximately 65,000 modestly to highly informative tri- and tetranucleotide STRPs in the human gene pool [1,2]. Within some ~1 Mb regions of the genome, we have already exhausted all likely tri- and tetranucleotide STRP candidates. Only a small fraction (11%) of the new STRPs we screened from the genomic sequence were selected for the new Sets. It is quite conceivable that over the next decade or two we will characterize all human STRPs that have reasonable informativeness. Resequencing different human genomes will undoubtedly contribute much to this effort.
Quite a few investigators have speculated that diallelic polymorphisms such as SNPs or diallelic indels will supplant STRPs in human Screening Sets. Our position continues to be that this question will likely be ultimately determined by typing costs [4]. STRPs provide much more information than diallelic polymorphisms, so diallelic typing costs would need to drop well below those for STRPs. This might happen, but it hasn't yet, and it's not clear that it ever will. There may also be advantages to including both high and low mutation rate polymorphisms within Screening Sets (ie STRPs and diallelics) [35]. In any case, we believe that our STRP Screening Sets will continue to be highly valuable and widely used for many years.

Table 5 footnote: STRPs were screened using Repeat Masker for IREs that are within 50 bp in either direction of the short tandem repeats (excluding the tandem repeats). Sums of the numbers in the columns do not match the totals because some sequences had two different interspersed repeats within the 100 bp.

Conclusions
The development of Screening Sets 12 and 52 will improve gene mapping in general, and specifically genome scans where a relatively high STRP density is required.
Complete information on all of our Screening Sets is freely available from the Marshfield web site http://research.marshfieldclinic.org/genetics along with lists of over 200,000 candidate and confirmed human indel polymorphisms, both multi- and diallelic. We plan to continue to improve our human STRP Screening Sets until we have exhausted all available STRPs at specific chromosomal sites.

Methods
Identification of candidate polymorphisms
Two different approaches were used to search for new polymorphisms. One approach was to use overlapping BAC genomic sequences to select polymorphisms that varied by more than two repeats [15]. The other approach was to browse the genome for STRs using the December 12, 2000 version of the genomic sequence at University of California-Santa Cruz http://genome.ucsc.edu/ [36].
Once a sequence containing the desired polymorphism was selected (usually 400-700 bp in length), it was run through the Repeat Masker program http://ftp.genome.washington.edu/cgi-bin/RepeatMasker in order to avoid selecting PCR primers within Alu, L1, or other repeats. The Primer 3 program http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi was used to select PCR primers. Candidate sequences which did not permit the placement of at least one PCR primer within unique sequence (ie outside of a repeat identified by Repeat Masker) were not tested further. In cases where one PCR primer was located within a repeat, the primer from within the unique sequence was tagged with a fluorescent dye.

Sequence Alignments
All of the 406 single read STRP sequences from Set 10 were Blasted against genomic sequences from the public labs. For nearly all STRPs, we identified 1 to 3 BACs that showed high homology (Blast criteria were score (bits) > 200, expect (E) value < e-50, and ratio of matched bases to STRP sequence length > 85%). Two different multiple alignment programs were then used to align the single read and the genomic sequences: "multalin" http://protein.toulouse.inra.fr/multalin/multalin.html and "clustalw" http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html.
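The three Blast acceptance criteria above can be expressed as a simple filter. This is a sketch under stated assumptions: the `Hit` record, its field names, and the example values are hypothetical and do not correspond to any particular Blast output parser; only the three thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str        # hypothetical BAC identifier
    bit_score: float    # Blast score in bits
    e_value: float      # Blast expect value
    matched_bases: int  # number of identical aligned bases


def passes_criteria(hit, query_len):
    """Apply the thresholds from the text: score > 200 bits, E < 1e-50,
    and matched bases covering more than 85% of the STRP sequence."""
    return (
        hit.bit_score > 200
        and hit.e_value < 1e-50
        and hit.matched_bases / query_len > 0.85
    )


# Hypothetical hits for a 400 bp single read STRP sequence.
hits = [
    Hit("BAC_A", 650.0, 1e-180, 392),
    Hit("BAC_B", 310.0, 1e-80, 210),   # fails the 85% match ratio
    Hit("BAC_C", 150.0, 1e-35, 360),   # fails score and E value
]
print([h.subject for h in hits if passes_criteria(h, query_len=400)])  # ['BAC_A']
```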

Screening of candidate polymorphisms
For initial screening of the PCR primers, we incorporated a dye-labelled nucleotide with a two-step PCR protocol. Briefly, the first step contained 10 mM Tris-HCl (pH 8.3), 50 mM KCl, 1.5 mM MgCl2, 0.001% gelatin, 250 µM each dNTP, 4.7 µM of the forward and reverse primers, and 0.15 units of Taq polymerase (Roche) in a total 5 µl reaction volume. The second reaction had the same components and volume as the first step, except that the forward primer was present at 6.2 µM and R6G dUTP (Applied Biosystems) at 0.5 µM, with no reverse primer. About 0.5 µl of step 1 PCR product was used as the DNA template for step 2 PCR. Each PCR step was initiated with a 95°C soak for 4 min, followed by 30 and 25 cycles for steps 1 and 2, respectively, consisting of 95°C for 40 sec, 55°C for 75 sec, and 72°C for 40 sec, and ended with a final extension of 7 min at 72°C. An equal volume of loading solution composed of EDTA (10 mM) and Orange G dye (13.6 mM) (Sigma) dissolved in formamide was added to the reaction following PCR, and 0.6 µl of the product was fractionated on denaturing acrylamide gels (6.0% acrylamide, 7.7 M urea, 89 mM Tris, 89 mM borate, 2.5 mM EDTA, pH 8.3).
For use of fluorescent-labelled primers, 45 ng of template DNA was dried in the wells of 96 well polypropylene plates. PCR amplifications were carried out in a 4 µl volume containing 10 mM Tris-HCl (pH 8.3), 50 mM KCl, 1.5 mM MgCl2, 0.001% gelatin, 100 µM each dNTP, 0.075 µM of fluorescent-labelled forward and unlabelled reverse primer, and 0.12 units of Taq polymerase. PCR amplification was carried out for 27 cycles with the same times and temperatures as listed above.

Genetic map positions for new STRPs
Genetic distances for the new STRPs were obtained by typing the STRPs in several projects with large numbers of European families. The CRIMAP program was used to order the STRPs and to deduce genetic distances. In order to fit new STRPs into the Marshfield map [20], approximate genetic values were obtained by interpolation using the new sex-average genetic distances and the Marshfield map genetic distances for two flanking, older STRPs. In rare instances, when no neighbouring STRPs with known Marshfield map distances were available, the genetic distances were extrapolated from physical distances from the UC-Santa Cruz sequence assembly, June 2002 version. For the X-chromosome analysis, female genetic distances were used in place of sex-average genetic distances. Heterozygosity values were determined by typing STRPs in two different population groups. For Cooperative Human Linkage Center (CHLC) and Utah STRPs in Set 12, heterozygosity values were deduced by typing the STRPs through several populations of different ethnic groups (African, Asian and European), whereas for newly developed STRPs in Set 12 and all the STRPs within Set 52 (newly developed, CHLC, and Utah STRPs), a European population was used. Heterozygosity estimates of the Set 10 (and many Set 12) STRPs are also available from genotyping of the Human Diversity Panel [see Marshfield web site and reference [27]].
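The step of fitting a new STRP into the Marshfield map from two flanking, older STRPs can be sketched as a proportional placement. The exact formula is not given in the text, so the linear scaling below is an assumption, and the function name and numerical values are hypothetical.

```python
def interpolate_map_position(left_ref_cm, right_ref_cm,
                             dist_left_to_new, dist_new_to_right):
    """Place a new marker between two flanking markers on a reference map.

    left_ref_cm, right_ref_cm: reference-map positions of the flanking markers.
    dist_left_to_new, dist_new_to_right: newly estimated genetic distances
    from the new marker to each flanking marker.
    Assumption: the new marker divides the reference interval in the same
    proportion as it divides the newly estimated interval.
    """
    total = dist_left_to_new + dist_new_to_right
    if total <= 0:
        raise ValueError("flanking distances must sum to a positive value")
    fraction = dist_left_to_new / total
    return left_ref_cm + fraction * (right_ref_cm - left_ref_cm)


# Hypothetical example: flanking STRPs at 40.0 and 50.0 cM on the reference
# map; new linkage data place the new STRP 3.0 cM from the left marker and
# 9.0 cM from the right marker.
print(interpolate_map_position(40.0, 50.0, 3.0, 9.0))  # 42.5
```

Scaling by proportion rather than adding the new distance directly keeps the placed marker inside the reference interval even when the new family data give a different total interval length than the reference map.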