Marker development
Single nucleotide polymorphism markers were developed from bovine EST sequence as described [11]. Briefly, tentative consensus (TC) clusters of bovine EST were obtained from The Institute for Genomic Research (TIGR) Bovine Gene Index[24]. Repeat elements in the TC were masked using RepeatMasker [25] and aligned with the human genome draft sequence via BLASTN [26]. Alignments were checked for the presence of apparent introns using a perl script that computed the predicted intron size. Primers were designed from the bovine EST sequence in such a way as to cross introns and produce products of approximately 800–1300 bp, while including at least 100 bp of putative exon sequence to allow confirmation that the primers targeted the desired gene.
Alternatively, SNP for some genes were developed from sequence associated with bacterial artificial chromosome (BAC) clones carrying all or part of the target gene(s) essentially as described [27]. Briefly, high density filters representing the CHORI-240 BAC library (P. deJong, personal communication) were screened with radiolabeled insert from EST clones representing target genes as recommended by the manufacturer (BACPAC resources, Oakland, CA). Positive clones were partially digested using the restriction enzyme Sau3AI (Promega, Madison, WI) by incubation of isolated BAC DNA. Aliquots of the reaction were removed at 10, 20 and 30 minutes of incubation into 10 mM final concentration EDTA on ice, desalted by gel filtration column (Axygen, Union City, CA), and separated on a 0.8% agarose gel. Fragments of 0.8–1.5 kilobase (kb) were excised from the gel and purified by ion exchange column as directed by the manufacturer (Marligen, Ijamsville, MD). Isolated fragments were cloned into pBluescript vector (Clonetech, Palo Alto, CA) prepared by digestion with restriction enzyme BamHI, and 192 randomly selected subclones for each BAC were picked into 80 ul LB media supplemented with ampicillin at 50 ug/ml in 384-well plates for sequencing from both ends with vector primers. A total of 134 BACs were screened, representing 122 loci (approximately 0.3% of the bovine genome).
Resulting sequences were analyzed using Phred, Phrap, and Consed programs [28] and amplification primers were designed using the autofinish and primer3 options of Consed. Primers derived from EST sequence or BAC subclones were used to amplify DNA from four bulls that were the sires (two Bos taurus × Bos taurus; two Bos taurus × Bos indicus) of the MARC reference panel mapping families [12], and the PCR products were sequenced with the amplification primers to identify heterozygous positions (SNP) in the amplicons using Polyphred and Consed. The SNPs were genotyped in progeny and mates of heterozygous sires from the MARC reference population using the Sequenom MassArray System [29] following established procedures [11].
Map construction
Software and procedures
Map construction was an iterative process completed in cycles. Cycles were initiated when genotypes for new markers, or corrections to previously genotyped markers, were recorded for animals in the MARC reference population. CRIMAP 2.4 [14], modified to reduce occurrence of unsolvable marker orders and controlled by a series of Perl scripts, was used to assign markers to linkage groups, order markers in each linkage group, and identify possible genotyping errors. All markers genotyped in the reference population were considered in map construction, including SNPs developed for this project as well as microsatellites, SNPs and other types of markers with recorded genotypes.
The modifications to CRIMAP included redimensioning arrays to accommodate a larger number of markers, and using logarithmic arithmetic for intermediate calculations to avoid values exceeding the precision limits of the 32-bit CPUs used to solve the maps. The perl scripts were developed to construct sets of alternative marker orders necessary at each step, distribute the needed CRIMAP fixed executions to nodes of a Linux cluster, and identify the most likely order from the set of orders evaluated in each step.
Mapping data sets
Two data sets were used to construct maps. Final marker order and map distances were computed from the complete reference population pedigree and genotype data. A reduced data set was used to initially place new markers and order the maps. The reduced data set was constructed by eliminating progeny genotypes where the progeny and both parents had identical heterozygous genotypes. These ambiguous, like-heterozygous, genotypes provided little information about recombination, although phase of inheritance and recombination are inferred from linked markers with unambiguous genotypes. Genotypes for about one-half of the markers included at least one ambiguous genotype, but fewer than 2.5% of the total number of observed genotypes were eliminated from the reduced data set. Including these ambiguous genotypes increased the number of calculations and computer memory necessary to compute likelihood of a particular marker order. Likelihoods of certain marker orders, usually involving several markers with ambiguous genotypes ordered consecutively, could not be solved using the complete data set. When the reduced data set was employed, uncomputable orders were eliminated and the time required to solve individual likelihoods reduced, making it feasible to evaluate a larger number of alternative marker orders.
Linkage group assignments
A map construction cycle began by extracting genotypes and pedigree data from the MARC database, and formatting the full and reduced data sets. Genotypes exhibiting non-Mendelian inheritance patterns were detected by the CRIMAP prepare option. Two-point analyses, with the complete data, were conducted to assign newly genotyped markers to linkage groups representing entire chromosomes. The two-point analyses established linkage between the new markers and subset of markers from the 1997 MARC map [4], selected to contain the most informative marker within each 5 cM interval. A LOD score greater than 3.0 was required to assign a new marker to a linkage group. Markers were assigned to the linkage group containing the marker with the highest two-point LOD score with the lowest recombination fraction.
Initial ordering of markers within linkage group
Markers assigned to a linkage group were initially ordered using the reduced data set. Starting from the existing order of a linkage group, each marker assigned to that linkage group by two-point analysis, but not present on the multi-point map, was tested in every possible location. These unmapped markers were evaluated in order of decreasing informativeness; the marker with the greatest number of informative meioses was tested first, and the marker with the least informative meioses was the last evaluated in each round of marker insertion. Markers were inserted at the location with the highest log-likelihood, if placement at that position did not increase map length by more than 0.75 cM. This somewhat arbitrary limit on increases in map length was imposed to minimize distortion that can result from errors in genotypes and marker order. Once a set of markers meeting the allowable increase in map length was inserted, alternative marker orders were evaluated in an iterative procedure until a more likely order could not be identified. Log-likelihoods were computed with each pair of adjacent markers interchanged. The current order was then replaced by the order of the switched pair showing the greatest improvement in likelihood, and the process was repeated until switching pairs of adjacent markers did not improve likelihood of the map. Map lengths with each marker temporarily removed from the linkage group were then determined. If removing a marker reduced map length, that marker was evaluated in all possible positions and reinserted at the position with the highest likelihood. If order was changed by removing and reinserting markers into more likely positions, the reordering process, starting with interchanging adjacent pairs of markers was repeated. If the resulting marker order was different than the initial order from the previous attempt to insert markers, and some markers assigned to the linkage group remained unmapped, processing continued with another attempt to insert markers.
Finalizing maps
After no more markers could be inserted that satisfied the criteria defined above, and the algorithms did not reveal a more likely marker order from the reduced data, the full data set was used to finalize the set of markers on the multi-point map, then order and position those markers. Inferred inheritance of ambiguous genotypes, included in the complete data set, suggested some map inflation and marker rearrangements that were not apparent in the reduced data set. Lengths of the map with markers individually removed were determined, and interior markers that stretched the map more than 0.75 cM were eliminated from the set of mapped markers. At this stage, the limit prevented map inflation caused by placing markers in the most likely solvable order, when possibly more likely orders, that did not increase map length, could not be solved using the complete data. The process of switching adjacent marker pairs was repeated with the full data set to establish a final marker order, and marker positions were computed from this order. The final maps represent the most likely marker order identified with the complete data set, although an exhaustive search of all possible marker orders was not conducted (and is not feasible).
After determining the final marker order, the chrompic option of CRIMAP was used to determine likely grandparental origin of marker alleles and identify recombination along each progenys' chromosomes. Several chrompic analyses were conducted for each linkage group; one for the final map and one for each marker that increased map length by more than 0.75 cM. Possible genotyping errors, indicated by two or more recombinations in an individual's chromosomes, were identified and suspicious genotypes were checked by manual inspection of the raw spectrographic data. Where no apparent error in the assay could be detected, animals were genotyped a second time using the Sequenom system, and in selected cases, genotypes were verified by sequencing. Corrections were entered in the database, and used in subsequent map construction cycles.
Confidence interval estimation
Confidence intervals around marker positions were estimated by computing the likelihoods of each marker in all possible positions, while preserving the final order of remaining markers in the linkage group. These likelihoods were computed with CarthaGene [20] using output translated from the CRIMAP [14]chrompic analysis of the final marker order. CarthaGene was used primarily for speed; the CarthaGene analyses ignored the distinction between probable and known allele phase, but yielded similar results in substantially less time and eliminated any occurrence of uncomputable orders. For a given LOD threshold, alternate positions yielding a log-likelihood difference less than the threshold were determined, as well as map positions (cM) of markers holding that order. Confidence intervals were computed from map positions corresponding to marker order. A map containing n markers ordered 1, 2, ... n had corresponding positions p
1
, p
2
, ... p
n
. An individual marker m holding order i could be placed anywhere between the position corresponding to the left index li to the position corresponding to the right index ri (li <= i <= ri) without reducing log-likelihood more than some threshold t. Confidence intervals were estimated from the equations:
CI
mt
= pli + 1- pri - 1for li > 1 and ri <n
CI
mt
= p
li
- pri - 1for li > 1 and ri = n
CI
mt
= pli + 1- p
ri
for li = 1 and ri <n.
The estimated confidence intervals are bounded by 0 and pn, so intervals including either end of the linkage group may be underestimated.
Comparative mapping
The collection of bovine sequences in GenBank sequence tag site (STS), mammal (MAM), patent (PAT), EST, and genome survey sequence (GSS) divisions, excluding sequences from the ongoing bovine sequencing effort [16], were obtained. Bovine sequences associated with markers were identified with e-PCR [30]. Where the primer pair for a marker matched multiple sequences, a consensus sequence representing that marker was obtained with phrap[28], and repetitive sequence was masked [25]. BLAT [31] was used to align resulting sequences to the May 2004 human genome assembly [32]. The highest scoring alignment for each marker sequence was identified, and was considered comparative only if all high scoring alignments for that marker consistently aligned with the same region of the human genome. Marker-human alignments were discarded if the marker sequence aligned with two or more regions of the human genome with similarly high BLAT scores.