Cross strategy
We generated large F2 mapping families by crossing two different geographic races of H. erato to the same stock of H. himera (Figure 1). We followed segregating variation at both the D and the Cr loci in crosses between H. erato cyrbia and H. himera and the D and Sd loci using crosses between H. erato notabilis and H. himera (Figure 1). All crosses were carried out in the Heliconius insectary at the University of Puerto Rico from stocks originally gathered from the wild: H. himera was collected in Vilcabamba, Ecuador (79.13 W, 4.6 S), H. erato notabilis was collected from an area near Puyo, Ecuador (78.0 W, 1.5 S), and H. erato cyrbia was collected near Guayquichuma Glen, Ecuador (79.6 W, 3.9 S). After eclosion butterflies were euthanized, their wings removed for later morphological analysis, and their bodies frozen at -80 °C for later molecular analysis.
Identifying AFLP markers linked to color pattern genes
Genomic DNA was extracted following procedures described in [14] and approximately 500 ng of DNA was used as template of our AFLP reactions. Genomic DNA restriction digestions and the ligation of oligonucleotide adapters were performed using the Core Reagent Kit (Invitrogen Life Technologies) following manufacture's instructions. For all the other steps of the AFLP analysis (e.g. serial dilutions and PCR amplification protocols), we followed the modifications of the original protocol described by Vos et al. [34] as outlined in Papa et al. [35].
To efficiently identify primer combinations that contained loci tightly linked to color pattern genes, we used a modification of the "bulk segregant analysis" method [36]. Specifically, we screened 23 AFLP primer combinations across an initial panel of 48 individuals arranged by color pattern genotype. Each reaction was run on a 10% polyacrylamide gel and visualized using fluorescently labeled (IRDye 700 or 800) EcoCN primers on a NEN® Global Edition IR2 DNA Analyzer (LI-COR® Biosciences, Lincoln, NE). In our screening gels, individuals with distinctive color-pattern genotypes were grouped and these groups were run side-by-side. For example, in the H. erato notabilis × H. himera cross we structured the gel with the following four genotypic groups: 1) DhimDnot, SdhimSdhim; 2) DhimDnot, SdnotSdnot; 3) DhimDhim, SdhimSdnot; 4)DnotDnot, SdhimSdnot(Figure 1). In this way, we easily identified marker loci linked to the color pattern genes and those primer combinations showing tightly linked markers were further assayed on all 120 individuals in the brood.
Because of the nature of our outbred F2 cross design [9, 10] we observed AFLP loci in all possible phases and segregation patterns. For each AFLP primer combination, we scored four different AFLP marker types: (1) monomorphic, (2) female-informative (FI) (AFLP band present in the mother and absent in the father), (3) male-informative markers (MI) (AFLP band present in the father but absent in the mother), and (4) markers that were present in both parents but segregated in offspring (BI). Both MI and FI markers were expected to segregate in a 1:1 ratio, whereas the BI markers, which were heterozygous in the parents, segregate in a 3:1 ratio. There is no crossing over during the oogenesis in Lepidoptera [37, 38] and FI markers on the same chromosome are inherited as a linkage block [10]. Thus, only the MI and BI provided information on recombination distance between AFLP polymorphisms and color pattern genes. Nonetheless, the FI markers were extremely useful because they allowed us to unambiguously identify chromosomal linkage groups and estimate background error rate of the AFLP technique [9, 10].
Genotypes for the color pattern genes were inferred from wing pattern phenotypes. Based on the amount of red, white and yellow in the forewing and hindwing, nine genotypes were observed among both groups of F2 progeny. For the D gene, shared by all three grand-parental types: homozygotes for the H. himera allele (DhiDhi) show only red present on the hind-wing bar, when D homozygotes for H. erato cyrbia or H. erato notabilis alleles (DcyrDcyr; DnotDnot) have red scales on the forewings, while D is heterozygotes between H. himera and the two H. erato races (DhiDcyr; DhiDnot) have red pigmented scales on both forewings and hindwings. The Cr gene segregates only in the H. erato cyrbia × H. himera cross, and has an epistatic effect on the D gene by modulating the amount of red. Homozygotes with H. himera alleles (CrhiCrhi) show a typical yellow forewing band when D is pure H. himera (CrhiCrhi; DhiDhi) and a red/white forewing band (total amount of red pigments < 50%) when D is either homozygous for H. erato cyrbia alleles (CrhiCrhi; DcyrDcyr) or heterozygous (CrhiCrhi; DhiDcyr). When the Cr color pattern gene is heterozygous (CrhiCrcyr) a yellow forewing shadow band is present when the D gene is homozyous for H. himera alleles (CrhiCrcyr; DhiDhi), and an almost all red forewing band (red >50%) when D homozygote for H. erato cyrbia alleles (CrhiCrcyr; DcyrDcyr) or heterozygote (CrhiCrcyr; DhiDcyr). A white trailing dorsal edge and a yellow bar on the ventral hindwing identify the pure H. erato cyrbia form (CrcyrCrcyr) with a totally black forewing when D is homozygous for H. himera alleles (CrcyrCrcyr; DhiDhi), or totally red when the D gene is pure for H. erato cyrbia alleles (CrcyrCrcyr; DcyrDcyr) or heterozygous (CrcyrCrcyr; DhiDcyr).
To score color patterns, one must also understand activity of the Sd gene. Sd segregates in the H. erato notabilis × H. himera cross, where it controls the shape of the melanic window in the middle forewing of H. himera and shows a clear interaction with the distal forewing patch of H. erato notabili s. In the pure H. himera (SdhiSdhi) a typical forewing band is shown, while in the pure H. erato notabilis (SdnotSdnot) a shortened basal forewing patch, as well as a smaller distal forewing patch are evident. The heterozygotes (SdhiSdnot) present an intermediate form lacking a distal patch (as found in H. erato notabilis) and have a shortened basal patch showing melanin anterior to the costal vein (this area is normally light colored in H. himera).
Isolating and characterizing AFLP markers
We screened all primer combinations that generated BI or MI AFLP markers strongly associated with particular color pattern genes in our initial "bulked" sample across all 120 individuals from our mapping family. We excised and cloned those tightly linked AFLP markers larger than 160 base-pairs using a three-step strategy. First, the band was isolated from a polyacrylamide gel using a LI-COR® Biosciences Odyssey® Infrared Imaging System. We followed methods outlined in [39] and used a grid to position and excise specific fragments with a scalpel. We validated each excision by re-scanning the gel to confirm that we had removed the correct fragment. All gel fragments were placed in 15 ml of 1× TE and frozen at -80°C. Next, we re-amplified the AFLP using the original selective primer combination and the original PCR conditions. We generated template for this reaction by freeze/thawing the excised band three times. In each freeze/thaw cycle, we collected the resulting supernatant after the band was frozen for one hour at -80°C, heated at 55°C for 15 min and centrifuged for 15 min at 15,000 rpm. We checked amplification success by running PCR products on polyacrylamide gels adjacent to the positive control original AFLP reactions. Finally, we cloned the PCR product using Invitrogen's TOPO TA® Cloning kit. PCR amplified inserts from 10–15 positive clones were again verified size on a polyacrylimide gel and those of the correct size were sequenced using DYEnamicT ET Terminator Cycle Sequencing Kit (Amersham Biosciences). Sequencing reactions were run on a MegaBACE 500 (Amersham Bioscience) or a 3130 DNA PRISM (ABI) at the Sequencing Facilities of the University of Puerto Rico, Rio Piedras. Resulting sequences were aligned by eye and PCR primers were designed using OLIGO version 4.0, and tested in genomic extracts from a panel of H. erato individuals.
Probing the H. erato BAC library
Two H. erato BAC libraries, one partially restricted with Eco RI and one with BamHI, were created from a line of H. erato petiverana collected in Gamboa, Panama and inbred for several generations. The H. erato BAC library was constructed by C. Zhang (TAMU) and M. R. Goldsmith (URI) following the procedure outlined in Wu et al. [40]. Both libraries contain 19,200 clones arrayed in 384-well plates and the average insert size for the Eco RI and Bam HI libraries was estimated to be 153 kb and 175 kb, respectively. Libraries were gridded onto nylon membranes using a strategy where each clone is spotted twice to facilitate the identification of "true" positives. AFLP probes were labeled with P32 using the Prime-It II Random Primer Labeling Kit (Stratagene, CA, USA). The resulting radioactive labeled products were cleaned using a sephadex purification column and hybridized to the filters overnight at 65 degrees in Church Buffer (0.5 M NaHPO4 pH 7.2, 7%SDS, 1 mM EDTA, and 1%BSA) with rotation. The filters were washed twice with 2× SSC +0.1%SDS, then once or twice with 1× SSC + 0.1%SDS. The washed filters were placed on film for 1–5 days at -80 degrees, depending on signal strength.
BAC fingerprinting, sequencing, and annotation
All positive clones identified in our screen of the H. erato BAC libraray were PCR-confirmed using probe specific primers. Clones were then grown overnight on agar plate and single colony was used to inoculate TB media. Insert DNA was extracted using the Qiagen Maxi prep kit. This insertion size of each BAC clone is estimated by summing all fragments after restriction enzyme digestion using Eco RI or Bam HI. The largest clone identified from our fingerprinting experiments was sequenced and assembled by the Baylor Genomics Center. Clones were first sheared to create 4–6 kb fragments and subcloned into pUC19. Approximately 8× sequence coverage of each BAC was then generated in paired 600–800 bp reads. Data were assembled using PHRAP [41] and edited in a GAP4 [42] database.
The BAC sequences were analyzed using a variety of sequence annotation programs. First we used Kaikogaas [43], an automated annotation package for gene prediction [44], to identify possible genes. Kaikogaas integrates a variety of programs for gene prediction and structural analysis of genomic sequence including software for coding region and splice-site prediction, sequence homology analysis, protein localization site prediction, and protein classification and secondary structure prediction. All putative open reading frames were also compared directly to our database of Heliconius Expressed Sequence Tags (ESTs) located in ButterflyBase [45] using BLAST [46]. Any EST that showed highly significant similarity (e-value of ≥ 10-40) to a putative ORF was then directly compared to the full genomic sequence. In this way, we could better assess possible homology versus similarity due to repetitive DNA elements (see below). To identify the B. mori sequences homologous to the H. erato BAC sequences, we performed a tBLASTn against the whole-genome shotgun reads of B. mori using as a query the translated protein sequence predicted by Kaikogaas.
Comparative sequence analysis
A linkage strategy similar to ours was previously used to identify BACs near the Yb gene complex in H. melpomene [15]. Sequencing and finishing of three contiguous H. melpomene BACs across this region was carried out by the Wellcome Trust Sanger Institute (accession numbers: CR974474, CT955980, CU367882). We used VISTA [21] to identify similarities between H. erato, H. melpomene, and B. mori genomic sequences. Pairwise genomic alignments were performed on the mVISTA server using the Avid alignment algorithm and the results were displayed together with the position of the annotations (ORFs, genes, mobile DNA, microsatellites). Each annotation or new genomic region identified from the comparative analysis that showed a strong similarity (≥ 80% conserved) was verified by aligning the sequences from both species using the LAGAN program [47] to get an accurate DNA sequence assembly.
We also searched the H. erato and H. melpomene BAC sequences for novel repetitive sequence as well as previously described transposable elements. To identify novel repetitive sequences in the BACs, we used the RepeatFinder software [42, 48]. RepeatFinder is explicitly designed to find complex repeated motifs in large contiguous blocks genomic DNA sequence. Its algorithm uses BLAST to iteratively query segments of the input sequence against the intact input sequence. It then combines subsequences with high BLAST similarity into groups, which are further refined by considering the variability in length and divergence among constituent subsequences. The final output is a list of groups of aligned, highly similar subsequences. It is important to note that because the algorithm uses local alignments (generated by BLAST), different groups may overlap in part or in whole. This overlap allows the groups to be clustered into just a few contigs representing a set of "core motifs" to which each group can be uniquely assigned. Thus, each core motif is a consensus of consenses.
We submitted the concatenated BAC sequences from each species to RepeatFinder using default parameters except for the following: Block Size = 2000, Minimum Repeat Size = 20, Maximum Repeat Size = 700. Larger values for the Maximum Repeat Size parameter were tried, but no groups larger than ~600 bp were ever identified. Core motifs were assembled from group consensus sequences >35 bp in length using the default parameters in CodonCode Aligner software (CodonCode Corp., Dedham, MA). For both species' sets of core motifs, all-vs-all BLASTn searches were performed 1) between species to identify motifs shared between species and 2) within species to verify that motifs within species were unrelated. BLASTn was also used to search two additional databases for sequences similar to the core motifs. First we searched a combined database of the completed genomic or whole genome shotgun sequences from all 20 insect genome projects currently available at NCBI. We also searched all arthropod transposable elements available in the Repbase library of transposable elements [19].
We simultaneously identified and masked BAC regions corresponding to both the novel core motifs and known repetitive sequences using RepeatMasker [27]. We combined core motif sequences with sequences of all arthropod transposable elements from RepBase into a single database. We queried this database with the concatenated BAC sequences using RepeatMasker and the CrossMatch search algorithm on the 'slow' setting to maximize sensitivity. The resulting data was summarized using custom scripts implemented in the R statistics package [49].