Highly conserved gene order and numerous novel repetitive elements in genomic regions linked to wing pattern variation in Heliconius butterflies

Background With over 20 parapatric races differing in their warningly colored wing patterns, the butterfly Heliconius erato provides a fascinating example of an adaptive radiation. Together with matching races of its co-mimic Heliconius melpomene, H. erato also represents a textbook case of Müllerian mimicry, a phenomenon where common warning signals are shared amongst noxious organisms. It is of great interest to identify the specific genes that control the mimetic wing patterns of H. erato and H. melpomene. To this end we have undertaken comparative mapping and targeted genomic sequencing in both species. This paper reports on a comparative analysis of genomic sequences linked to color pattern mimicry genes in Heliconius. Results Scoring AFLP polymorphisms in H. erato broods allowed us to survey loci at approximately 362 kb intervals across the genome. With this strategy we were able to identify markers tightly linked to two color pattern genes: D and Cr, which were then used to screen H. erato BAC libraries in order to identify clones for sequencing. Gene density across 600 kb of BAC sequences appeared relatively low, although the number of predicted open reading frames was typical for an insect. We focused analyses on the D- and Cr-linked H. erato BAC sequences and on the Yb-linked H. melpomene BAC sequence. A comparative analysis between homologous regions of H. erato (Cr-linked BAC) and H. melpomene (Yb-linked BAC) revealed high levels of sequence conservation and microsynteny between the two species. We found that repeated elements constitute 26% and 20% of BAC sequences from H. erato and H. melpomene respectively. The majority of these repetitive sequences appear to be novel, as they showed no significant similarity to any other available insect sequences. We also observed signs of fine scale conservation of gene order between Heliconius and the moth Bombyx mori, suggesting that lepidopteran genome architecture may be conserved over very long evolutionary time scales. Conclusion Here we have demonstrated the tractability of progressing from a genetic linkage map to genomic sequence data in Heliconius butterflies. We have also shown that fine-scale gene order is highly conserved between distantly related Heliconius species, and also between Heliconius and B. mori. Together, these findings suggest that genome structure in macrolepidoptera might be very conserved, and show that mapping and positional cloning efforts in different lepidopteran species can be reciprocally informative.

conservation and microsynteny between the two species. We found that repeated elements constitute 26% and 20% of BAC sequences from H. erato and H. melpomene respectively. The majority of these repetitive sequences appear to be novel, as they showed no significant similarity to any other available insect sequences. We also observed signs of fine scale conservation of gene order between Heliconius and the moth Bombyx mori, suggesting that lepidopteran genome architecture may be conserved over very long evolutionary time scales.

Conclusion:
Here we have demonstrated the tractability of progressing from a genetic linkage map to genomic sequence data in Heliconius butterflies. We have also shown that fine-scale gene order is highly conserved between distantly related Heliconius species, and also between Heliconius and B. mori. Together, these findings suggest that genome structure in macrolepidoptera might be very conserved, and show that mapping and positional cloning efforts in different lepidopteran species can be reciprocally informative.

Background
Among emerging evolutionary and ecological model organisms, the passion-vine butterfly genus Heliconius (Nymphalidae: Heliconiinae) offers particularly exciting possibilities for integrative research into the genetic and developmental basis of adaptive variation [1,2]. The genus, composed of around 40 species with hundreds of geographic variants, couples color pattern divergence with multiple cases of mimicry-related convergent evolution [2]. The wing color patterns of Heliconius are adaptations that warn potential predators of the butterflies' unpalatability [3] and also play an important role in speciation [4]. Nearly all Heliconius species participate in local Müllerian mimicry associations and, in any one area, the wing color patterns of different aposematic butterfly species converge into a handful (usually six or less) of clearly differentiated mimetic assemblages [5]. The color patterns characterizing many of these mimicry rings often change dramatically every few hundred kilometers. This pattern of convergent and divergent evolution in Heliconius is best exemplified by the mimetic relationship between H. erato and H. melpomene. The two species are distantly related within the genus and never hybridize [2,6,7], yet, where they co-occur, local races possess nearly identical wing patterns and have undergone parallel and congruent radiations into over 20 geographic races [5,8].
The multiple radiations of mimetic color patterns, particularly the parallel radiations of H. erato and H. melpomene, provide "natural experiments" for comparative studies into the genetic and developmental basis of adaptive change. In this paper, we describe a simple strategy that integrates growing genomic resources in Heliconius to identify regions of the genome near the loci that modulate wing pattern variation in H. erato. Our strategy relies on the fact that large phenotypic differences within species are caused by a handful of major effect loci [8] and that crosses can be designed that allow researchers to unambiguously follow the segregation of alleles at these loci [9,10]. By scanning through thousands of AFLP polymorphisms in these crosses we can identify markers tightly associated with particular color pattern genes. These markers are then used to probe newly available Bacterial Artificial Chromosome (BAC) libraries and allow us to obtain large sections of genomic sequence around color pattern genes. These targeted genomic sequences provide the first insights into the architecture of the H. erato genome including details on gene density, repeat structure and, with sequence information from homologous regions of the H. melpomene genome, the preservation of fine-scale gene order between the two co-mimics. These data facilitate comparative mapping work on the genetic basis of color pattern variation and convergence in Heliconius, including efforts to positionally clone the color pattern genes themselves. These data also provide some of the first information on patterns of microsynteny in lepidopteran genomes, complementing recent work showing marked patterns of synteny conservation at a macro scale between H. melpomene and the silk moth Bombyx mori [11].
We are focusing our research efforts on two major color pattern loci, D and Cr, which underlie much of the observed pattern variation in H. erato. Both genes are unlinked and alleles at the different loci interact to cause phenotypic shifts across large areas of the wing surface, changing the position, size and shape of red/orange/yellow and melanic patches on both the dorsal and ventral surfaces of the forewings and hindwings. Alleles at the D locus primarily act by switching scale color between black (melanin) and red/orange (ommochrome pigments) [12,13]. In contrast, alleles at Cr control the positioning of melanin across both the forewing and hindwing, thereby either exposing or covering underlying white and yellow pattern elements (Figure 1). The two loci strongly interact to control the size, shape, and position of both the forewing band and hindwing bar of many races of H. erato [9,10,14].
Cross design and wing phenotypes Figure 1 Cross design and wing phenotypes. Color pattern phenotypes observed in crosses between 'grand-parental' H. himera (middle) and H. erato cyrbia (left) and H. erato notabilis (right) resulting in two pairs of F1 parents with females on left, males on right. Each pair of F1 parents produced F2s with H. himera × H. erato cyrbia offspring (A) and the H. himera × H. erato notabilis offspring (B). In both F2 families, the observed phenotypic differences among individuals are consistent with the interaction of two co-dominant loci, as described in the Methods.
Crossing experiments among the various races of H. erato and H. melpomene have shown that the genetic basis of the color pattern radiations is similar in these species [15]. In both, a small number of major effect loci, or complex of tightly linked loci, modulate much of the intraspecific pattern variation. Furthermore, the phenotypic effects of many of the major patterning genes are often quite similar between the two species [5,16,17]. For example, Cr in H. erato and the N/Yb/Sb complex in H. melpomene control most of the variation in yellow and white pattern elements in different mimic races of the two species [9,10,17]. Similarly, variation in the major red pattern elements on the forewing and hindwing of H. erato and H. melpomene can be explained by variation at an unlinked gene, D, in H. erato and the similarly named D/B complex, in H. melpomene. In contrast, in H. melpomene these switch genes represent clusters of tightly linked elements separated by one cM or less [2,18]. Comparative mapping experiments have shown that the Yb complex in H. melpomene and the Cr locus in H. erato, which have analogous phenotypic effects, map to the homologous regions of their respective genomes [15].
There were three primary goals for the study presented here. First, we sought to identify molecular markers linked to the H. erato color pattern genes D and Cr. Second, we used some of these molecular markers to identify and sequence BAC clones containing genomic sequences linked to these color pattern genes. Lastly, we analyzed selected BAC sequences in order to better understand finescale characteristics of the H. erato genome and to make comparisons with homologous genomic sequences in H. melpomene and B. mori. Ultimately we found that synteny is highly conserved between Heliconius species, and even between Heliconius and B. mori. We also observed relatively low gene density coupled with a high frequency of novel repeat elements in the Heliconius genomic sequences. Together, our data show that comparative genomic analysis between lepidopterans is highly tractable, and that positional cloning of genes underlying color pattern variation in Heliconius should be possible using standard methods.

Identification of markers tightly linked to color pattern genes
We examined 1440 AFLP H. erato polymorphisms using 23 primer combinations (EcoCN/MseCNN). The number of AFLP bands per gel ranged between 26 and 132 with a mean of 72 bands per primer combination. Of these, approximately 84% were polymorphic in our outbred F2 cross. The experiment-wide error rate for our screen was approximately 1.0%, as inferred from discrepancies among female informative (FI) markers. In total, we scored 490 Male Informative (MI) and 470 backcross informative (BI) loci. Assuming an estimated H. erato genome size of 395 Mb [14], and assuming that AFLP markers are distributed randomly, suggests that we surveyed polymorphisms at approximately 362 kb intervals across the genome. This would suggest a resolution of 1.3 cM assuming that the relationship between physical and recombination distance is 276 kb/cM [9].
Our genome scan identified several AFLP markers 1-3 cM away from D. For the other gene, Cr, previous work using an identical strategy on crosses of H. melpomene provided markers within one cM of this gene in H. erato [15]. In total, we identified five AFLP loci within a 3 cM target window around the D locus. Across our H. himera × H. erato notabilis mapping family for D, three loci (MI_EcoCA-MseCAA-114 bp, BI_EcoCC-MseCAG-155 bp, BI_EcoCT-MseCCG-139 bp) were perfectly linked and two loci (MI_EcoCC-MseCAC-527 bp, MI_EcoCC-MseCAC-485) showed only one recombinant. We cloned and sequenced several of these AFLP loci and developed PCR primers to amplify them from genomic DNA. Interestingly, two AFLP bands tightly linked to the D gene, MI_EcoCC-MseCAC-527 bp, MI_EcoCC-MseCAC-485, were allelic variants of the same locus.

Identification of BAC clones containing color patternlinked AFLPs
We screened the H. erato BAC library with D-linked (MI_EcoCC-MseCAC-527 bp) and Cr-linked ("βggt-II" -Rab geranylgeranyl transferase beta subunit, βggt-II gene [15]) probes. In our screens with these and other probes (15 probes in total, with an average of 12 positives per probe), we consistently observed between 9 and 15 PCRconfirmed positives per probe, suggesting the H. erato BAC libraries have approximately 10× genome coverage. The largest clones identified from the D-linked and Crlinked probing experiments were sequenced at 8× coverage. The D-linked clone (BBAM-25K4) was composed of two large sequences that could easily be orientated to produce an approximately 180 kb genomic fragment ( Figure  2). Similarly, sequence of the Cr-linked clone (BBAM-38A20) was composed of two large sequences that together spanned approximately 165 kb ( Figure 3). The probe sequences were clearly identifiable in the D-linked and Cr-linked BACs and linkage to color pattern genes was confirmed by mapping (see below).

Chromosome walk in the Cr region
From the sequence of the first Cr-linked BAC, identified with the βggt-II gene, we designed additional probes to use for a second round of BAC library screening. Specifically we generated two more probes, corresponding to the genes Trehalase1 and B9, to expand our walk on both directions. With this strategy we identified new BACs on the 3' end that were positive for B9 and negative for H. erato BAC sequence (25_K04) annotation   . . .
Trehalase1 and others on the 5' end positive for βggt-II and negative for Trehalase1. After fingerprinting we selected one BAC to sequence from each end. Ultimately, BBAM-27D18 extended the overall contig by 211 kb on the 3' end, while BBAM-12K4 extended the contig on the 5' for 55 kb (Figure 4).

High frequency of novel repetitive elements in the Heliconius genome
The D-and Cr-linked BAC sequences were AT-rich (65% and 66% AT content, respectively) and contained many repeats (Table 1). Although all three major classes of transposable elements (DNA transposons, LTR, and non-Fine-scale synteny and sequence conservation between H. erato and H. melpomene    Genes showing a high similarity to know proteins in other arthropods are represented with black squares and arrows indicating orientation. The lines between the two sequences unite the homologous genes giving a visual representation of the overall synteny between the two species.  LTR retrotransposons) were present in the genomic sequences (Table 1A), the vast majority of repetitive sequences showed no significant BLAST similarity (i.e. evalue < 0.001) to any of the insect genomes currently available NCBI databases nor to any arthropod transposable elements listed in RepBase [19]. Heliconius-specific repetitive sequences corresponded to the nine core motifs identified with RepeatFinder (  (Table 2). For example, across the D-linked BAC we identified 75 hypothetical proteins using the Kaikogaas annotation tool and our own BLAST analysis. Over 90%, however, were less than 150 amino acids in length, only one of which showed similarity to any known or predicted protein. Across the entire ~190 kb region near the D locus in H. erato there were only two hypothetical proteins that showed significant homology to a known protein or contained a known structural element. One was similar to a sequence in our EST collection (HEC00815), while the other showed strong homology to a lepidopteran methionine-rich larval storage protein ( Table 2). Although there was a smaller absolute number of predicted proteins across the Cr-linked BAC, there were more than twice as many predicted proteins of aminoacid length greater than 150 across, all but one of which showed strong homology to known proteins or to Heliconius ESTs (Table 2, Figure 2). Overall, the gene density we observed in H. erato is similar to what has been observed in the repeat-rich heterochromatin domains of Drosophila melanogaster, which averages 2.9 genes per 100 kb (versus 12.6 genes per 100 kb in euchromatin) [20].

Fine-scale microsynteny between H. erato and H. melpomene genomic sequences
VISTA analysis [21] (70% identity, 30 bp window -see Methods) showed a 35% conservation between the 164 kb H. erato Cr-linked BAC clone (BBAM-38A20) and the homologous sequence from H. melpomene ( Figure 3). All of the predicted genes showed strong similarity (80-95% identity) between the two species and perfect overall synteny (Figure 3 and 4). Furthermore, a significant portion of 57 kb of presumed non-coding sequence (i.e. did not show notable open reading frames) was highly conserved between the two species ( Figure 3). Despite the overall conservation between H. erato and H. melpomene sequences, two ESTs did show a difference between the

Conservation of gene order between H. erato and B. mori
We found evidence for fine-scale synteny between H. erato and B. mori in the 420 kb genomic region linked to the Cr color pattern gene ( Figure 5). No evidence for microsynteny was observed between D-linked H. erato genomic regions and B. mori, due to the lack of conserved genes in the D-linked clone. B. mori scaffold sequence (nscaf2829), downloaded from SilkDB [22], contained all of the major genes annotated on the Cr-linked BAC clones (Table 2, All seven genes showed 70-85% nucleotide acid sequence similarity between species ( Figure 5). In addition to the major genes, there were many other genomic regions between Heliconius and B. mori with a nucleotide acid sequence similarity higher than 85% that did not show BLAST similarities to any known proteins.

High levels of fine-scale genomic conservation between Heliconius species
We have previously demonstrated that the Cr locus in H. erato and the Yb gene of H. melpomene map to homologous areas of the genome [15]. BAC genome sequence data for a region tightly linked to the Yb gene was obtained from H. melpomene using methodology similar to that described here. A single gene marker developed from the H. melpomene sequence mapped close to the Cr locus in H. erato. Here we provide the first genomic sequence evidence that, across a broad region around this gene, gene order and gene content is conserved. Across a 420 kb overlapping region all putative proteins showing high similarity to known proteins were in the same order in the two co-mimics (Figure 3 and 4). This further supports the hypothesis that a homologous gene, or set of  genes, is responsible for color pattern variation in the two species. Indeed, with the exception of two large ORFs with strong sequence similarity to a reverse transcriptase ( Table  2, Figure 3), gene order in the Cr-linked H. erato region and the N/Yb/Sb-linked H. melpomene region was nearly perfectly preserved. Furthermore, many of the smaller ORFs, as well as some non-coding sequence, were highly conserved both in the relative order and sequence. Generally, the H. erato and H. melpomene genomes appear to be structurally very similar. Because of this, linkage analyses and positional cloning efforts in each individual species should be highly informative for the co-mimetic species, and probably for the genus as a whole.

The difference in H. erato and H. melpomene genome sizes
One of the most obvious differences between the H. erato and H. melpomene genomic sequences was the larger physical distance between homologous anchor points in H. erato relative to H. melpomene. This observation was probably related to the fact that the genome of H. erato is about 30% larger than that of H. melpomene [9,10]. In this respect, it was notable that the difference in genome sizes between the two species was roughly proportional to the size of a number of sequence blocks that are absent in H. melpomene relative to H. erato in our genomic sequences ( Figure 3). Many of these blocks were comprised of Heliconius-specific repetitive sequences, or showed strong similarity to known mobile genetic elements. These indel blocks appeared to be primarily noncoding sequences because none of them contained protein-coding sequences, with the exception of one EST (HEC03006) and a reverse transcriptase.
Differences in the genome size between closely related species is common and is usually associated with differences in abundance of different classes of noncoding DNA [23]. It has been suggested that the insertion and replication of repetitive elements, as well as indel biases, can lead to profound differences in genome size [24,25]. Our observations suggest that both of these effects might be relevant in Heliconius, however, the differences in amount of repetitive DNA sequences in indels, at least over the small region that we examined, were not large enough to completely account for the differences in the genome size of the two species.  (Table 1A). Because we were unable to detect any of the remaining nine repeats (all identified via RepeatFinder) in public sequence databases we assume these nine repeats represent novel repetitive elements unique to the Heliconius genus (Table 1B, Additional file 1: Novel repetitive elements in Heliconius (row sequences)). The seven previously described elements were larger (~1-5 kb) relative to the nine novel elements (100-600 bp) and occurred much less frequently. Most instances of these novel repeats observed in the BAC sequences were intact, full-length, highly similar versions of the core motifs. However, a wide range of fragmentation and divergence relative to the core motifs was also observed among repetitive regions. Motif #7 in H. erato best exemplifies this pattern, as it was the most abundant repeat observed, with 310 BAC regions showing significant similarity to the 266 bp core motif. These regions ranged continuously in size from 37 to 286 bp and in divergence from 2% to 32%. All other motifs showed a qualitatively similar pattern where BAC regions corresponding to a novel element ranged from being highly similar to the core motif sequence to being fragmented and divergent. This pattern is consistent with a process of motif replication and insertion followed by mutational degradation, suggesting these novel repeat motifs likely represent some sort of transposable element such as short interspersed nuclear elements (SINEs) or miniature inverted-repeat terminal elements (MITEs) [28]. Both SINEs and MITEs have been reported from Lepidoptera [29,30]. More work is required to confidently determine the origin of these novel repetitive sequences in Heliconius. We are hopeful that future genomic sequences from other butterfly species will allow a better understanding the phylogenetic distribution and evolutionary origins of some of the observed repeats.

Preliminary evidence for fine-scale synteny between Heliconius and Bombyx
The observed fine scale synteny within Heliconius complements a recent study showing that patterns of macrosynteny are strongly conserved across the Macrolepidoptera [11]. This previous study demonstrated that a total of 70 markers mapped in H. melpomene and B. mori showed large-scale patterns of synteny across the genome (taking into account a number of putative chromosomal fusions that explain the difference in chromosome number between these species). Our data further suggests that synteny has been preserved between Heliconius and B. mori on a much finer scale. Specifically, we show here that seven predicted genes in the Cr region have a similar order in the homologous B. mori genomic sequence ( Figure 5). Although this is a very small sampling of the genome as a whole, it is still notable that gene order, as well as intergenic distances, has been preserved over such a long time scale.

Chromosome walking towards Heliconius color pattern genes
The goal of this study was to identify and characterize regions of the genome linked to wing pattern polymorphism in Heliconius butterflies. We did not necessarily expect these initial BAC sequences to contain the color pattern genes themselves, however, these sequences provide important genomic "anchors" for ongoing positional cloning work. Fine-scale mapping experiments imply that we are very near the D and Cr color pattern loci. A microsatellite marker at the 5' end of the D-linked BAC showed 7 recombinants across 444 individuals, suggesting that this end is about 1.9 cM from the gene. There were two fewer recombinants for a marker developed from exon sequence of the Methionine Rich Storage Protein (MRSP) gene at the 3' end of the BAC. This marker is about 150 kb from our 5' microsatellite marker, suggesting that D is a further 1 cM in the 3' direction. We appear to have come down even closer to the Cr gene. In this case, a marker developed from the βggt-II gene near the 5' of the BBAM-38A20 BAC clone showed no recombinants across 430 individuals. A similar result was obtained with a marker developed from the Putative DNA helicase Rec-Q in the middle of the BBAM-27D18 BAC clone, suggesting that the zero recombinant interval might be somewhat large. For this reason, given an expected relationship of physical to recombination distance of 275 kb/cM [9], an expectation consistent with our initial mapping results, we are optimistic to identify the genomic recombinant interval containing both genes with the next one (Cr) to three (D) BAC steps.
In terms of identifying color pattern genes, there are obvious practical benefits of a fine-scale preservation of gene order between species. Foremost, conservation should greatly facilitate the identification of functional genes and the discovery of some regulatory elements associated with these genes [31,32]. Indeed, there were numerous conserved regions that were not simply protein coding regions ( Figure 3). Even though this analysis covers only a small portion of the Heliconius genome, it confirms, for the first time, at a fine scale level what has been seen in comparative mapping projects concerning gene order conservation between different species in the genus [9,10,33].
A comparative approach will be particularly important for pinpointing the regions responsible for pattern variation in Heliconius. Pattern formation in Heliconius probably involves discrete changes in conserved protein coding or regulatory regions [2,12]. There is little precedent for what to expect, however, variation in pattern formation could be controlled by a number of cis-regulatory elements of a single gene, clusters of duplicated genes with divergent function, or clusters of non-paralogous but functionallyrelated genes.

Conclusion
The mimetic wing patterns of Heliconius stand out as one of the best examples of an adaptive radiation. We are using a strategy that couples growing genomic resources with high-resolution linkage analysis in order to gain a fuller appreciation of the genetic basis of this radiation.
We have identified regions of the Heliconius genome tightly linked to genes that modulate pattern variation and, for one of these regions, we have demonstrated the fine-scale preservation of gene order between distantlyrelated Heliconius species and across Lepidoptera (Heliconius and B. mori). This conservation is significant because it will greatly facilitate efforts for gene identification through parallel and complementary efforts in different species. It is our hope that further fine-scale mapping, complemented by targeted genomic sequencing, will allow us to identify the genes that underlie wing pattern variation and diversity in the genus Heliconius.

Cross strategy
We generated large F2 mapping families by crossing two different geographic races of H. erato to the same stock of H. himera (Figure 1). We followed segregating variation at both the D and the Cr loci in crosses between H. erato cyrbia and H. himera and the D and Sd loci using crosses between H. erato notabilis and H. himera (Figure 1). All crosses were carried out in the Heliconius insectary at the University of Puerto Rico from stocks originally gathered from the wild: H. himera was collected in Vilcabamba, Ecuador (79.13 W, 4.6 S), H. erato notabilis was collected from an area near Puyo, Ecuador (78.0 W, 1.5 S), and H. erato cyrbia was collected near Guayquichuma Glen, Ecuador (79.6 W, 3.9 S). After eclosion butterflies were euthanized, their wings removed for later morphological analysis, and their bodies frozen at -80 °C for later molecular analysis.

Identifying AFLP markers linked to color pattern genes
Genomic DNA was extracted following procedures described in [14] and approximately 500 ng of DNA was used as template of our AFLP reactions. Genomic DNA restriction digestions and the ligation of oligonucleotide adapters were performed using the Core Reagent Kit (Invitrogen Life Technologies) following manufacture's instructions. For all the other steps of the AFLP analysis (e.g. serial dilutions and PCR amplification protocols), we followed the modifications of the original protocol described by Vos et al. [34] as outlined in Papa et al. [35].
To efficiently identify primer combinations that contained loci tightly linked to color pattern genes, we used a modification of the "bulk segregant analysis" method [36]. Specifically, we screened 23 AFLP primer combinations across an initial panel of 48 individuals arranged by color pattern genotype. Each reaction was run on a 10% polyacrylamide gel and visualized using fluorescently labeled (IRDye 700 or 800) EcoCN primers on a NEN ® Global Edition IR2 DNA Analyzer (LI-COR ® Biosciences, Lincoln, NE). In our screening gels, individuals with distinctive color-pattern genotypes were grouped and these groups were run side-by-side.  (Figure 1). In this way, we easily identified marker loci linked to the color pattern genes and those primer combinations showing tightly linked markers were further assayed on all 120 individuals in the brood.
Because of the nature of our outbred F2 cross design [9,10] we observed AFLP loci in all possible phases and segregation patterns. For each AFLP primer combination, we scored four different AFLP marker types: (1) monomorphic, (2) female-informative (FI) (AFLP band present in the mother and absent in the father), (3) male-informative markers (MI) (AFLP band present in the father but absent in the mother), and (4) markers that were present in both parents but segregated in offspring (BI). Both MI and FI markers were expected to segregate in a 1:1 ratio, whereas the BI markers, which were heterozygous in the parents, segregate in a 3:1 ratio. There is no crossing over during the oogenesis in Lepidoptera [37,38] and FI markers on the same chromosome are inherited as a linkage block [10]. Thus, only the MI and BI provided information on recombination distance between AFLP polymorphisms and color pattern genes. Nonetheless, the FI markers were extremely useful because they allowed us to unambiguously identify chromosomal linkage groups and estimate background error rate of the AFLP technique [9,10].
Genotypes for the color pattern genes were inferred from wing pattern phenotypes. Based on the amount of red, white and yellow in the forewing and hindwing, nine genotypes were observed among both groups of F2 progeny.

Isolating and characterizing AFLP markers
We screened all primer combinations that generated BI or MI AFLP markers strongly associated with particular color pattern genes in our initial "bulked" sample across all 120 individuals from our mapping family. We excised and cloned those tightly linked AFLP markers larger than 160 base-pairs using a three-step strategy. First, the band was isolated from a polyacrylamide gel using a LI-COR ® Biosciences Odyssey ® Infrared Imaging System. We followed methods outlined in [39] and used a grid to position and excise specific fragments with a scalpel. We validated each excision by re-scanning the gel to confirm that we had removed the correct fragment. All gel fragments were placed in 15 ml of 1× TE and frozen at -80°C. Next, we reamplified the AFLP using the original selective primer combination and the original PCR conditions. We generated template for this reaction by freeze/thawing the excised band three times. In each freeze/thaw cycle, we collected the resulting supernatant after the band was frozen for one hour at -80°C, heated at 55°C for 15 min and centrifuged for 15 min at 15,000 rpm. We checked amplification success by running PCR products on polyacrylamide gels adjacent to the positive control original AFLP Libraries were gridded onto nylon membranes using a strategy where each clone is spotted twice to facilitate the identification of "true" positives. AFLP probes were labeled with P 32 using the Prime-It II Random Primer Labeling Kit (Stratagene, CA, USA). The resulting radioactive labeled products were cleaned using a sephadex purification column and hybridized to the filters overnight at 65 degrees in Church Buffer (0.5 M NaHPO4 pH 7.2, 7%SDS, 1 mM EDTA, and 1%BSA) with rotation. The filters were washed twice with 2× SSC +0.1%SDS, then once or twice with 1× SSC + 0.1%SDS. The washed filters were placed on film for 1-5 days at -80 degrees, depending on signal strength.

BAC fingerprinting, sequencing, and annotation
All positive clones identified in our screen of the H. erato BAC libraray were PCR-confirmed using probe specific primers. Clones were then grown overnight on agar plate and single colony was used to inoculate TB media. Insert DNA was extracted using the Qiagen Maxi prep kit. This insertion size of each BAC clone is estimated by summing all fragments after restriction enzyme digestion using EcoRI or BamHI. The largest clone identified from our fingerprinting experiments was sequenced and assembled by the Baylor Genomics Center. Clones were first sheared to create 4-6 kb fragments and subcloned into pUC19. Approximately 8× sequence coverage of each BAC was then generated in paired 600-800 bp reads. Data were assembled using PHRAP [41] and edited in a GAP4 [42] database.
The BAC sequences were analyzed using a variety of sequence annotation programs. First we used Kaikogaas [43], an automated annotation package for gene prediction [44], to identify possible genes. Kaikogaas integrates a variety of programs for gene prediction and structural analysis of genomic sequence including software for coding region and splice-site prediction, sequence homology analysis, protein localization site prediction, and protein classification and secondary structure prediction. All putative open reading frames were also compared directly to our database of Heliconius Expressed Sequence Tags (ESTs) located in ButterflyBase [45] using BLAST [46]. Any EST that showed highly significant similarity (e-value of ≥ 10 -40 ) to a putative ORF was then directly compared to the full genomic sequence. In this way, we could better assess possible homology versus similarity due to repetitive DNA elements (see below). To identify the B. mori sequences homologous to the H. erato BAC sequences, we performed a tBLASTn against the whole-genome shotgun reads of B. mori using as a query the translated protein sequence predicted by Kaikogaas.

Comparative sequence analysis
A linkage strategy similar to ours was previously used to identify BACs near the Yb gene complex in H. melpomene [15]. Sequencing and finishing of three contiguous H. melpomene BACs across this region was carried out by the Wellcome Trust Sanger Institute (accession numbers: CR974474, CT955980, CU367882). We used VISTA [21] to identify similarities between H. erato, H. melpomene, and B. mori genomic sequences. Pairwise genomic alignments were performed on the mVISTA server using the Avid alignment algorithm and the results were displayed together with the position of the annotations (ORFs, genes, mobile DNA, microsatellites). Each annotation or new genomic region identified from the comparative analysis that showed a strong similarity (≥ 80% conserved) was verified by aligning the sequences from both species using the LAGAN program [47] to get an accurate DNA sequence assembly.
We also searched the H. erato and H. melpomene BAC sequences for novel repetitive sequence as well as previously described transposable elements. To identify novel repetitive sequences in the BACs, we used the Repeat-Finder software [42,48]. RepeatFinder is explicitly designed to find complex repeated motifs in large contiguous blocks genomic DNA sequence. Its algorithm uses BLAST to iteratively query segments of the input sequence against the intact input sequence. It then combines subsequences with high BLAST similarity into groups, which are further refined by considering the variability in length and divergence among constituent subsequences. The final output is a list of groups of aligned, highly similar subsequences. It is important to note that because the algo-rithm uses local alignments (generated by BLAST), different groups may overlap in part or in whole. This overlap allows the groups to be clustered into just a few contigs representing a set of "core motifs" to which each group can be uniquely assigned. Thus, each core motif is a consensus of consenses.
We submitted the concatenated BAC sequences from each species to RepeatFinder using default parameters except for the following: Block Size = 2000, Minimum Repeat Size = 20, Maximum Repeat Size = 700. Larger values for the Maximum Repeat Size parameter were tried, but no groups larger than ~600 bp were ever identified. Core motifs were assembled from group consensus sequences >35 bp in length using the default parameters in Codon-Code Aligner software (CodonCode Corp., Dedham, MA). For both species' sets of core motifs, all-vs-all BLASTn searches were performed 1) between species to identify motifs shared between species and 2) within species to verify that motifs within species were unrelated. BLASTn was also used to search two additional databases for sequences similar to the core motifs. First we searched a combined database of the completed genomic or whole genome shotgun sequences from all 20 insect genome projects currently available at NCBI. We also searched all arthropod transposable elements available in the Repbase library of transposable elements [19].
We simultaneously identified and masked BAC regions corresponding to both the novel core motifs and known repetitive sequences using RepeatMasker [27]. We combined core motif sequences with sequences of all arthropod transposable elements from RepBase into a single database. We queried this database with the concatenated BAC sequences using RepeatMasker and the CrossMatch search algorithm on the 'slow' setting to maximize sensitivity. The resulting data was summarized using custom scripts implemented in the R statistics package [49].