Construction and sequence sampling of deep-coverage, large-insert BAC libraries for three model lepidopteran species

Background Manduca sexta, Heliothis virescens, and Heliconius erato represent three widely-used insect model species for genomic and fundamental studies in Lepidoptera. Large-insert BAC libraries of these insects are critical resources for many molecular studies, including physical mapping and genome sequencing, but none has been available to date.
Results We report the construction and characterization of six large-insert BAC libraries for the three species and a sequence sampling analysis of their genomes. The six BAC libraries were constructed with two restriction enzymes, two libraries for each species, and each has an average clone insert size ranging from 152–175 kb. We estimated that the genome coverage of each library ranged from 6–9 ×, with the two combined libraries of each species being equivalent to 13.0–16.3 × haploid genomes. The genome coverage, quality and utility of the libraries were further confirmed by library screening using 6–8 putative single-copy probes. To provide a first glimpse into these genomes, we sequenced and analyzed the BAC ends of ~200 clones randomly selected from the libraries of each species. The data revealed that the genomes are AT-rich, contain relatively small fractions of repeat elements with a majority belonging to the category of low complexity repeats, and are more abundant in retro-elements than DNA transposons. Among the species, the H. erato genome is somewhat more abundant in repeat elements and simple repeats than those of M. sexta and H. virescens. The BLAST analysis of the BAC end sequences suggested that the evolution of the three genomes is widely varied, with the genome of H. virescens being the most conserved as a typical lepidopteran, whereas the genomes of both H. erato and M. sexta appear to have evolved significantly, resulting in a higher level of species- or evolutionary lineage-specific sequences.
Conclusion The high-quality and large-insert BAC libraries of the insects, together with the identified BACs containing genes of interest, provide valuable information, resources and tools for comprehensive understanding and studies of the insect genomes and for addressing many fundamental questions in Lepidoptera. The sample of the genomic sequences provides the first insight into the constitution and evolution of the insect genomes.


Background
Large-insert bacterial artificial chromosome (BAC) libraries have been shown to be critical resources for many aspects of molecular and genomic studies [1,2], such as the positional cloning of genes [3] and quantitative trait loci [4], comparative studies of synteny and gene organization among different species [5], as well as for local or whole genome physical and genetic mapping and sequencing [6-11]. Arrayed, large-insert DNA libraries have provided the opportunity for researchers to analyze and share information and resources on specific clones [1,2,12,13]. Hundreds of BAC libraries have been constructed for microbe, plant and animal species [1,2,6,7,12,13]. However, only a few large-insert BAC libraries are available to date for insect species, especially lepidopteran insects [10,11,14-17]. This could slow progress for the comprehensive molecular and genomics research of these clades.
Moths and butterflies, members of the insect order Lepidoptera, are the second most diverse group of animals, with at least 150,000 named species [18]. They are widespread members of the ecosystem, playing important roles as pollinators and prey, and are among the most destructive agricultural pests. Clearly, Lepidoptera are under-represented in terms of genomic resources and knowledge relative to their biological and economic status. This research was designed mainly to construct comprehensive BAC library resources for two species of moths, the tobacco hornworm, Manduca sexta and the tobacco budworm, Heliothis virescens, and one species of butterfly, the Müllerian mimic, Heliconius erato. These species have genome sizes ranging from 400 to 500 Mb/haploid genome (395 Mb for H. erato [19], 404 Mb for H. virescens [20], and 500 Mb for M. sexta [J. S. Johnston, pers. communication]) and are widely-used models for studying fundamental problems in neurobiology [21], olfaction [22], development [23], and immune responses [24] (M. sexta); host feeding preferences [25] and evolution of insecticide resistance [26] and sexual communication systems [27] (H. virescens); and wing pattern mimicry [28] (H. erato). Moths and butterflies are estimated to have diverged from each other at least 50-60 million years ago [18]. The sphingid, M. sexta, is a member of the same superfamily, Bombycoidea, as the domesticated silkworm, Bombyx mori, the current genome model for Lepidoptera [8,9], and the noctuid, H. virescens, is related to other pest noctuids currently being used for genomic studies including Spodoptera frugiperda [16,29] and Helicoverpa armigera [30]. Here, we report the construction and characterization of six large-insert BAC libraries for these species and the first insight into the constitution and evolution of their genomes.
The libraries will enable a large community of scientists to isolate and study the genes controlling these processes, provide new tools for lepidopteran systematics, and serve as critical resources for comparative genomic studies and genome sequencing of this important group of organisms.

Development of procedures for preparation of high-molecular-weight (HMW) DNA
One of the most important steps toward construction of high-quality BAC libraries is preparation of high-quality megabase DNA. Since no procedure was available for preparation of HMW DNA from these insects, we first developed a method for megabase DNA preparation by testing different DNA isolation buffer systems and tissues collected at different developmental stages of the insects. The results showed that the day-10 pupae (males and females) of M. sexta and day-4 pupae (males and females) of H. virescens and H. erato were most suitable for megabase DNA isolation using a buffer system containing 0.1 M NaCl, 10 mM Tris-HCl, 10 mM EDTA, pH 9.4, and 0.15% β-mercaptoethanol. The DNA isolated with this method was not only large in size (> 1000 kb), but also readily digestible and clonable, thus being well-suited for BAC library construction.

Library construction
The major goal of this study was to develop BAC resources that are widely usable for molecular and genomic studies of the insects, including whole genome physical mapping and sequencing. Therefore, we constructed two BAC libraries for each species with BamHI and EcoRI in the BAC vector pECBAC1. Table 1 summarizes the characteristics of the six BAC libraries constructed. The libraries were named MSB and MSR for M. sexta (MS) and B or R for BamHI or EcoRI, respectively, HVB and HVR for H. virescens, and HEB and HER for H. erato. The insert sizes of the library clones were estimated based on a random sample of 200-300 BACs from each library digested with NotI, a relatively rare cutter in lepidopteran DNA, and fractionated on pulsed-field gels. A typical pulsed field gel pattern for a set of random clones selected from the MSB BAC library is shown in Figure 1. The average insert sizes of the libraries ranged from 150-175 kb, and the proportion of the insert-empty clones was <5%. Each library contained from 19,200 to 21,504 clones which were arrayed into 384-well microtiter plates. Based on the number of clones and average insert sizes of each library, we estimated that the genome coverage of each library ranged from 6 ×-8 × genome equivalents, with the two combined libraries of each species having a genome coverage of 13.0 ×-16.3 × (Table 1). Based on this and previous studies [1,2,6,7,13], these BAC library resources should be well-suited for many kinds of molecular and genomic research, including whole genome physical mapping and sequencing.
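The coverage estimate described above is simple arithmetic: the number of insert-containing clones multiplied by the mean insert size, divided by the haploid genome size. The following is a minimal Python sketch of that calculation. The function name, the optional correction for insert-empty clones, and the example combination of inputs are illustrative assumptions; the actual per-library values are those given in Table 1.

```python
def genome_coverage(n_clones, mean_insert_kb, genome_size_mb, empty_fraction=0.0):
    """Estimate library coverage in haploid genome equivalents.

    n_clones       -- number of arrayed clones in the library
    mean_insert_kb -- average insert size in kb
    genome_size_mb -- haploid genome size in Mb
    empty_fraction -- assumed fraction of insert-empty clones (hypothetical
                      correction; set to 0.0 to ignore it)
    """
    clones_with_insert = n_clones * (1.0 - empty_fraction)
    total_cloned_kb = clones_with_insert * mean_insert_kb
    return total_cloned_kb / (genome_size_mb * 1000.0)


# Illustrative inputs only: the smallest clone count and insert size quoted
# in the text, paired with the 500 Mb genome size reported for M. sexta.
coverage = genome_coverage(19_200, 150, 500)
print(f"Estimated coverage: {coverage:.2f} x")
```

Summing the estimates for the two libraries of a species gives the combined coverage, under the assumption that the two libraries sample the genome independently.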

BAC library screening
As an independent test of genome coverage, to demonstrate the utility of the libraries, and to isolate BACs containing genes of importance, the libraries were robotically double-spotted onto nylon membrane filters in 3 × 3 format and screened with gene sequence-specific probes that are of interest for studies of the lepidopteran models. These included genes involved in olfaction (MsOR1, MsOR3, and HvHR16), nerve axon growth and guidance (MsNos128, MsEph, MsFasII, and MsPlexA), hormone action (MsE75, and HvPTTH), wing patterning (Hewg, Heptc, and HeCi), Bt toxin action (HvAPN120 and Hvcad), and ribosomal protein structure (HvRpS4, HeRpS5, HeRpS9, HeRpL3, and HeRpL10), some of which have also served as anchor loci for comparative linkage mapping [5,31]. Tables 2, 3 and 4 summarize the library screening results. Although the number of hits for individual probes varied widely, which may reflect the uneven distribution of the clones constructed with a single enzyme, the average number of hits in a library was close to the expected genome coverage estimated from the library insert sizes and the genome size.

BAC end sequences (BESs)
To validate the libraries further, estimate levels of contamination by microbial or organellar sequences and obtain some information about the constitution of the insect genomes, one 96-well plate per library, thus two 96-well plates per species, was sequenced from both ends of each clone. A total of 246-299 BESs were successfully generated for each species (Table 5). The sequences had an average read length of 630 nucleotides, with an average of 560 Q20 nucleotides. Analysis of the BESs indicated that there was no evidence of contamination with DNA from organelles, the E. coli host, or other microbes that were potentially carried by the DNA source insects. The sequences are registered in the trace archives of GenBank described in Methods [see Additional file 1].
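The read statistics quoted above (average read length and average number of Q20 nucleotides) can be illustrated with a short Python sketch. It assumes that "Q20 nucleotides" means bases with a Phred quality score of at least 20, and that per-base scores are available as lists of integers; the function names and the toy reads are hypothetical and are not taken from the trace data.

```python
def count_q20(phred_scores, threshold=20):
    """Count bases whose Phred quality score is at least `threshold`."""
    return sum(1 for q in phred_scores if q >= threshold)


def read_summary(reads):
    """Return (mean read length, mean Q20 base count) over a list of reads.

    Each read is given as a list of per-base Phred scores.
    """
    if not reads:
        return 0.0, 0.0
    mean_length = sum(len(r) for r in reads) / len(reads)
    mean_q20 = sum(count_q20(r) for r in reads) / len(reads)
    return mean_length, mean_q20


# Toy data: two very short "reads" with made-up quality scores.
toy_reads = [[35, 40, 12, 20, 8], [22, 19, 30]]
print(read_summary(toy_reads))
```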

The GC and repeat element contents of the insect genomes
To provide a first insight into the constitution of the insect genomes, we analyzed the BESs using the RepeatMasker program, with an emphasis on the contents of GC and repeat elements including retroelements, DNA transposons, simple repeats, and low complexity repeats (  BESs from which a total of 103 repeats (3.16% of the genome) were identified, of which 79 were categorized as low complexity (1.51% of the genome). Similarly, a total of 180.031 kb of BESs generated from the M. sexta BACs were found to contain a total of 125 repeats (3.34% of the genome), of which 103 were categorized as low complexity (1.86% of the genome). Therefore, 3.16-10.01% of the lepidopteran genomes comprised repeat elements, of which a majority was categorized as low complexity repeats. The overall percentage of repeat elements was approximately 3-fold larger for the butterfly than for the moth genomes; further, the percentage of the low complexity repeats was > 4-fold larger for the butterfly than for the moth genomes. Among the low complexity repeats, 11 were longer than 100 bp (244 bp for the largest low complexity repeat), all of which were obtained from H. erato, whereas all remaining low complexity repeats obtained from M. sexta and H. virescens were shorter than 100 bp.
Retrotransposons, transposons and simple repeats were also identified in the BESs, but altogether they comprised <1% of the genomes. Nevertheless, the percentage of simple sequence repeats in the butterfly genome (0.66%) was about 2-fold higher than those of the moth genomes (0.39% and 0.27%). Moreover, a total of 15 retro-elements were identified in the BESs of all three species whereas only 3 DNA transposons were identified, suggesting that retro-elements are generally more abundant than DNA transposons in these genomes.
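The percentages reported in this section are fractions of the sampled BES length. The following Python sketch shows one way such figures could be derived, under the assumption that the repeat-masked sequences are available in soft-masked form (repeats in lowercase, as RepeatMasker can produce with its -xsmall option). The function names and the toy sequences are hypothetical; the sketch does not reproduce the per-class breakdown of the RepeatMasker summary tables.

```python
def gc_content(seq):
    """GC fraction over unambiguous bases (A, C, G, T); case-insensitive."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def pooled_masked_fraction(soft_masked_seqs):
    """Fraction of all bases that are lowercase (masked), pooled over reads."""
    total = sum(len(s) for s in soft_masked_seqs)
    if total == 0:
        return 0.0
    masked = sum(1 for s in soft_masked_seqs for base in s if base.islower())
    return masked / total


# Toy soft-masked reads; real BESs average several hundred nucleotides.
toy_bes = ["ATGCATatatatGC", "TTTTAAAACCGG"]
print(f"GC: {gc_content(''.join(toy_bes)):.3f}")
print(f"Masked: {pooled_masked_fraction(toy_bes):.3f}")
```

Pooling bases over all reads before dividing, rather than averaging per-read fractions, weights each nucleotide equally, which matches a statement of the form "x% of the sampled sequence".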

BLAST hits of the BESs
Using the discontiguous Megablast program to query the database of all organisms available in GenBank, we searched for matches to BESs of the three species by BLASTn after masking with the RepeatMasker program (Tables 6 and 7 [ Table 6). (HER08E08), LQCBU02 (HEB02A01), and LQCBU16 (HEB04B01), had more than 10 discontiguous homologous sequences each to H. melpomene BACs registered in GenBank. These results suggested that this set of BACs may represent homologous regions. Complete sequencing of the three H. erato BACs will provide more information about the extent of microsynteny and evolution between these two species.
In comparison, 859 (83.5%) of the 1,029 hits for H. virescens BESs were from 52 insect species (Table 7). It was also found that most of the homologous sequences were shorter than 200 bp. However, large contiguous homologous sequences (>200 bp) were found to be associated with lepidopteran genes encoding proteins involved in hormone metabolism, structural proteins, and metabolic enzymes [see Additional files 5, 6 and 7]. An independent search of the BESs using BLASTx in GenBank and ButterflyBase [32] yielded an average of 10.7 hits per species with high homology to confirmed coding regions of identified genes at e-values less than 1E-10 and bitscores in a range of 55-383 [see Additional file 8].
Additional hits were to ORFs with high similarity to features associated with retrotransposons, such as reverse transcriptase, gag-pol polyprotein, and endonuclease [33] and non-LTR transposons found in the silkworm genome, such as TRAS and SART [34]. Due to the limited sequencing information, the BLASTn results and discovery of putative genes are presented as potential features to be confirmed by further analysis.

Discussion
We have constructed six BAC libraries for three lepidopteran model species (2 moths and 1 butterfly). These libraries not only have large-insert sizes (150 -175 kb) and deep genome coverage (13 × -17 ×), but also have a low level of insert-empty clones (<5%) and no detected contamination with DNA from organelles and microbes potentially living on the source insects, as indicated by BES analysis. Moreover, the genome coverage and quality of the libraries have been verified independently by screening high-density filters of the libraries with a set of single-copy genes or ESTs. The observation that none of the libraries was contaminated with microbial DNA potentially carried by the source insects was expected, because the self-contained non-feeding pupal stage used as a DNA source for the library construction had purged their guts at the end of larval development. However, we did observe 6, 21 and 20 short sequences in the He, Hv and Ms BESs, respectively, which were homologous to viral, bacterial, and fungal sequences (Tables 6 and 7 [see Additional files 5, 6 and 7]). We believe that the homologues are real, but not from sample contamination because they sit in the middle of BESs. These results perhaps provide a line of preliminary evidence for the presence of microbial sequences in these lepidopteran genomes, possibly by horizontal transfer. Similar findings have been obtained in B. mori [35]. On the other hand, considering the small fraction (~0.5%) of the BAC libraries sampled, a more direct test of organelle contamination could be accomplished by using mitochondrial sequences as probes for hybridization. Furthermore, since the libraries of each species were constructed with two restriction enzymes (EcoRI and BamHI) complementary in the GC content of their restriction sites, the genome coverage should be much better distributed along the genome than those constructed with a single enzyme [13,36]. 
Therefore, these libraries could provide useful resources for comprehensive genomics research of the three model lepidopterans.
The results of this study (Tables 5, 6 and 7) have provided a snapshot of the basic characteristics of the genomes of a group of ditrysian moths and butterflies which diverged from each other at least 50-60 million years ago [18]. First, the genomes of all three species are AT-rich (64-68%), with the genome of the butterfly (H. erato) having an AT content more than 3% higher than those of the moths (M. sexta and H. virescens). Second, the results show that all three insect genomes contain relatively small fractions of repeat elements (3-10%), including retro-transposons, transposons, simple repeats, and low complexity repeats. These results are in agreement with the small genomes of the species (400-500 Mb/1C) which generally tend to contain smaller fractions of repeat elements. Of these three insect species, the butterfly genome contains 3-5-fold more repeat elements (10.01% all repeats), especially low complexity repeats, than the two moth genomes. Papa reported that the total repetitive sequences accounted for about 26% of the genomic regions linked to wing pattern variation in H. erato [37]. The difference could be an effect of more H. erato-specific repeats documented, sampling of a specific region with a higher average repeat density, or both. Third, whereas the three insect genomes all contain a small number (<1%) of retro-elements, DNA transposons and simple repeats, retro-elements seem much more abundant than DNA transposons, and the butterfly genome is two-fold richer in simple repeats than the two moth genomes. Compared with published information from B. mori, the finding of such a low percentage of repeat contents in these three lepidopteran species is surprising, especially for M. sexta, which is in the same superfamily as the silkworm, Bombycoidea. Xia et al. [9] estimated about 20% of the B. mori genome to be composed of "transposable elements"; further, early work based on Cot hybridization kinetics estimated about 45% of the silkworm genome to be composed of repetitive sequences [39]. More recently Osanai-Futahashi et al. reported that the TEs made up 35% of the silkworm genome and contributed greatly to the genome size [40]. One may argue that we might simply have not identified all the relevant repeats in the BESs, but our argument is supported by the following evidence. Large-scale end sequencing of the complete BAC libraries will uncover more detailed aspects of these butterfly and moth genomes, and provide more information for fundamental studies of lepidopteran insects in general.
The BLAST analysis of the sampled BESs has also provided insights into the evolution of these insect genomes. It is not surprising to find the top hits are to the sequences of lepidopteran species, but it is quite surprising that the highest numbers of M. sexta BES hits were to the sequences of other animals and plants rather than to B. mori (Tables 6 and 7). This finding suggests that although all the genomes have undergone changes since the split from the most recent common ancestor, they may have done so along different trajectories, with the M. sexta genome retaining some sequences in common with plants and animals that have been either lost or modified to a greater extent in H. virescens and H. erato. Such a hypothesis can only be tested when more genomic data are available for these lepidopteran insects. Moreover, the BESs of the butterfly (H. erato) are well-matched only to the sequences of H. melpomene. This suggests that not only is the butterfly more closely related to H. melpomene than to the two moth species, as expected, but this group has also diverged to a greater extent, resulting in a higher level of species- or evolutionary lineage-specific sequences. This argument is further supported by the finding that 27 of the 76 species having sequence matches to the BESs of H.

Conclusion
Genomic sequence sample analysis of the moths and butterfly has provided an initial insight into the constitution and evolution of their genomes. Although large-scale genome sequencing is needed to further decipher the genomes of the species, especially their gene contents, the basic characteristics of the repeated sequence portion of each genome are useful information for our understanding of the genomes and their evolution. The high-quality BAC libraries of the insects, together with the gene-containing BACs and BAC end sequences, provide valuable information, resources and tools for comprehensive studies of the insect genomes and for addressing many fundamental questions in Lepidoptera.

Insect Materials
Source of DNA To minimize the potential polymorphism of the source DNA for BAC library construction, we sought insects that were as inbred as possible. For each species, we used progeny from a single pair mating to restrict the potential polymorphism of the insects to a maximum of 4 alleles per locus. Because the source strains were at least partially inbred, we expected significantly less polymorphism at many loci. This strategy also minimized the number of haplotypes in a library, since intra-chromosomal exchange (crossing over) occurs only in lepidopteran males.

BAC library construction
The pECBAC1 vector [41] was used in the library construction [42]. Vector DNA was isolated by the alkaline lysis method, purified by cesium chloride gradient centrifugation, digested completely with either BamHI or EcoRI, and dephosphorylated with calf intestinal alkaline phosphatase. The digested vector DNA was precipitated, dissolved in TE (10 mM Tris-HCl, 1 mM EDTA, pH 8.0), adjusted to 10 ng/μl, and stored at -20°C before use [1,2,13,43].
HMW DNA was isolated from the insects using frozen pupal tissues and buffer system (see Results) according to the procedure described by Wu et al. [14]. The BAC libraries were constructed using an improved procedure developed in our laboratory [1,2,13,43] [46] to confirm the presence of one chromosomal locus. The amount of hybridizing DNA per filter was adjusted to a range of 30-60 ng per filter based on the intensity of signal obtained after initial screening. We routinely used 0.5 M NaCl in the hybridization buffer, but in some cases increased stringency to 0.4 M to reduce background. We re-used filters without stripping until the background became too high to read positive signals reliably or until we detected carry-through; then we treated the filters to remove the probe DNA according to the manufacturer's instructions. Probe DNA was obtained from a variety of sources, including bacterial plasmids containing well-characterized cDNA sequences and PCR fragments amplified from genomic DNA based on sequences registered in GenBank. Insert DNA was amplified by PCR using primers designed from the plasmid vectors or within the insert sequence and purified on Wizard Spin columns (Promega, USA) before labelling with the ECL reagents. Although we were able to detect positive signals on filters hybridized with probes as short as 350-400 bp, probes of 1 kb or greater gave more consistent signal-to-noise ratios.

BAC end sequencing and analysis
Ninety-six clones were randomly selected from each of the six BAC libraries and re-arrayed into a 96-well plate. Both ends of each clone were sequenced using the primer 5'-TAATACGACTCACTATAGGG-3' for the T7 end and 5'-GTTTTTTGCGATCTGCCGTTTC-3' for the SP6 end using the procedure developed at The Institute for Genomic Research [12] [47]. Finally, the masked BESs were BLASTed against the databases of all organisms by using the discontiguous megablast program at NCBI using the default criteria.
BLASTx searches were carried out against non-redundant protein sequences in GenBank (December 2008) and ButterflyBase version 2.92 [32]. Hits with e-values less than 1E-7 and bitscores greater than 50 were evaluated for similarity to coding regions of identified proteins and retrotransposons. High-scoring matches to similar sequences in more than one species were used as a criterion for provisional identification of a bona fide protein.
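The filtering criteria stated above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used: the hit records are hypothetical dictionaries with keys "query", "species", "evalue" and "bitscore" (as might be parsed from a tabular BLAST report), and the multi-species criterion is interpreted as requiring passing hits from at least two distinct species.

```python
from collections import defaultdict


def filter_hits(hits, max_evalue=1e-7, min_bitscore=50.0):
    """Keep hits with e-value below max_evalue and bitscore above min_bitscore."""
    return [
        h for h in hits
        if h["evalue"] < max_evalue and h["bitscore"] > min_bitscore
    ]


def multi_species_queries(hits, min_species=2):
    """Return the queries whose hits come from at least `min_species` species."""
    species_by_query = defaultdict(set)
    for h in hits:
        species_by_query[h["query"]].add(h["species"])
    return sorted(q for q, s in species_by_query.items() if len(s) >= min_species)


# Toy records with made-up identifiers and scores.
toy_hits = [
    {"query": "bes1", "species": "Bombyx mori", "evalue": 1e-30, "bitscore": 120.0},
    {"query": "bes1", "species": "Danaus plexippus", "evalue": 1e-12, "bitscore": 70.0},
    {"query": "bes2", "species": "Bombyx mori", "evalue": 1e-5, "bitscore": 45.0},
    {"query": "bes3", "species": "Bombyx mori", "evalue": 1e-9, "bitscore": 60.0},
]
passing = filter_hits(toy_hits)
print(multi_species_queries(passing))
```

In the toy data, only the first query passes both thresholds in more than one species and would therefore be provisionally retained.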