KONAGAbase: a genomic and transcriptomic database for the diamondback moth, Plutella xylostella

Background The diamondback moth (DBM), Plutella xylostella, is one of the most harmful insect pests for crucifer crops worldwide. DBM has rapidly evolved high resistance to most conventional insecticides such as pyrethroids, organophosphates, fipronil, spinosad, Bacillus thuringiensis, and diamides. Therefore, it is important to develop genomic and transcriptomic DBM resources for analysis of genes related to insecticide resistance, both to clarify the mechanism of resistance of DBM and to facilitate the development of insecticides with a novel mode of action for more effective and environmentally less harmful insecticide rotation. To contribute to this goal, we developed KONAGAbase, a genomic and transcriptomic database for DBM (KONAGA is the Japanese word for DBM). Description KONAGAbase provides (1) transcriptomic sequences of 37,340 ESTs/mRNAs and 147,370 RNA-seq contigs which were clustered and assembled into 84,570 unigenes (30,695 contigs, 50,548 pseudo singletons, and 3,327 singletons); and (2) genomic sequences of 88,530 WGS contigs with 246,244 degenerate contigs and 106,455 singletons from which 6,310 de novo identified repeat sequences and 34,890 predicted gene-coding sequences were extracted. The unigenes and predicted gene-coding sequences were clustered and 32,800 representative sequences were extracted as a comprehensive putative gene set. These sequences were annotated with BLAST descriptions, Gene Ontology (GO) terms, and Pfam descriptions, respectively. KONAGAbase contains rich graphical user interface (GUI)-based web interfaces for easy and efficient searching, browsing, and downloading sequences and annotation data. Five useful search interfaces consisting of BLAST search, keyword search, BLAST result-based search, GO tree-based search, and genome browser are provided. KONAGAbase is publicly available from our website (http://dbm.dna.affrc.go.jp/px/) through standard web browsers. Conclusions KONAGAbase provides DBM comprehensive transcriptomic and draft genomic sequences with useful annotation information with easy-to-use web interfaces, which helps researchers to efficiently search for target sequences such as insect resistance-related genes. KONAGAbase will be continuously updated and additional genomic/transcriptomic resources and analysis tools will be provided for further efficient analysis of the mechanism of insecticide resistance and the development of effective insecticides with a novel mode of action for DBM.


Background
The diamondback moth (DBM), Plutella xylostella, is one of the most harmful insect pests for crucifer crops worldwide. Control of DBM is made difficult because DBM has rapidly evolved resistance to many classes of conventional insecticides such as pyrethroids, organophosphates, fipronil, spinosad, Bacillus thuringiensis (Bt), and diamides. DBM insecticide resistance is mediated primarily by overexpression of detoxification genes [1][2][3][4] and/or gene mutation-derived target insensitivity [5][6][7][8][9]. The life cycle of DBM is very short (approximately 14 days in warm climates), and this is considered as one of the factors for the high insecticide resistance of DBM. Therefore, to efficiently control DBM while suppressing the development of the resistance, it is important to identify and analyze the genes related to the insecticide resistance which will help to (1) clarify the mechanism of DBM insecticide resistance, and (2) to develop new insecticides with a novel mode of action for more effective and environmentally less harmful insecticide rotation.
To accelerate this area of research, a database system which provides genomic and transcriptomic information for DBM is essential. It is important to provide not only sequence data, but also useful annotation information and easy-to-use user interfaces which help researchers to search sequences by homology search, keyword search, and so on through common web browsers. In the Lepidoptera order, to which DBM belongs, the silkworm (Bombyx mori) has been used as a representative model insect owing to its well-studied full genome sequence and several useful databases such as KAIKObase [10], SilkDB [11], and SilkBase [12]. Recently, large-scale genome databases of three other lepidopteran insects, MonarchBase for the monarch butterfly (Danaus plexippus) [13], Manduca Base for the tobacco hornworm (Manduca sexta) [14] and Heliconius butterfly genome project website for the Postman butterfly (Heliconius melpomene) [15], have been developed based on next generation sequencing. Moreover, lepidopteran EST databases such as SPODOBASE for the fall armyworm (Spodoptera frugiperda) [16], WildSilkbase for wild silkmoths [17], and ButterflyBase for various butterflies and moths [18] are all very useful for comparative analysis of lepidopteran insects. Until recently, however, a similar genomic database has not been available for DBM (After public release of KONAGAbase and the submission of this paper, DBM-DB, a genomic database of DBM, was released in late December 2012 [19]).
We describe here the development of KONAGAbase, a genomic and transcriptomic database for DBM which combines comprehensive sequence data with useful annotation information and easy-to-use rich-GUI web interfaces. KONAGAbase is publicly available from our website (http://dbm.dna.affrc.go.jp/px/).

EST sequences
Three cDNA libraries were generated from midgut (4th instar larvae from both sexes), egg (both sexes), and testis (4th instar larvae) of the Bt-toxin susceptible strain (PXS) of DBMs maintained in our research institute. Total midgut RNA was extracted using RNeasy Mini Kit (Qiagen Inc., USA) according to the manufacturer's instructions. Midgut cDNA was synthesized using a cDNA Synthesis Kit (TaKaRa Bio Inc., Japan) in accordance with the Gubler-Hoffman method with oligo(dT) 18 primers and the cDNA was purified by removing smallsized DNA using a CHROMA SPIN + TE-1000 spin columns (Clontech Laboratories Inc., USA). A midgut cDNA library was constructed by cloning cDNA fragments into pBluescript II SK (+) vector (Agilent Technologies Inc., USA) which was then introduced into Escherichia-coli HST08 Premium Competent Cells. Egg and testis cDNA libraries were constructed as follows: total RNA was extracted using a QuickPrep Micro mRNA Purification Kit (GE Healthcare UK Ltd, England) according to the manufacturer's instructions. cDNA was synthesized using SMART RACE cDNA amplification kit (Clontech). First strand cDNA was synthesized using an oligodT primer (5′-TGTGTCTAGAG GATCCGTACCCAGC(T) 30 VN-3′) along with SMART II oligonucleotide (5′-AAGCAGTGGTAACAACGCAG AGTACGCGGG-3′) in the kit. Amplification of cDNA was performed by PrimeSTAR GXL DNA Polymerase (TaKaRa) with SMART technology using the nested universal primer (NUP) (5′-AAGCAGTGGTAACAACGCA GAGT-3′) and 3′-PCR primer (5′-TGTGTCTAGAG GATCCGTACCCAGC-3′) in the kit. An adenine overhang was added to each egg cDNA fragment using Taq polymerase and then the fragments were purified using Sepharose 2B (GE Healthcare). An egg cDNA library was then constructed by cloning the fragments into pGEM-T Easy vector (Promega Inc., USA). Each testis cDNA fragment was amplified using phosphorylated primers and then purified using Sepharose 2B. A testis cDNA library was then constructed by cloning the fragments into Sma I site of pUC18 vector (TakaRa). Each cDNA library was introduced into E. coli HST08 Premium Competent Cells.
Raw EST sequences were generated by sequencing the cDNA clones of the abovementioned three cDNA libraries using ABI 3730 XL sequencer. The generated sequences were base-called using the Phred program [20,21] and low quality regions (quality value (QV) < 20) were removed. EST sequences less than 100 bp were discarded. Next, contamination by vector, adapter, linker or primer in the remaining sequences was detected by cross_match program (−minmatch 12 -minscore 20 options used) and BLAST searches against UniVec database file (VecScreen Search Parameters described in [22] were used). Contamination by E. coli and/or mitochondria sequences were also detected by blastn (cut-off e-value: 1e-40) with mito and ecoli BLAST databases retrieved from NCBI ftp site [23]. Detected contaminating regions were removed and sequences less than 100 bp in length were discarded. The number of high quality EST sequences generated was 35,618 [12,406 (midgut), 6,904 (egg), and 16,308 (testis)].
EST/mRNA sequences of the DBM (2,033 entries, Nov. 2011) were downloaded from the NCBI nucleotide database [24] and preprocessed using the above described procedure. The number of EST/mRNA sequences added was 1,722, resulting in a total of 37,340 EST/mRNA sequences in our database (Table 1).

RNA-seq sequences
Total RNA was extracted from the whole body of the 4th instar DBM larvae (both sexes) of the PXS strain. Preparation of cDNA library from the total RNA and sequencing by Illumina HiSeq 2000 sequencer were performed by Hokkaido System Science Co., Ltd. (Sapporo, Japan). As a result, 201,115,780 single-read (101 bp) RNAseq sequences (total size = 20,313 Mbp) were generated. RNA-seq sequences were de novo assembled by Trinity [25] with ALLPATHS-LG error correction and 147,370 contigs were generated. The GC content was 42.82%. The total, average, median, and maximum lengths were 66.5 Mbp, 451.3 bp, 298 bp, and 11,311 bp, respectively. N50 was 515 bp. The lengths of 9,960 contigs were greater than or equal to 1,000 bp.

WGS sequences
DNA was extracted from the whole body of the 4th instar DBM larvae (both sexes) of the PXS strain and sequenced using a Roche 454 GS FLX Titanium sequencer. As a result, 5,359,190 raw read sequences were generated. Low quality (QV < 20) regions in the sequences were removed by our custom Perl script implemented based on the modified Mott algorithm of the Phred program. Contaminating sequences were removed using the same procedure used for the ESTs. The number of high quality sequences generated was 5,082,983 totaling 1,925 Mbp, representing approximately 5-fold genomic coverage [estimated genome size of DBM is 370 Mbp (data not shown)]. The GC content was 37.70%. The average and median read lengths were 378.8 bp and 406 bp, respectively.

Work-flow for constructing sequences in KONAGAbase
The prepared sequences were processed and analyzed according to the workflow shown in Figure 1. Each process is described below.

De novo assembly of WGS sequences
De novo assembly of 5,083,983 high quality WGS read sequences was performed using Celera Assembler version 6.1 with CABOG pipeline [26]. As shown in Table 2, 88,530 contigs (N50: 2,273 bp) were generated with 246,244 degenerate contigs and 106,455 singleton reads which could not be incorporated into contigs. The size of the contigs is approximately 50% of the putative genome size which is almost equal to the sum of the size of degenerate contigs and singleton reads, which may indicate that many of repetitive elements which could not be incorporated into the contigs are contained in the degenerate contigs and singletons. Therefore, we used not only contigs but also degenerate contigs and singleton reads for de novo repeat identification and gene prediction (described in later sections) to identify as many repeat sequences and genes as possible.

De novo identification of repeat sequences
De novo identification of putative repeat sequences in the assembled WGS sequences were performed using RepeatScout [27] with "-l 16" (16-mers) option. A total of 6,747 sequences, overlapping by 50 bp or more with WGS contigs, degenerate contigs, or singleton reads, and which appeared 10 times or more, were extracted. These 6,747 sequences were used to query NCBI-nt (non-redundant nucleotide) and NCBI-nr (non-redundant protein) with the blastn/blastx algorithms (at a cut off e-value of 1e-05). The sequences matched with known genes (437; except for repeat sequences such as transposons) were discarded since they may represent non-repeat sequences. The number of putative repeat sequences generated was 6,310. The GC content was 39.8%. The average, median, and maximum lengths were 304 bp, 260 bp, and 1,434 bp, respectively. These sequences were used for repeat masking in the analyses described below.

Clustering and assembly of transcriptome sequences
To provide a non-redundant, comprehensive transcriptomic DBM dataset, 37,340 ESTs and 147,370 RNA-seq contigs were clustered using CLOBB2 [28] and then assembled   Percentage values in the "Total" columns are the ratio of the values to the sum of the length of all the assembled sequences. N50 of the contig sequences is 2,273 bp.
using CAP3 [29]. Before performing the clustering, repeat and low complexity regions in the sequences were masked using RepeatMasker [30] with our putative repeat sequences, after which poly-A tails were trimmed using trimest [31]. Sequences of fewer than 100 non-masked bases ( [34] and sixpack [31], a total of 84,562 predicted CDSs (more than 15 amino acids) were identified in both the nucleotides and amino acids.

Gene prediction
Although the generated WGS contigs are relatively small compared with those of model insects such as B. mori, those data are still useful for comprehensive preliminary search of target genes. Accordingly, in order to provide useful preliminary information of putative genes, ab initio gene prediction was performed against the assembled WGS sequences using AUGUSTUS [35], SNAP [36], and geneid [37] with exonpart/intron hint files (EST/mRNA sequences and RNA-seq contigs of DBM) and with optimized parameter settings for more accurate gene prediction of DBM calculated by us based on the instruction documentation of each software program. Consensus predicted genes were generated by combing results of the three gene prediction programs using EVidenceModeler [38]. Predicted genes with short CDSs (fewer than 200 bp) and/or with transposon-related description identified in the same manner as for the unigenes were discarded. The number of genes generated was 34,890 CDSs of the predicted genes (both the nucleotide CDS and amino acid CDS were generated for each gene). The GC content was 54.8%. The total, average, median and maximum lengths were 16.8 Mbp, 481.1 bp, 360 bp and 11,460 bp, respectively. In the predicted gene CDSs, the number of putative complete genes (including start and stop codons) was 10,682 (30.6%) and the number of putative partial genes was 24,208 (69.4%).

Generating a putative gene set
To provide a comprehensive putative gene set of DBM, the unigene and predicted genes were clustered by CLOBB2 using criteria identical to those used for the clustering of the unigene, and 72,581 clusters were generated. Of these clusters, 39,781 whose member sequences have no predicted CDSs or no similarity with known domains (Pfam protein database searched by HMMER3) or known proteins (NCBI-nr database and eight gene sets of model insects searched by BLAST as described later) were detected, and 39,781 representative sequences (the longest sequence in each cluster) were selected as a putative unknown gene set in which novel genes might be included. In the remaining clusters, 32,800 representative sequences (Table 3) were selected as a putative gene set based on the criterion in which a sequence with highest BLAST score is preferred (HMMER3 score and sequence length is used in a tie-break). As shown in Table 3, the number of representative sequences derived from unigenes is 20,870 (63.6%) and that derived from predicted gene CDSs is 11,930 (36.4%).

Annotation
The unigene, predicted genes, and the putative gene set were annotated with homology search by BLAST, domain/motif search by HMMER3 (Pfam protein database was used), and GO terms allocated by Blast2GO [39] (the result of the BLAST search for NCBI-nr database was used by Blast2GO). The sequences were compared with various databases for the BLAST search (cutoff e-value: 1e-05), namely, NCBI-nr database, UniRef90 database [40], gene sets (protein sequences) of three lepidopteran insects (Bombyx mori (gene set by GLEAN [10,11] Utility and discussion

Database system
We developed a database system for integrating the generated DBM sequences and associated annotation data, and for providing user interfaces to access to the data by standard web browsers. Figure 2 shows the feature block diagram of KONAGAbase. The following web interfaces are provided for efficient and easy searching, browsing, and downloading sequences stored in the database: (1) BLAST search form, (2) keyword search form, (3) BLAST result-based search form, (4) GO tree-based search form, and (5) genome browser powered by GBrowse [47]. All sequences can be downloaded as a zipped FASTA file from a download page. The architecture of the database system consists of three parts: (1) back-end databases, (2) server-side programs, and (3) web interfaces. Figure 3 shows the overview of the database architecture.

Back-end databases
Generated DBM sequences and associated annotation data were stored using three back-end database software: MySQL (relational database management system), Sary (suffix array library and tools for fast full-text search for huge text files) [48], and BLAST database (generated by formatdb of NCBI BLAST). FASTA format data (sequence ID, string of nucleotides/amino acids, and description based on BLAST top-hit of NCBI-nr) of all DBM sequences were stored into (1) the MySQL database by our in-house Perl scripts using Perl DBI module, and (2) BLAST databases by formatdb. GFF (General Feature Format) version 3 format data required for the GBrowse were also stored into the MySQL database by bp_seqfeature_load.pl provided by BioPerl [49]. The GFF data contains start to end positions of assembled WGS sequences, exons positions and strands of the predicted genes, the unigenes, and the putative gene set on the assembled WGS sequences. Annotation information of each sequence data (tabbed text files containing each sequence ID and corresponding annotation strings) were converted into suffix arrays by mksary provided by Sary.

Server-side programs
Server-side operations at KONAGAbase are performed in a Linux server (CentOS 6.3). An Apache web server handles queries from web clients through Perl CGI scripts, both generated in-house to perform search functions, and those included in GBrowse. In-house Perl scripts query the suffix arrays using the Perl Search:: Saryer module provided by Sary for fast text searches, and send a response to the web interfaces through the web server. Requests for nucleotide and amino acid sequences, as well as for GFF3 data and sequences on GBrowse are handled as queries to a MySQL database via a DBI module.

Web interfaces
Web interfaces work on client-side standard web browsers and provide users access to KONAGAbase by various search functions. The web interfaces except for GBrowse were implemented by us using HTML, JavaScript, and CSS library with Yahoo! User Interface (YUI) [  etc., and AJAX (Asynchronous JavaScript and XML) communication libraries for asynchronously sending/receiving data with server-side programs. GBrowse was installed on the Linux server according to the documentation of the software package.

BLAST search form
A homology search against each type of generated sequence can be performed using the BLAST search form. The result of BLAST search is presented in a rich GUIbased table in a comprehensive summary, and the result (a raw result text and a FASTA file of hit sequence(s)) can be readily selected and downloaded.

Keyword search form
Each type of sequence can be searched and browsed in the keyword search form. Keywords of "Sequence ID," "BLAST top hit," "Pfam ID/Description," and "GO term ID/Description" can be used in the search form (currently, only keyword search of Sequence ID is available for assembled WGS sequences

Search by GO term tree viewer
Annotated sequences can also be searched using the GO Tree-based search form in which a graphical tree GO term viewer is provided. Each GO term annotated to the selected type of sequences is presented as a node in the viewer. In each node, the number of sequences annotated with the GO term(s) in the sub-tree of the node is presented. One or more GO terms can be selected using the checkboxes next to each node and sequences annotated with the selected GO terms can be browsed in the summary table. For example, a list of unigenes whose molecular functions are related to transporter activity can be easily searched by checking GO:0005215 (transporter activity) in the sub-tree of molecular function in the GO term viewer and clicking a search button.

Genome browser powered by GBrowse
GBrowse provides bird's eye and detailed views of the assembled WGS sequences. Three tracks can be selected on the detailed view: (1) predicted gene, (2) unigene, and (3) putative gene set. In each track, exons of each sequence on a selected assembled WGS sequence are visualized. Locations of exons on the genome were determined by EVidenceModeler (for the predicted genes) and exonerate [53] (for the unigenes). A popup balloon is displayed by clicking a sequence on the detailed view, providing links to annotation information page and to detail of exons position page.

Data analysis
Homology search result of the putative gene set  [54]. Homology searches by blastn and tblastx (blastx for NCBI-nr) were performed against each database. The e-value cutoff was 1e-05 and the percent identity cutoffs were 80% for blastn and 30% for blastx/tblastx, respectively. The classification was performed based on the description of the top hit sequences of the BLAST search. As shown in Table 5, 2,399 sequences (38.0%) were classified as transposable elements, and the remaining 3,911 sequences (62.0%) were unclassified, and may contain novel repeat sequences. Approximately 74% of the transposon sequences were classified as non-LTR transposon (LINE/ SINE), and the remaining sequences were classified as LTR transposons (20%), DNA transposons (2.5%), and others (3.5%). RepeatMasker was performed with the putative repeat sequences against all the WGS assembled sequences. Therefore, 32.51% of WGS assembly sequences were masked, of which 9.24% were masked by the putative transposon sequences and the remaining 23.27% were masked by the unclassified repeat sequences.
More represented genes among the sequenced transcripts in the midgut, egg, and testis The top 10 more represented genes among the sequenced transcripts in three tissues (midgut, egg, and testis) were extracted by counting the number of EST sequences contained in the unigene sequences. The genes of the unigenes were identified based on the annotated description of the proteins (NCBI-nr) by BLAST search and on the description of the Pfam domain using HMMER3 search. In the midgut, as shown in Table 6, genes encoding digestive enzymes (serin proteases, lipases, and carboxypeptidases) are the most represented. In particular, serine protease genes are more significantly represented in the digestive enzymes (approximately 30% of ESTs in the midgut). In the case of housekeeping genes, cytochrome c oxidase genes, ribosomal protein genes, and ATP synthase genes are more represented in the midgut. As shown in Tables 7 and 8, these three genes are also more represented in the egg and the testis (with the exception of the cytochrome c oxidases in the testis). Glucosinolate sulfatase genes, ranked seventh in expression in the midgut, are very important genes for the crucifier specialist DBM. The enzymes encoded by these genes prevent the formation of toxic products arising from the glucosinolates of crucifier plants, thereby enabling DBM to feed on these plants [55]. In the case of fibroin genes, ranked ninth, 10 genes contain a "Fibroin P25" domain (Pfam ID: PF07294) and one gene contains a "Fibroin light chain (L-fibroin)" domain (Pfam ID: PF05849). All the 11 genes have BLAST hits with at least one of the three lepidopteran insects (B. mori, D. plexippus, M. sexta). However, this result may be because of contamination of the silk gland, in which fibroin is known to be expressed, in the midgut. To validate the expression of the fibroin genes in the midgut, we performed RT-PCR on RNAs extracted from carefully collected additional midguts and silk glands of the fourth instar larvae using primers for the 10 genes containing the "Fibroin P25" domain and the single gene with the "Fibroin light chain" domain. Transcripts encoded by the 10 genes containing the "Fibroin P25" domain were detected in the midgut, whereas they were not detected in the silk gland (data not shown). On the other hand, the transcript encoded by the single gene containing the "Fibroin light chain" domain was not detected in the midgut, whereas it was detected in the silk gland (data not shown). It is at least likely, therefore, that fibroin genes containing the "Fibroin P25" domains are indeed expressed in the midgut. Currently we have no biological hypothesis for this result. In future experiments, we will analyze in more detail the expression of the fibroin genes in the midgut. The 11 unigene sequences can be easily searched using the keyword search form of KONAGAbase as follows: (1) perform a keyword search with the word "fibroin"; and (2) sort the result table in descending order of the number of EST sequences by clicking the column title of "ESTs (midgut)." In the egg, as shown in Table 7, seven housekeeping genes (encoding ribosomal proteins, cytochrome c oxidases, ATP synthases, heat shock proteins, actins, elongation factors, and myosins) account for the top six and ninth most represented genes. Cuticular protein genes, ranked seventh, and nucleoplasmin-like protein genes, ranked eighth, are weakly represented in the other two tissues. As shown in Table 8, zinc finger protein genes, ranked tenth, are also more represented in the testis, in which they are ranked fifth.
In the testis, (Table 8), arylphorin-like hexamerin genes are the most represented. In insects, hexamerins are storage proteins which are mainly synthesized by the "Percentage of masked bases" is the percentage of masked bases against the total bases of the WGS contigs, degenerate contigs, and singletons. fat body and serve as sources of amino acids for pupae and adults during metamorphosis [56,57]. In the honey bee (A. mellifera), it has been confirmed that the hex 70a gene (a hexamerin classified as an arylphorin) is expressed not only in the fat body but also in the ovary and testis, which may imply previously unknown roles for the gene in the developing gonads of the honey bee [58,59]. In DBM, two arylphorin-like hexamerin genes (PxAry1 and PxAry2) have been identified [60], although the expression of the genes in the testis was not evaluated in these studies. All 27 identified unigenes in the testis (these sequences can also be easily searched by the keyword search form with the word "arylphorin-like hexamerin" and sorting with "ESTs (testis)" column) are highly similar (approximately 70%-100% identities) with one of the two genes. Although we carefully removed cells which were adhered to the surface of the testes prior to RNA extraction, these cells, which we consider to represent the fat body, may have contaminated the testes material. To validate the expression of genes in the testis, we performed RT-PCR on RNAs extracted from carefully collected additional testes and previously collected testes of the fourth instar larvae using primers for the two genes. We found that transcripts encoded by both genes were detected in both RNA samples when cycle number of RT-PCR is 30 (data not shown). However, the two genes were barely detectable in the additionally collected RNAs when the cycle number was reduced to 21, whereas they were detected in previously collected RNAs (data not shown). It is likely therefore that the elevated expression of both genes was due to contamination of the fat body into the testis. In future experiments, we will analyze in more detail the expression of genes in the testis.

Classification of three insecticide resistance related genes
Cytochrome P450 and Glutathione S-transferase (GST) are important detoxification enzyme genes in insects, and an ATP-Binding Cassette (ABC) transporter is considered to be involved in detoxification [61]. These gene superfamilies in the putative gene set were identified by keyword search and then classified by two classification methods. One method is based on unidirectional best hit (UBH) of blastp search. The other method is phylogenetic tree-based method which were used for the classifications of the three gene superfamilies  in B. mori [62][63][64][65]. In general, the former method requires less computational cost but the latter method is more accurate. Firstly, putative genes annotated with the corresponding highly conserved domains (p450 domain (Pfam ID: PF00067) for Cytochrome P450, GST N-terminal (GST_N) domain (Pfam ID: PF02798) for GST, and nucleotide-binding domain (NBD) (Pfam ID: PF00005) for ABC transporter) were extracted by performing the keyword search against the putative gene set using the corresponding Pfam ID. GST_N domain-containing putative genes annotated with description of non-GST genes such as elongation factor 1 gamma and gangliosideinduced differentiation-associated-protein were discarded. For each extracted putative gene, an amino acid sequence of the corresponding domain was extracted based on the alignment position information in HMMER result. All-to -all comparisons using blastp search were performed for the extracted domain sequences in each gene superfamily, respectively. The domain sequences with 98% or higher identity and coverage were manually merged for removing redundancy. We denote the set of the extracted domain sequences as DS1. Since the putative gene set of DBM is relatively short and fragmented compared with those of other model insects, for more strict classification, we extracted relatively long domain sequences whose lengths are more than half length of the corresponding Pfam domain (37 aa for GST_N domain; 231 aa for p450 domain; 59 aa for NBD) from the DS1. We denote the set of the relatively long domain sequences as DS2. Table 9 summarizes the numbers of genes containing the domain sequences in the DS1 and the DS2, respectively. We speculate that the exact number of genes in the putative gene set may lie between the number of genes in the DS1 and the DS2.
UBH based classification was performed for the DS1 and the DS2 by performing blastp search (cutoff e-value: 1e-05) against corresponding known genes in insects (1,602 P450 genes retrieved from Cytochrome P450 Homepage [66], 231 GST genes and 2,890 ABC transporter genes retrieved from NCBI database), respectively. The family/class of each DBM gene was identified based on the family/class description in the BLAST top hit. Phylogenetic tree-based classification was performed for the DS2 with the corresponding domain sequences in B. mori (23 GST_N domains in 23 GST genes [62]; 79 p450 domains in 84 P450 genes [63]; 70 NBDs in 53 ABC genes [65]) which were extracted in the same manner for the DS2. For the classification of P450 genes, four p450 domain sequences of other insects (CYP321A1 in Helicoverpa zea, CYP6BD6 in Manduca sexta, CYP347A1 in Tribolium castaenum, and CYP304F2 in Zygaena filipendulae [66]) were also used. To build a phylogenetic tree of each gene superfamily, amino acid sequences of each gene superfamily were aligned using ClustalW [67] and neighbor-joining method in MEGA5 [68] was performed with a bootstrap of 1,000 replicates, Poisson model for amino acid substitution, and pairwise deletion of gaps. The family/class of each DBM gene was identified based on the family/class descriptions of genes located in the same subtree.
The classification results are shown in We denote the classification results by the UBH based method as DS1-UBH and DS2-UBH, and that by phylogenetic tree-based method as DS2-PTB. Generated phylogenetic trees of P450, GST, and ABC transporter genes are shown in Figure S1-S3 (Additional files 1, 2 and 3), respectively. For P450 genes, phylogenetic subtrees of each clan are shown in Figures S4-S7 (Additional files 4, 5, 6 and 7), respectively.
In the case of P450 genes, as shown in Table 10, the numbers of putative genes are very different between the DS1 (137 genes) and the DS2 (61 genes). Especially, there are large difference in P450 families in CYP3 clan (26 more genes in the DS1) and CYP4 clan (39 more genes in the DS1), which may be due to many fragmented genes in the DS1. On the other hand, between two different classification methods for the DS2 (DS2-UBH and DS2-PTB), the numbers of genes classified in each clan are equal, and only two genes in two P450 families (CYP4 and CYP367 families in CYP4 clan) are differently classified. The number of P450 genes in the putative gene set is less than that of B. mori (84 genes in 26 families [63]) when compared with that in the DS2, but greater than that in the DS1. In total, the putative gene set covers up to 27 P450 families, which includes 24 families identified in B. mori, which may indicate high gene coverage of the putative gene set. In the case of GST genes, as shown in Table 11, the numbers of putative genes are almost the same between the DS1 (22 genes) and the DS2 (20 genes). Genes in Delta, Sigma, and Zeta classes are equally classified between all methods. Although the number of genes in Epsilon, Theta, and unclassified classes are different between the DS2-UBH and the DS2-PTB, only two genes are differently classified in total. The number of GST genes in the putative gene set is less than that of B. mori (23 genes [62]), but the difference is small. In total, both the numbers of genes in the DS1 and the DS2 are less than that of B. mori only in Epsilon class.
In the case of ABC transporter genes, as shown in Table 12, the numbers of putative genes in the DS1 and the DS2 are 70 and 54, respectively. Especially, the classification results are exactly the same between the DS2-UBH and the DS2-PTB. The number of ABC transporter genes in the putative gene set is more than that of B. mori (53 genes [65]). It is likely that DBM has more xenobiotic resistance-associated ABC transporter genes (ABCB, ABCC and ABCG families [61]) than B. mori.
In the above results, the differences between the two classification methods for the DS2 are small. Therefore, our classification result for the DS2 may be useful for gene classifications of other insects even when the simple UBH based method is used. In order to evaluate this, classifications of the above three gene superfamilies in two model insects, B. mori and D. melanogaster, were performed by the UBH based method using our classification result of the DS2-PTB. The domain sequences of each gene superfamily in D. melanogaster (37 GST_N domains in 37 GST genes [62], 86 p450 domains in 90 P450 genes [69], and 87 NBDs in 56 ABC transporter genes [70]) were also extracted in the same manner for the DS2. For P450 genes, the genes in common gene families between DBM and each model insect (66 B. mori genes and 59 D. melanogaster genes) were evaluated. As a result, in B. mori, the numbers of classified The number of putative GST genes in each GST class in DBM was estimated using two sets of GST N-terminal domain (Pfam ID: PF02798) sequences (DS1 and DS2). UBH based classification was performed for DS1 (DS1-UBH) and DS2 (DS2-UBH), respectively. Phylogenetic tree-based classification was performed for DS2 (DS2-PTB). The number of GST genes in Bombyx mori was obtained from [62]. The number of putative P450 genes in each P450 clan/family was estimated using two sets of p450 domain (Pfam ID: PF00067) sequences (DS1 and DS2). UBH based classification was performed for DS1 (DS1-UBH) and DS2 (DS2-UBH), respectively. Phylogenetic tree-based classification was performed for DS2 (DS2-PTB). The number of P450 genes in Bombyx mori was obtained from [63].
ABC transporter and GST genes in each family/class by the UBH based method are exactly the same with those of original classifications while the same is true for ABC transporter and P450 genes in D. melanogaster. In B. mori, only two genes in CYP367 family were wrongly classified as genes in CYP4 and CYP340 families (all the three families belong to CYP4 clan) while nine GST genes in Epsilon class (14 Epsilon genes in original classification) were wrongly classified as genes in Delta class (11 Delta genes in original classification). Thus, it is likely that our classification results for the three gene superfamilies in DBM are useful for the classification in other insects, and may be more useful for that in lepidopteran insects.

Conclusions
We have constructed KONAGAbase, a genomic and transcriptomic database for DBM. KONAGAbase provides DBM comprehensive transcriptomic and draft genomic sequences with useful annotation information and easy-to-use rich GUI-based web interfaces, which enables researchers to efficiently search target sequences, such as insect resistance-related genes. The database will be continuously updated and additional genomic/ transcriptomic resources and analysis tools will be provided for further efficient analysis of the mechanism of insecticide resistant of DBM and the development of effective insecticides based on novel modes of action.