Strain selection and DNA extraction
The laboratory strain of A. sinensis used in this study has been inbred within the lab since 1984 and, never been exposed to pesticides. These mosquitoes were reared at 26 ± 1°C and 75 to 85% humidity, under a 10:14 h light:dark cycle. Genomic DNA was extracted from 300 adult females and 300 males (2 to 3 days post adult emergence) according to methods described in . To prevent RNA and protein contamination, extracted DNA was treated with RNase A and proteinase K and, subsequently, precipitated with ethanol.
Whole-genome sequencing and assembly
We employed a whole-genome sequencing strategy with Roche/454 GS FLX. We constructed a total of five single-end and seven mate-pair sequencing libraries with insert sizes of about 3 Kb, 8 Kb and 20 Kb from 1 μg, 5 μg, 30 μg and 60 μg of starting DNA. In total, we generated 4.16 Gb of data of sequencing reads ranging from 40 to 1196 bp. To reduce the effect of sequencing error during assembly, we undertook a series of checking and filtering steps in assembling the reads generated. By using stringent criteria, 3.34 G of high quality data were incorporated into the final de novo genome assembly.
The Lander-Waterman algorithm  were used to estimate the genome size of A. sinensis. K-mer analysis for single-end reads  revealed a frequency distribution that conformed to the Poisson expectation when K-mer was equal to 13. The value of expected depth was calculated based on the lambda, a parameter of possion distribution. The genome size of A. sinensis was then calculated using the total K-mer number divided by the expected depth value.
Whole-genome assembly was carried out with a Celera Assembler V6.1 for the remaining 454 reads . The revised pipeline (called Celera Assembler with the Best Overlap Graph, CABOG) was robust to uncertainty in homopolymer run length, high read coverage and heterogeneous read lengths. We utilized the following modules of the Celera Assembler software for successive phases of the assembly: pairwise overlap detection; initial ungapped multiple sequence alignments, called unitigs; unitig consensus calculation; combining unitigs with mate constraints to form contigs and scaffolds that were ungapped and gapped multiple sequence alignments; and, finally, scaffold consensus determination. Because the genome used for sequencing were constructed from whole adult mosquitoes, contamination from bacteria in gut or adhering on the surface were inevitable. To check for possible microbial contamination of the assembly, we screened scaffolds against the NCBI NT database using query alignment and identity cut-off of 90% and e-value cut-off of 1e-6. When the top hit was bacterial species, this scaffold was removed.
In order to assess the assembly quality, the transcriptome was sequenced and aligned to the scaffold sequences using Blat with default parameters . Assembly quality was also assessed by mapping the 454 Single reads to the scaffolds using BWA. The mapped regions (consensus sequences) with depth over 3X were extracted for SNVs and INDEL variation analysis, which represent potential base error and short indel error rate in the genome, respectively . Additionally, presence of CEGs was evaluated for the genome assembly (http://korflab.ucdavis.edu/Datasets/cegma/submit.html) [47, 48].
Identification of repetitive elements
The identification of repetitive elements is essential for genome sequencing, as unidentified repetitive elements can affect the quality of gene predictions, annotation and annotation-dependent analyses . Two methods were adopted for masking repeat regions in A. sinensis. First, RepeatMasker V3.3.0 (http://www.repeatmasker.org/) was applied against the Repbase library (species Anopheles) based on the scaffolds. Then, RepeatScout V1.0.5  software was used (with frequency set to ≥50) to build a repeat regions database by providing scaffolds and potentially repeat sequences. These results were merged with the results of the transposable elements for mosquitoes, which were downloaded from TEfam database (http://tefam.biochem.vt.edu/tefam/). Finally, these merged results were reprocessed with RepeatMasker.
To predict genes, we used two independent approaches: a homology-based method and a de novo method. The results of these two methods were integrated by the EVidenceModeler utility and then filtered multiple times and also checked manually. The reference protein sequences for protein alignment were obtained from VectorBase (for the aforementioned three sequenced mosquito species) and the NCBI database (for Culicidae species). CD-HIT software was used to cluster these protein sequences with 100% global similarity . AAT  and Genewise  software were used to align the protein data to the masked scaffolds. By comparing the databases, we obtained the number of protein distributions.
Four ab initio gene prediction programs were run on the genome: SNAP , Augustus , GlimmerHMM , and Genezilla  with the model trained using the published mosquito gene information (A. gambiae, Ae. aegypti and C. quinquefasciatus).
Quality of protein-coding gene predictions
To estimate the accuracy of gene prediction, we undertook a consistency check for the protein length of single-copy orthologs between A. sinensis and D. melanogaster. Considering the high conservation of single-copy orthologs, the protein length should have a high coherence between two species . The protein lengths of the two species were plotted as a scatter diagram and analyzed with a regression analysis. We compared the results of this regression analysis with results from the published literature.
Identification of noncoding RNA genes
tRNA genes were predicted by tRNAscan-SE-1.23 with eukaryote parameters . The rRNA fragments were identified by aligning the rRNA template sequences from the SILVA database  and RNAmmer database, by using BlastN at E-value 1e-5 with cutoff of identity ≥95% and match length ≥50 bp. It is important to note that rRNA genes in the A. sinensis genome were combined by aligning the 5.8S, 18S, 25S and 28S regions of databases using BlastN. miRNA was predicted by BlastN against the hairpin sequences from miRBase database (RELEASE 17) with E-value 1e-3, allowing no less than 70 bp alignment length, and requiring no less than 85% overall identity and 80% coverage.
Gene functions were assigned according to the best match of the alignments using Blast and BlastP (query coverage ≥50%; E-value: 1e-10) against the NCBI NR protein database. All predicted protein-coding genes were obtained with the InterProScan analysis tool . According to features of the predicted protein sequences, the InterProScan analysis was based on the active site, the binding site, the conserved site, the domain, the family, the PTM, and the repeat. Gene Ontology (GO) IDs for each gene were obtained from the corresponding InterProScan entry. All genes were aligned against the KEGG proteins, and the pathway in which the gene might be involved was derived from the matching genes in the KEGG. SignalP 4.0 server was used to predict the presence and location of signal peptide cleavage sites in the amino acid sequences . This method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks. TMHMM software  was used with default values to predict the transmembane region based on a hidden Markov model.
Gene orthology prediction
The gene orthology predictions were generated by the Ensemble Gene Tree method , which is based on the PHYML algorithm for multiple protein sequence alignments, and uses MUSCLE for each gene family that contains sequences from all five species (A. sinensis, A. gambiae, Ae. aegypti, C. quinquefasciatus and D. melanogaster). Gene trees were reconciled with the species trees using the RAL algorithm to call duplication events on internal nodes and to root the trees. The relations of orthology were inferred from the results of each gene tree.
Defining gene families
The PANTHER hidden Markov models V7.2, annotated to different functional gene families, were used with default parameters (i.e. E-value: 1e-3) to classify all gene models of A. sinensis. Immune-related gene sets were downloaded from ImmunoDB resource (http://cegg.unige.ch/Insecta/immunodb) and subjected to inspection, curation, and phylogenetic analysis. Based on these gene sets, we re-annotated the proteins in the A. sinensis genome by Blast search, and counted the number of A. sinensis genes in each functional gene set. The threshold E-value in the Blast search was set to 1e-3, while the similarity was set to 0.35.
Construction of microsyntenic blocks
CHSMiner V1.1  was used to construct the microsynteny map for A. sinensis and the other three previously sequenced mosquito species. Briefly, the program used the orthologs between two genomes as anchors, and merged two anchors into a block if they were located less than a specified gap size apart. We used default values for parameters and set the minimum length to 100 Kb. Each microsynteny detected was evaluated by corrected P-values; only those results with the P-values less than 1e-5 were preserved.
M-Coffee V9.0 program  was used to perform the multiple alignment of proteins in each family. A phylogeny tree was constructed based on the 3,470 single-copy families in the five species (A. sinensis, A. gambiae, Ae. aegypti, C. quinquefasciatus and D. melanogaster). We used the Phylip package V3.69  to build the maximum likelihood (ML) tree for each protein family under the JTT substitution model. Then the SuperTree software was used to get an integrated supertree. To evaluate the topology of the supertree, we performed a bootstrap resample analysis using 100 resamples from the original tree.