Genome sequencing and assembly
We improved the published genome assembly of 2.7 million Sanger reads of the honey bee genome, version Amel_4.0 [1] by incorporating ABI Solid sequence and Roche 454 paired-end sequence to superscaffold and gap-fill the Amel_4.0 assembly using the Atlas-Link software [55]. The new sequence data as well as the existing sequence data were used to link the genome contigs from Amel_4.0 into more contiguous scaffolds. Adjacent contig sequences within these scaffolds were assessed and overlapping and redundant contigs were merged.
We used the paired-end reads for superscaffolding and intra-scaffold gap filling. The Amel_4.5 assembly contains new scaffolds formed from merging existing scaffolds and filling some intra-scaffold gaps with other scaffolds or contigs.
The newly formed scaffolds were anchored to the same linkage group that their member contigs were anchored to in Amel_4.0. Ten of the new scaffolds could be anchored to two different linkage groups so a manual break was inserted to split these scaffolds for consistent anchoring.
New data for gene prediction
RNAseq data
We sequenced samples from a number of tissues using the 454 Titanium technology. Tissues from ovary, testes, mixed antennae (worker, drones, and queens), larvae, mixed embryos, abdomen, and a combined library from brain and ovary samples were sequenced.
The testes cDNA library was prepared with the Clontech “full-length” amplification protocol. A gel cut of 400 to 800 bp fragments was combined with nebulized products from the larger cDNA fragments to generate fragments of an optimum sized sequencing library for the 454 Titanium platform. Other libraries were prepared by isolating total RNA using Trizol and RNeasy columns followed by mRNA isolation using the Qiagen kit and cDNA genration using the Invitrogen Superscript kit #11917-010 with random primers. A gel cut of 400 to 800 bp fragments was combined with nebulized products from the larger cDNA fragments to generate fragments of an optimum sized sequencing library for the 454 Titanium platform.
Peptide data
Honey Bee peptide atlas
We performed liquid chromatography-tandem mass spectrometric (LC-MS/MS) analysis of bee protein extracts from 253 samples, representing three castes, larvae and virtually all adult honey bee tissues in both sexes and of different disease states. Tissue collection and interpretation of the LC-MS/MS data has largely been described elsewhere [56–61] and is available for public download from the Honey Bee Peptide Atlas [62]. The raw data from these studies, amounting to more than 8 × 106 tandem mass spectra, were searched against a six-frame translation (3,596,047 forward sequences, with reversed complemented sequences and common contaminants concatenated) of Amel_4.5 using Mascot (v2.3, Matrix Science). This exercise identified 30,622 unique peptide sequences at a false discovery rate of 1%, of which 5,834 peptides mapped to regions of the genome, where genes were previously unknown in Amel_4.0. As these results were based solely on a basic translation of raw genomic sequence, these peptides represent only tryptic peptides wholly within a single exon.
Venom peptides
We analyzed the honey bee worker venom proteome by integrating a combinatorial peptide ligand library (CPLL) with liquid chromatography/Fourier transform ion cyclotron resonance (LC-FTICR) MS/MS. The collection and interpretation of the MS/MS data are described elsewhere [35]. Gene prediction datasets and a six-frame translation of Amel_4.5 were searched using Mascot (v2.3, Matrix Science). Setting the significance threshold at p < 0.01 led to a peptide false discovery rate of 5.23% for the search of the AUGUSTUS AU9 gene set, 2.51% for the AUGUSTUS AU11 search and 3.77% for the Amel_4.5 NCBI RefSeq search. MS/MS data generated from the CPLL flow-through fractions, and the elution fractions separated by Tris-glycine- or Tris-tricine-SDS-PAGE gels were separately searched against the genome six-frame translation resulting in false discovery rates of respectively 1.56%, 3.75% and 3.44%.
Annotation methods
Summary of gene prediction sets
Genes were predicted using a variety of methods including the NCBI RefSeq and Gnomon pipelines, AUGUSTUS, SGP2, GeneID, Fgenesh++ and N-SCAN. Unless stated otherwise below, gene prediction pipelines used the masked assembly available from NCBI [63], which was generated using RepeatMasker [64]. New transcriptome data was used either directly as evidence or in the generation of training sets for the predictors. The Augustus analysis also incorporated available peptide data, and the N-SCAN analysis leveraged nucleotide sequence conservation between the A. mellifera genome and the other bee genomes. SGP2 leveraged conservation between the A. mellifera genome and the previously published Nasonia genomes [23] based on translated alignments.
NCBI RefSeq and Gnomon
The Amel_4.5 assembly was annotated with NCBI's eukaryotic genome annotation pipeline (version 3.0), as described at [65]. The NCBI pipeline uses a repeat-masked assembly generated by WindowsMasker [66] and transcript and protein alignment evidence supplemented with ab initio prediction to annotate coding and non-coding transcripts and proteins. The evidence used for this annotation run included alignments of:
-
1)
9 million RNAseq reads from 454 sequencing (described above), which were treated as ESTs in the pipeline.
-
2)
All available honey bee ESTs and mRNAs.
-
3)
Proteins from the FlyBase annotation of Drosophila melanogaster (release 5.30).
-
4)
Proteins from the RefSeq annotation of human.
-
5)
Proteins annotated on insect mRNAs.
-
6)
Proteins from the annotations of the ant genomes, Harpegnathos saltator and Camponotus floridanus available in GenBank.
The final RefSeq annotation included models that were completely or partially supported by alignment evidence. The pipeline had some provisions to predict models that were disrupted by frameshifts or stop codons in the assembly, some of which were assigned protein accessions (XP_ prefixes) and named with the prefix ‘LOW QUALITY PROTEIN’, whereas others were conservatively classified as pseudogenes and not assigned protein accessions. The RefSeq annotation is available from NCBI's genomes FTP site as NCBI build 5.1 [63].
AUGUSTUS
AUGUSTUS can be used as an ab initio gene prediction tool, but can also integrate extrinsic evidence from various sources [67].
Generation of training gene structures and training AUGUSTUS
We used Scipio [68] to generate gene structures on the Amel_4.5 assembly with a protein set from a previous A. mellifera annotation (Amel_pre_release2_OGS_pep.fa). We used these gene structures to optimize AUGUSTUS parameters for A. mellifera, and constructed UTR models from 78,274 A. mellifera ESTs from GenBank, which were used to train AUGUSTUS UTR parameters. We then predicted genes in the Amel_4.5 assembly. Individual RNAseq reads from 454 sequencing (described above) were mapped against predicted transcripts, and fully covered transcripts (2,151) were selected as training genes for optimizing a final AUGUSTUS parameter set.
Gene prediction with AUGUSTUS
We generated three gene sets using AUGUSTUS: a gene set with extrinsic evidence from ESTs and RNAseq data (AU9), a particularly inclusive gene set that contained many alternative transcripts for peptide identification (AU11), and a gene set with extrinsic evidence from ESTs, RNAseq data and peptide data (AU12) (Table S6 in Additional file 1).
We created extrinsic evidence “hints” for protein coding genes and transcripts from the A. mellifera ESTs and from 454 transcriptome libraries. Genes were predicted with AUGUSTUS, allowing the prediction of alternative transcripts and allowing the splice site AT-AC (in addition to GT-AG and GC-AG) in case of supporting extrinsic evidence. The resulting gene set was named AU9. AU9 genes were predicted using the following options: 2augustus –species = honeybee1 –UTR = on –print_utr = on –hintsfile = all_but_no_peptide.hints –extrinsicCfgFile = extrinsic.M.RM.E.W.cfg –exonnames = on –codingseq = on –alternatives-from-evidence = true –allow_hinted_splicesites = atac genome.fa
For the purpose of peptide identification, alternative transcripts were extensively sampled with AUGUSTUS, resulting in the gene set AU11. AU11 genes were predicted issuing the following command: augustus –UTR = on –print_utr = on –hintsfile = all_but_no_peptide.hints –extrinsicCfgFile = extrinsic.M.RM.E.W.cfg –exonnames = on –codingseq = on –species = honeybee1 –alternatives-from-evidence = true –alternatives-from-sampling = true –sample = 100 –minexonintronprob = 0.1 –minmeanexonintronprob = 0.4 –maxtracks = -1 –allow_hinted_splicesites = atac genome.fa
We generated hints from peptide sequences by mapping the peptides against the protein set of AU11 and against a six-frame translation of the genome using BLAT [69] in a non-redundant way (i.e. parts of the six-frame translation that were included in the protein set AU11 and redundant parts of the AU11 protein set that were removed). A final AUGUSTUS gene set AU12 was generated using all available extrinsic evidence and the following options: augustus –species = honeybee1 –UTR = on –print_utr = on –hintsfile = all.hints –extrinsicCfgFile = extrinsic.M.RM.E.W.cfg –exonnames = on –codingseq = on –alternatives-from-evidence = true –allow_hinted_splicesites = atac genome.fa.
Fgenesh++
Predictions were made using FGENESH 3.1.1 [70, 71] using the HBEE matrix with parameters specific for A. mellifera. We used Illumina transcriptome data from A. mellifera forager and nurse brains available from the SRA (SRP003528), which were a total of 181.8 M spots, 18.3G bases of 100 bp single end reads generated on an Illumina Genome Analyzer II in 2010. We mapped the Illumina RNASeq using the ReadsMap program [72], which provided intron position information to Fgenesh for building sample-specific gene models. Predictions were made based on individuals used in Illumina RNASeq libraries (five nurses and five foragers), and then were combined into forager and nurse group predictions, and redundant genes (those having coinciding coding sequences) were removed from each set. Finally, the forager and nurse predictions were combined into a final set of predictions, and redundant genes (those having coinciding coding sequences) were removed from this combined set to produce the final Fgenesh++ prediction set.
GeneID
GeneID is an ab initio gene prediction program used to find potential protein-coding genes in anonymous genomic sequences. The training of GeneID to obtain a parameter file for A. mellifera was based on the method described to obtain a Drosophila melanogaster GeneID parameter file [73]. Training was performed in a “semi-automated” manner by employing a recently developed GeneID training tool that computes position weight matrices (PWMs) or Markov models of order 1 for splice sites and start codons, and derives a model of coding DNA, which, in this case, is a Markov model of order 5. Furthermore, once a preliminary species-specific matrix is obtained it is further optimized by adjusting two internal matrix parameters: -the cutoff of the scores of the predicted exons (eWF) and the ratio of signal to coding statistics information to be used (oWF).
The initial A. mellifera training set was comprised of the 2,151 gene models used to train AUGUSTUS (described above). Of these gene models 80% (1,720) were used to train GeneID while the remaining 20% (431) were set aside to test the accuracy of the newly developed matrix. The 1,720 A. mellifera protein-coding gene models included 7,913 canonical donor splice sites/7,939 canonical acceptor sites and 1,720 start codons. The start codons were used to compute PWMs while the donor and acceptors were used to derive Markov matrices of order 1. Given the large number of sequences we also had enough coding (2,338,800) and non-coding (5,561,154) nucleotides to derive a Markov of order 5 for the coding potential. We tested accuracy of the GeneID A. mellifera parameter file on an artificial contig consisting of the 431 evaluation-set concatenated gene models with 800 nucleotides of intervening sequence between each of the genes (Table S7 in Additional file 1). We then used the GeneID parameter files to predict genes on an assembly consisting of Amel_4.5 sixteen chromosomes and 5,304 "unplaced" scaffolds files that had repeat sequences masked using Repeatmasker. GeneID predicted 24,554 protein-coding genes.
SGP2
SGP2 is a syntenic gene prediction tool that combines ab initio gene prediction (GeneID) with TBLASTX searches between two or more genome sequences to provide both sensitive and specific gene predictions, and it tends to improve GeneID’s performance, especially by reducing the number of false-positive predictions. SGP2 requires one or more reference genomes to which the target genome (in this case A. mellifera) is compared. We decided to use the genomes of Nasonia vitripennis, N. giraulti and N. longicornis[23] as references to develop our A. mellifera parameter file for SGP2 because the genus Nasonia is at an appropriate evolutionary distance from Apis such that mostly the coding regions of the genes, not the introns or intergenic regions, are significantly conserved between these two genomes.
Obtaining the SGP2 A. mellifera-specific parameter file was based on the methodology described by Parra et al[74] used to obtain a human SGP2 parameter file using mouse homology evidence. The starting point to obtain a parameter file for SGP2 was the previously described GeneID A. mellifera matrix. The GeneID-derived SGP2 parameter file was optimized on an artificial contig comprising the same concatenated 1,720 sequences used to train GeneID, with 800 nucleotides between each of the gene models. We optimized the SGP2 matrix by modifying not only the eWF internal parameter (as previously for the GeneID parameter file) but also two SGP2-specific internal parameters (“NO_SCORE” and “HSP_factor”). The “NO_SCORE” parameter provides a penalty for no overlap between TBLASTX-derived HSPs (High-Scoring Pairs) and GeneID ab initio predictions in the same region whereas the “HSP_factor” parameter reduces the score assigned to the HSPs in order to maximize the prediction accuracy. We evaluated the newly developed SGP2 parameter file on the same artificial contig consisting of the 431 concatenated gene models with 800 nucleotides of intervening sequence between each of the genes used to evaluate the GeneID bee matrix (Table 1). We then used the SGP2 parameter file2 to predict genes on the same assembly file used for GeneID, and generated 20,179 predictions.
N-SCAN
We used the N-SCAN package [75] to leverage conservation between the A. mellifera genome and genomes of two other bee species, A. florea (Aflo_1.0 ) and Bombus terrestris (Bter_1.0). We first masked Amel_4.5 for simple sequence repeats using RepeatMasker [64]. We ran LASTZ [76] using default parameters with Amel_4.5 as the target genome, and either Apis florea (Aflo_1.0) or Bombus terrestris (Bter_1.0) as the informant genome. We then used iParameterEstimation to generate both an Amel_4.5-Aflo_1.0 specific parameter set as well as an Amel_4.5-Bter_1.0 parameter set using the training set described above for AUGUSTUS gene prediction, including UTR features. Finally, we ran N-SCAN using each of the A. mellilfera specific parameter sets with the respective LASTZ informant alignments to produce two N-SCAN prediction sets, one set based on Aflo_1.0 as the informant genome and the other set based on Bter_1.0 as the informant genome.
Combining gene sets
Input data for combined gene sets
We used MAKER2 and GLEAN to generate combined gene sets. The MAKER2 and GLEAN analyses used the same set of input data. Both analyses combined the gene predictions described above (NCBI, AUGUSTUS, Fgenesh++ with RNAseq, N-SCAN using A. florea as an informant genome, GeneID and SGP2) with transcript and protein homolog alignments. Transcript data included the new 454 transcriptome data described above, A. mellifera ESTs from GenBank and Illumina nurse and forager reads downloaded from the SRA (SRP003528, described above).
We aligned Illumina reads to Amel_4.5 in two groups, nurse and forager, using Tophat version 1.3.1 with the option "--butterfly-search" for more sensitive splice junction detection, and then generated predicted transcripts for each set of pooled data using Cufflinks version 1.0.3 with default parameters. The 454 reads were assembled into contigs de novo using Newbler (2.3-PreRelease-9/14/2009) with the cDNA option. We aligned 454 contigs and ESTs to Amel_4.5 using MAKER2 v2.15, which uses WU-BLAST [77] and Exonerate est2genome [78], with minimum 80% alignment coverage and 95% identity. Protein homolog alignments included SwissProt [79] Metazoa homologs, Drosophila melanogaster (fruit fly; r5.31) [80], Nasonia vitripennis (parasitoid wasp; OGSv1.2) [23] and the ants: Atta cephalotes (OGSv1.1) [22], Camponotus floridanus (OGSv3.3) [18], Harpegnathos saltator (OGSv3.3) [18], Linepithema humile (OGSv1.1) [20], Pogonomyrmex barbatus (OGSv1.1) [21]). Proteins in the SwissProt dataset annotated as transposable elements were removed prior to alignment. We aligned protein sequences to Amel_4.5 using Exonerate protein2genome with a minimum 60% percent identity and 60% alignment coverage.
MAKER2
To create a combined gene set, we ran MAKER2 v2.15 using parameters min_contig = 1000 and pred_gff, which allowed us to provide as input the gene prediction sets and alignments described above, instead of generating new evidence tracks within MAKER2.
GLEAN
We ran GLEAN [17] 32 times to create consensus gene sets using different combinations of the gene prediction sets described above. All of the GLEAN runs included the transcript and protein homolog alignments described above, and required a minimum coding sequence length of 75 nt.
Selecting the new official gene Set
We evaluated the 32 GLEAN sets based on several criteria including overlap with a conservative evidence-based set (RefSeq), transcript sequences, peptides and the CEGMA conserved core set [30] (Additional file 2). NCBI’s RefSeq pipeline has been considered reliable and relatively conservative, so we performed overlap analyses to determine how many of the RefSeq models were captured in the GLEAN consensus sets. We used FASTA [81] to align GLEAN coding sequences with RefSeq coding sequences where a perfect alignment required 99% identity over the entire lengths of both sequences. We used FASTA to align peptides to GLEAN proteins, with E-value and scoring matrix optimized for short exact matches (E-value .01 and MD10 scoring matrix). We parsed peptide alignments to count those with 100% identity and 100% alignment coverage (Additional file 2).
In addition to alignments between GLEAN predictions and RefSeq genes and peptides, we considered numbers of gene models, total numbers of coding nucleotides, average coding sequence lengths, and numbers of splits and merges compared to RefSeq. We selected the GLEAN31 set (Additional file 2) to be the official gene set, OGSv3.2. It was the set generated using gene predictions from NCBI Gnomon, AUGUSTUS, Fgenesh++, N-SCAN using A. florea as an informant genome and GeneID and SGP2 combined as a single set as well as the transcript and protein homolog alignments.
Adding UTRs to OGSv3.2 gene models
Although we did not use MAKER2 to generate the official gene set, we did use MAKER2 (v2.15) [28] to add UTR to the GLEAN coding exons since GLEAN does not produce annotations with UTR features. OGSv3.2 coding exons were input as “pred_gff” and transcriptome evidence (including the 454 transcripts, Illumina contigs and A. mellifera ESTs described above) was input as “est”. MAKER2 aligned the EST evidence to Amel_4.5 using WU-BLAST and Exonerate est2genome then used overlapping EST evidence to extend OGSv3.2 models to include UTR features when possible. Because MAKER2 sometimes modified the coding exon coordinates, we processed the MAKER2 output and retained UTRs only when the coding exon structure was unchanged. Out of a total of 15,314 OGSv3.2 genes, UTR were added to 7,514 genes (49%).
Identifying and characterizing of new genes
Identifying of new and previously known OGSv3.2 genes
In order to compare the newest gene set (OGSv3.2) with the first official gene set (OGSv1.0), we mapped OGSv3.2 coding sequences to Amel_2.0, which was the assembly on which OGSv1.0 was generated [1]. First, we used MegaBLAST [82] to identify scaffold/gene matches with 95% identity and E-value < 1 × 10-20. We then aligned coding sequences to matching scaffolds using GMAP [83], and parsed the output to create two sets of splice-modeled alignments, both requiring 95% identity. One set was based on a relaxed coverage criterion, requiring that the alignment cover at least 50% of the coding sequence. The other set was based on a stringent coverage criterion, requiring that the alignment cover 80% of the coding sequence. Results of further analyses for the relaxed mapping set are provided in the supplemental materials, but we discuss only results for the stringent mapping.
On the basis of mapping OGSv3.2 to Amel_2.0 and overlap between OGSv3.2 and OGSv1.0 gene models on the Amel_2.0 assembly, we divided 15,314 OGSv3.2 genes into three sub-sets (Table 6). We deemed any OGSv3.2 gene that did not align to the Amel_2.0 assembly a “Type I New” gene. The additional sequencing and reassembly of the genome for the Amel_4.5 assembly likely allowed the detection of these genes. “Type II New” genes were those that did align to the Amel_2.0 assembly, but whose coordinates did not overlap an OGSv1.0 gene by a single coding base pair on the same strand. Additional expressed sequence and protein homolog evidence as well as improvements to gene prediction algorithms likely contributed to the detection of these genes. Finally, any OGSv3.2 gene that both aligned to the Amel_2.0 assembly and overlapped an OGSv1.0 gene was deemed a “Previously Known” gene.
Coding sequence length analysis
The total length of the coding sequence was calculated for each gene and the means for all genes and each gene sub-set were calculated (Table 6). We tested the null hypotheses that the mean coding sequence lengths of Type I and Type II New genes and Previously Known genes were equal.
Splice site and single versus multiple coding exon gene analysis
We assessed the genomic sequence of the two intronic base pairs adjacent to each coding sequence exon-intron splice site to determine whether they corresponded to the canonical …]5’-GT/AG-3’[… splice site sequence. We considered only splice sites supported by matching intron coordinates of spliced transcript alignment evidence. We used chi-square tests with one degree of freedom to compare the frequencies of single coding exon genes and non-canonical splice sites in Type I or Type II New genes with Previously Known genes.
OGSv3.2 gene location relative to GC compositional domains
We used IsoPlotter, a recursive segmentation algorithm [84, 85], to partition the A. mellifera genome into GC compositionally homogeneous domains, contiguous regions of the genome with similar GC content (percent G + C nucleotides). We determined the GC content of the GC compositional domain for each OGSv3.2 gene. In cases where a gene spanned multiple GC compositional domains, we calculated the average GC content of the GC compositional domains, and weighted it according to the fraction of the gene's length that occurs in each GC compositional domain. We computed the weighted percent GC based only on non-coding nucleotides of the compositional domains, because we did not wish to include effects of codon bias.
We compared the distribution of Type I and Type II New genes with respect to GC content to that of Previously Known genes. IsoPlotter cannot segment scaffolds less than 10 kb into compositional domains, so the 90 genes residing in these short scaffolds were not considered. We tested the null hypotheses that the means of the weighted GC compositional domain contents for genes in the Type I and Type II New gene sets were equal to the mean of the Previously Known genes.
Effective number of codons analysis
Using the chips program within the EMBOSS package [86], we calculated the effective number of codons separately for each OGSv3.2 gene. We tested the null hypotheses that the mean effective number of codons values for the Type I and Type II New genes were equal to the mean of the Previously Known genes.
Tissue expression analysis
We determined the number of OGSv3.2 gene overlaps to transcript alignments that were used in creating the GLEAN consensus gene sets (454 reads, Illumina contigs and A. mellifera ESTs from GenBank, described above). We relied on splice signals to determine the directionality of a transcript read. We tallied spliced and unspliced alignments separately, since we could be confident when directionality of a spliced alignment agreed with a gene prediction, but could not be confident about unspliced alignments. For spliced transcript alignments, if the transcript was on the opposite strand from the gene then it was discarded from further analysis. For transcripts on the same strand or transcripts that were un-spliced, in which case directionality could not be determined, a coordinate overlap of at least one coding base pair was required for a gene to count as overlapping with a transcript alignment. We counted the number of transcript data sets in which each OGSv3.2 gene was found to have an overlapping alignment (Table 6.) We performed chi-square tests with one degree of freedom to compare the frequencies of spliced transcript overlap in the Type I or Type II New gene sets with the Previously Known gene set.
Of genes that overlapped spliced transcript alignments, we identified genes that were narrowly expressed and genes that were broadly expressed on the basis of overlap to the four single-tissue libraries (brain [combined Illumina forager and nurse brain libraries] and 454 libraries of mixed antennae, ovary and testes). (Foragers and nurses are worker honey bees that specialize on collecting food and feeding brood, respectively.) Genes were deemed narrowly expressed if they overlapped at least one transcript alignment in only one of the four tissues and broadly expressed if they overlapped at least one transcript alignment from all four tissues. We performed chi-square tests with one degree of freedom to compare the frequencies of narrowly expressed genes and broadly expressed genes in the Type I or Type II New gene sets with the Previously Known gene set.
Homolog analysis
We determined the number of protein alignments overlapping OGSv3.2 for protein data sets that were used in creating the GLEAN consensus gene sets. We required overlap of at least one coding base pair on the same strand to deem a gene overlapping with a protein homolog alignment. We performed chi-square tests with one degree of freedom to compare the frequency of the existence of overlaps to homolog alignments for Type I or Type II New genes with that of Previously Known genes.
TBLASTN alignment of OGSv3.2 to A. florea and B. terrestris genomes
We aligned OGSv3.2 protein sequences to the A. florea (Aflo_1.0) and B. terrestris (Bter_1.0) genome assemblies using TBLASTN [87] with an E-value criteria of 1 × 10-06. We did not use information about predicted genes or expression in A. florea or B. terrestris.
Statistical methods to test for differences in means between new and previously known genes
To test the null hypothesis that the mean of a particular variable for Type I or Type II New genes was equal to the mean of the Previously Known genes, we used both a non-parametric Kolmogorov-Smirnov test and a Welch t-test with the correction for non-homogeneity of variances. Testing the hypotheses in this way avoids assuming these data were normally distributed or had equal variances. To be conservative, we report the least significant P-value for each test.
Using peptide data in development and analysis of gene sets
We used the peptide data is described above to evaluate the GLEAN sets and analyze Type I New, Type II New, and Previously Known genes. We aligned the peptide sequences to predicted protein sequences with FASTA [81] with a relaxed E-value of 0.1 and the MD10 scoring matrix. These parameters were found to allow matches to peptide sequences as short as 6 amino acids with 100% identity. Only alignments with 100% identity were retained. We used chi-square tests with one degree of freedom to compare the frequency of proteins that aligned to peptide sequences in Type I or Type II New genes with that of Previously Known genes.
The venom peptide data were used separately to evaluate gene sets for improved identification of venom genes. All significant and top ranking venom peptides from the Mascot output, with an ion score ≥30 were retained in the final peptide lists. Venom mass spectra were searched against the OGSv3.2 and OGSv1.0 to compare the contribution of both datasets to the identification of new venom genes. The false discovery rates of both searches were set to 1%.
Venom peptide data were used to annotate venom genes using the Apollo annotation tool [36], provided for Amel_4.5 by the Hymenoptera Genome Database [88].
Assessing orthology
The protein-coding gene annotations from the two A. mellifera genome assemblies were compared with orthologs from OrthoDB [89] from nine other insects. These included Pediculus humanus (PhumU1.2, 10,772 genes), Acyrthosiphon pisum (ACYPI v2.1, 36,275), Nasonia vitripennis (nvit2, 24,369), Linepithema humile (OGSv1.2, 16,048), Pogonomyrmex barbatus (OGSv1.2, 17,100), Tribolium castaneum (Tcas 3.0, 16,565), Danaus plexippus (OGS2, 15,329), Anopheles gambiae (AgamP3.6, 12,670), and Drosophila melanagaster (r5.45, 13,927). For annotations on the older A. mellifera genome assembly, we used the gene set known to the community as “Amel_prerelease2” (abbreviated as V2), available from BeeBase. It was the gene set resulting from mapping OGSv1.0 from assembly Amel_2.0 to assembly Amel_4.0, and included a small number of manual annotations, with a total of 10,699 genes. Comparing each Apis annotation (V2 and OGSv3.2) to the other nine gene sets identified near-universal orthologous groups with orthologs in all but one or all but two insects. For each gene set, Figure 4 and Table S5 in Additional file 1 show counts of near-universal insect orthologous groups that are missing orthologs in different species. Total counts were partitioned into groups with only single-copy orthologs and those with gene duplications, further divided into those with only one missing species and those with two missing species.
Predicting protein functions
GO analysis
We used FASTA [81] with an E-value threshold of 1 × 10-6 to compute reciprocal alignments between OGSv3.2 proteins and a D. melanogaster protein set consisting of the longest protein isoform of each gene (annotation version r5.42). We identified reciprocal best hits (RBH) and transferred Gene Ontology (GO) [90] annotations from the D. melanogaster protein to the A. mellifera protein for each RBH pair, using the GO annotation file available at FlyBase [80]. We used GeneMerge [91] to test for enrichment of GO terms in OGSv3.2 protein datasets, testing for each of the three GO ontologies (Molecular Function, Biological Process and Cellular Component) separately. Several tests were performed with either the entire set of Gene Ontology terms, the generic GO slim set, or the GO slim set developed for A. mellifera by Whitfield et al. [92]. Population and test gene datasets for a particular GO ontology included only OGSv3.2 genes with GO annotations for that ontology. Population datasets consisted of all OGSv3.2 genes with GO annotations. Test datasets were 1) all new genes based on stringent mapping criteria, 2) Type I New genes using stringent criteria, 3) Type II New genes based on stringent mapping criteria.
InterPro analysis
We used InterProScan [93] to compare OGSv3.2 and OGSv1.0 proteins with the following InterPro [29] protein domain and motif databases: PFAM [94], TIGRFAMS [95], SMART [96], PRODOM [97], PROSITE [98], PIRSF [99], GENE3D [100], SUPERFAMILY [101], and PANTHER [102]. The total numbers of proteins annotated with at least one InterPro domain were 9,479 and 8,552 for OGSv3.2 and OGSv1.0, respectively. For each InterPro domain identified in the combined datasets, we determined the number of proteins containing that domain within OGSv3.2 and OGSv1.0. Then for each InterPro domain, we used 2 × 2 chi-square tests with Yates correction and one degree of freedom to determine whether the frequencies of proteins containing that domain differed between the OGSv3.2 and OGSv1.0 InterPro-annotated sets (total 9,479 and 8,552 proteins, respectively).
Detecting genome-wide repetitive elements
We detected and annotated repetitive elements with the REPET software package ([103], version 2.0) consisting of two pipelines integrating a set of bioinformatics programs. First, repeated sequences were detected by similarity, using an all-by-all BLAST [104] search via BLASTER [105]. LTR-retrotransposons were detected by structural search with LTRharvest [106]. The similarity matches were clustered with GROUPER [105], RECON [107] and PILER [108], the structural matches were clustered with NCBI BLASTclust [109]. From each cluster a consensus sequence was generated by multiple alignment with Map. We analyzed the consensus sequences for terminal repeats (TRsearch), tandem repeats (TRF), open reading frames (dbORF.py, REPET) and poly-A tails (polyAtail, REPET). We screened the consensuses for matches to nucleotide and amino acid sequences from known transposable elements (RepBase 17.01, [42]) using BLASTER [105], which runs tblastx and blastx [87], and searched them for HMM profiles (Pfam database 26.0, [94]) using HMMER3 [110]. Based on the detected structural features and homologies, we classified the consensuses using PASTEC according to Wicker et al. [111]. We then removed redundancies identified with BLASTER and MATCHER [105] as well as elements classified as simple sequence repeats (SSRs; >0.75 SSR coverage) or unclassified elements built from less than 10 fragments.
We used the set of de novo detected repetitive elements to mine the genome in a second pipeline with BLASTER (using NCBI BLAST, sensitivity 4, followed by MATCHER), RepeatMasker (using CrossMatch, sensitivity q, cutoff at 200) and CENSOR (using NCBI BLAST). We removed false positive matches using an empirical statistical filter. Satellite repeats were detected with TRF [112], MREPS [113]and RepeatMasker [64] and were then merged into a single set. We also screened the genomic sequences for matching nucleotide and amino acid sequences from known transposable elements (RepBase 17.01, [42]) with BLASTER. Finally we removed TE doublons (loci annotated as multiple transposable elements) and SSR annotations within transposable element annotations, and performed a "long join procedure" to connect distant fragments. Sequences from the de novo repetitive element library, which were found to have at least one perfect match in the genome were then used to rerun the whole analysis.
To ensure compatibility and to avoid introducing a bias, we refrained from a manual curation or clustering of the de novo detected elements before mining the genome. However, post hoc we manually analyzed all elements, which were previously classified into class I retrotransposon or class II DNA transposon elements or unclassified elements with detected coding element features (similarity to known transposable elements) due to potential chimeric insertion. At this stage we excluded derivative elements (LARD, TRIM, MITE) from further detailed inspection unless carrying a class I or class II element. Some elements were classified “potential Hostgene” in the computational analysis, based on characteristics of the DNA, not based on overlap analysis with predicted genes. These “potential Hostgene” elements, as well as unclassified elements (“noCat”), were also excluded from manual analysis. We performed manual inspection by checking for open-reading frames (ORF) with the NCBI ORF Finder (NCBI), by searching the NCBI Conserved Domain Database (CDD) [114], by searching the most up to date online RepBase database (accessed December 2012-February 2013) via CENSOR [115]. We also performed phylogenetic analysis for LINE RT domains with RTclass1 [116] in order to achieve a detailed classification for each element, determine its potential relation to a family of known elements, to evaluate the completeness and to detect potential active elements. We defined an element to be complete, if it possessed the relevant coding parts with the element-typical domains and the structural features (LTR, TIR). The potential activity was defined according to the region an intact ORF, if present, covered. If an intact ORF seemed to cover a complete region including the typical domains (e.g. GAG as well as POL, Tase) then the element was considered to be potentially active. If a Tase domain was covered by a truncated ORF or the Tase itself appeared to be truncated but was covered by an intact ORF, or if the RT domain was covered by an active ORF but not the remaining element-typical domains, then the element was considered to be maybe potentially active. During the manual classification to at least superfamily level, novel transposable element types not covered by the system of Wicker et al. [111] were also considered: Kolobok, Sola, Chapaev, Ginger, Academ, Novosib and ISL2EU class II DNA transposons [117, 118].
Simple sequence repeats and other low complexity regions were extracted from the REPET pipeline database and processed with a custom Perl script to calculate the total coverage of these types of repetitive DNA by omitting overlaps with transposable element or other repetitive element annotations.
Ethical approval
Experimental research followed appropriate guidelines for the ethical treatment of research subjects. Human subject and animal protocol approvals were not applicable.
Data availability
The 454 transcript read data are available in the Sequence Read Archive (SRA) [119] at the NCBI (SRP003261, SRP003260, SRP001899). The assembled 454 transcripts (119,959) are in the Transcriptome Shotgun Assembly (TSA) database, and available through NCBI BioProject pages for PRJNA51481 and PRJNA51483 [120].
The accessions for the assembled 454 transcript data are: HP542035 to HP552088 (testes), HP527956 to HP542034 (mixed antennae), HP509343 to HP527955 (embryo), HP482918 to HP509342 (brain; ovary), HP473811 to HP482917 (larvae), HP459439 to HP473810 (abdomen), and HP552089 to HP579397 (ovary).
The SOLID genomic read data are available in the SRA (SRX097020). The 454 genomic read data are also available (SRX006752, SRX000071).
The Amel_4.5 assembly is available from NCBI under the accession GCA_000002195.1.
The following resources are available at BeeBase [121], a division of the Hymenoptera Genome Database [88]: genome browsers with gene annotations and supporting evidence alignments; BLAST databases with the Amel_4.5 scaffold assembly, OGSv3.2, all input gene prediction sets and transcript contig assemblies; and a data download page with fasta sequence files and gff for annotations. Peptides and their tandem mass spectra are available from the Apis mellifera PeptideAtlas [62].