Gene discovery in the horned beetle Onthophagus taurus
© Choi et al. 2010
Received: 24 May 2010
Accepted: 14 December 2010
Published: 14 December 2010
Skip to main content
© Choi et al. 2010
Received: 24 May 2010
Accepted: 14 December 2010
Published: 14 December 2010
Horned beetles, in particular in the genus Onthophagus, are important models for studies on sexual selection, biological radiations, the origin of novel traits, developmental plasticity, biocontrol, conservation, and forensic biology. Despite their growing prominence as models for studying both basic and applied questions in biology, little genomic or transcriptomic data are available for this genus. We used massively parallel pyrosequencing (Roche 454-FLX platform) to produce a comprehensive EST dataset for the horned beetle Onthophagus taurus. To maximize sequence diversity, we pooled RNA extracted from a normalized library encompassing diverse developmental stages and both sexes.
We used 454 pyrosequencing to sequence ESTs from all post-embryonic stages of O. taurus. Approximately 1.36 million reads assembled into 50,080 non-redundant sequences encompassing a total of 26.5 Mbp. The non-redundant sequences match over half of the genes in Tribolium castaneum, the most closely related species with a sequenced genome. Analyses of Gene Ontology annotations and biochemical pathways indicate that the O. taurus sequences reflect a wide and representative sampling of biological functions and biochemical processes. An analysis of sequence polymorphisms revealed that SNP frequency was negatively related to overall expression level and the number of tissue types in which a given gene is expressed. The most variable genes were enriched for a limited number of GO annotations whereas the least variable genes were enriched for a wide range of GO terms directly related to fitness.
This study provides the first large-scale EST database for horned beetles, a much-needed resource for advancing the study of these organisms. Furthermore, we identified instances of gene duplications and alternative splicing, useful for future study of gene regulation, and a large number of SNP markers that could be used in population-genetic studies of O. taurus and possibly other horned beetles.
Horned beetles, in particular in the genus Onthophagus, are important models for studies on sexual selection [1–3], biological radiations [4–7], endocrine regulation of development [8–11], biological control of invasive species [12–14], conservation biology [15, 16], and forensic biology [17–19]. Onthophagus beetles have more recently gained particular prominence as models for studying the origin and diversification of novel traits (hundreds of species express diverse horns and horn-like structures that lack obvious homology to any other traits in insects [20, 21]) and the developmental underpinnings of phenotypic plasticity (species adjust adult morphology, behavior, and physiology in response to larval nutrition, ranging from subtle adjustment to profound modifications depending on species and phenotype [22–27]).
Despite their growing prominence as models for studying both basic and applied questions in biology, no genome projects exist for any Onthophagus species. Instead, investigations into the genetic basis of Onthophagus biology have had to rely on homology-based gene-by-gene cloning [28, 29] and only very recently on low throughput EST sequencing . At the same time, development of genomic resources in several other insect models, such as Drosophila, mosquitoes, Tribolium beetles , honey bees , and several lepidopteran species [33–35], has greatly advanced insights into the molecular and developmental genetics, physiology, and evolution of these organisms. Onthophagus beetles offer great opportunities to add to the study of important biological phenomena pioneered through the study of these earlier models, such as the regulation of arthropod development, which has been informed in large part through work on fruit flies and Tribolium beetles [36, 37], the origin of novel complex traits, as studied in butterfly wing patterns [38, 39], or the genetic regulation of nutrition-sensitive development, a central focus of honey bee research [40–42].
Furthermore, several other experimental techniques and tools have been successfully developed for Onthophagus, most notably RNAinterference mediated transcript depletion . Applying such tools to the study of Onthophagus biology has, however, been hampered by the paucity of candidate genes and pathways available for investigation. The very recent development of modest EST resources for Onthophagus taurus using traditional Sanger sequencing  has already facilitated several important new research efforts [44–46]. Combined, this suggests that studies on Onthophagus beetles are poised to make rapid progress once large-scale genomic or transcriptomic resources are available, which in turn promises to advance our understanding of fundamental and applied question in evolution and developmental biology. Here we describe an EST collection developed for the horned beetle Onthophagus taurus, the most commonly studied species of horned beetle to date.
Sequencing and assembly statistics
Total number of reads
Total length of reads (bp)
Total number of reads cleaned
Total length of reads cleaned (bp)
Number of reads placed
Number of singletons
Total length of singletons (bp)
Average length of singletons (bp)
Largest singleton (bp)
Number of contigs
Total length of contigs (bp)
Average length of contigs (bp)
Largest contig (bp)
Average read coverage of contigs
Sequence matches against public databases
Tribolium annotated proteins
Further analyses of these sequences revealed the following: (i) 759 sequences matched the T. castaneum annotated protein coding sequences and genome. (ii) 345 sequences matched the T. castaneum annotated proteins but not the T. castaneum genome, and thus are likely to be genuine transcribed sequences, but the homologous sequences may not be included in the assembled T. castaneum genome. (iii) Of 874 sequences that matched only the T. castaneum genome, 446 contigs and 114 singletons matched T. castaneum sequences that lie within the bounds of annotated genes, but are not annotated as transcribed sequences. These may represent sequences that are included in mature transcripts in O. taurus but not T. castaneum. Alternatively, they may identify sequences that are included in mature transcripts in T. castaneum but not annotated as such. (iv) 314 sequences matched unannotated genomic sequence in T. castaneum, and thus may identify lineage specific genes, or more likely, genes that are not annotated in T. castaneum. In total then, 52% (25,968) of the O. taurus non-redundant sequences matched either NR, or T. castaneum genomic or protein sequences. Conversely, the O. taurus non-redundant sequences match with approximately 56% (9,303/16,645) of the gene models in T. castaneum, and 59% (5,322/9,053) of the sequence clusters in the Tribolium UniGene set (Table 2). These data suggests we have sampled a significant fraction of the O. taurus transcriptome.
Nevertheless, 48% of the O. taurus non-redundant sequences did not match with the T. castaneum genome or proteins. Of 38,050 contigs, 21,847 sequences had significant hits (E-value cutoff of 1 × 10-5) to the Tribolium genome, Tribolium proteins, and/or NCBI NR (Figure 1), while 16,203 sequences had no significant match to any of these databases. We performed additional analyses on the "no-hit" sequences to determine whether they represent poor sampling (i.e., short or few reads) or biologically interesting (highly divergent) genes. Compared to sequences with significant hits, the no-hit sequences had shorter contig lengths (mean (SE): hit = 733.5 (3.17); no hit = 391.1 (3.69); F 38048 = 4941, P = 0), smaller proportions of read length made up of predicted coding sequence (mean (SE): hit = 0.751 (0.002); no hit = 0.518 (0.002); F 38048 = 7,737, P = 0), and fewer reads (mean (SE): hit = 74.1 (0.67); no hit = 22.64 (0.78); F 38048 = 2,492, P = 0). While the mean distribution of contig length, proportion of coding sequence and read number was significantly different between hit and no hit sequences, there was considerable overlap between the two distributions (see Additional file 3). In particular, 21% of the no hit sequences (3,338 of 16,203) had at least the average read length (733 bp) and proportion coding sequence (0.74) of sequences with hits. This subset of high quality sequences had on average 39 reads, suggesting they were not simply genes with low expression. Taken together, this analysis suggests that while many of our "no hit" sequences likely represent low information content of a contig due to short or few reads, a significant proportion of these no hit sequences may represent highly divergent or novel genes that may prove interesting in future study.
In summary, the O. taurus non-redundant sequences match with over half of the genes in T. castaneum. If we assume that these two beetle species have similar total gene numbers, then we can infer that we have sampled a significant proportion of genes in O. taurus. There is no evidence that the O. taurus sequences are significantly contaminated with sequences from other taxa. Furthermore, we seem to have sampled many genes that may represent highly divergent or novel proteins.
It was to be expected that many of the non-redundant sequences would derive from non-overlapping regions of common transcripts. Indeed the 23,990 non-redundant Onthophagus sequences match a total of 14,223 distinct sequences in the NCBI NR protein database (Table 2). We took two approaches to identify clusters of non-redundant sequences that potentially derive from common transcription units.
A total of 4,205 contigs met the first set of criteria and were flagged as putative alleles and merged into 2,026 groups. The hypothesis that these contigs derive from the same gene was supported by sequence matches against NR proteins. Of the pairs of merged contigs where both contigs had sequence matches against NR, 99% (1,547/1,565) had best matches to the same protein sequence, and only 1% (18/1,565) matched different proteins.
A total of 85 pairs of contigs met the second set of criteria and were flagged as presumptive duplicates. The hypothesis that these contigs derive from duplicate genes was supported by sequence matches against NR proteins. Of the pairs of contigs where both contigs have sequence matches against NR, 96% (23/24) had best matches to the same NR protein, and only 4% (1/24) had best matches to different proteins. We also asked whether there is evidence that the genes corresponding to the NR proteins are likely to be duplicated in the genomes of related species. Of the sequences tested 85% had two matches (E-value < 1 × 10-5) in T. castaneum (of 26 with at least one match) and 83% had two matches in a D. melanogaster (of 24 with at least one match). This indicates that 83-85% of the BCCs flagged as putatively duplicated in Onthophagus are also duplicated in other arthropods providing support for this classification.
We identified 753 groups of contigs (BCC) that met the third set of criteria and were thus flagged as putatively derived from alternative splicing. The hypothesis that these BCCs derive from alternatively spliced transcripts was supported by sequence matches against NR proteins, and Drosophila genes. Of these BCCs in which all contigs had sequence matches against NR, 62% (138/221) were composed of contigs that all had best matches with the same NR protein. We also investigated whether the corresponding genes (best Blastp match with a minimum of E-value = 1 × 10-5) in D. melanogaster are annotated as being alternatively spliced. 67% of the 206 cases tested were annotated as alternatively spliced in D. melanogaster. These data support the prediction that the corresponding genes are indeed alternatively spliced in O. taurus. A simple example is illustrated in Figure 3. Contigs 28477 (108nt), 25928 (769nt) and 04341 (639nt) form a BCC that is joined by a total of 67 broken reads (Figure 3A). Each edge of the BCC is defined by multiple broken reads, with the minimum being eight. The three contigs do not share significant sequence similarity among themselves (5.9% translated amino acid sequence identity between contig 25928 vs. contig 28477, 10.8% between contig 28477 vs. contig 04341, and 18.2% between contig 25928 vs. contig 04341), and as such, were flagged in our analysis as putatively belonging to a common transcription unit with alternative splicing. Indeed, the three conceptual polypeptide sequences from these contigs align to a contiguous region of the Disabled protein from T. castaneum, supporting this hypothesis. It should be noted that many of the BCCs that putatively result from alternative splicing have complex structures the resolution of which will require sequencing genomic and/or cDNA sequences. In summary, the analysis of BCCs grouped 10,387 of the non-redundant sequences into 2,603 groups flagged as putatively derived from a common gene (2,026 as allelic variants and 753 as alternatively spliced). While the BCC analysis provides reasonably strong evidence for merging sequences, it clearly does not capture all likely cases, and we therefore turned to similarity to known genes as a more comprehensive means of grouping non-redundant sequences, as described next.
Clustering using sequence matches to HomoloGene
1 × 10-5
1 × 10-20
1 × 10-50
Candidate developmental genes
PREDICTED: similar to fibroblast growth factor receptor
PREDICTED: similar to epidermal growth factor receptor
1, 2, 4, 21
1, 4, 21
PREDICTED: similar to ets
1, 9, 16
PREDICTED: similar to Homeobox protein cut
1, 10, 21
PREDICTED: similar to par-1 CG8201-PA
PREDICTED: similar to COUP-TF/Svp nuclear hormone receptor
PREDICTED: similar to thickveins CG14026-PA
PREDICTED: similar to DNA cytosine-5 methyltransferase
PREDICTED: similar to cornichon protein, putative
PREDICTED: similar to extracellular signal-regulated kinase
PREDICTED: similar to Ecdysone-induced protein 63E CG10579-PK
PREDICTED: similar to Ecdysone-induced protein 78C CG18023-PA,
matrix metalloproteinase 1 isoform 2
phosphatidylinositol 3-kinase regulatory subunit
PREDICTED: similar to insulin receptor
PREDICTED: similar to Phosphatidylinositol-3,4,5-trisphosphate 3-phosphatase and dual-specificity protein phosphatase PTEN
epoxide hydrolase 1
PREDICTED: similar to BarH1 CG5529-PA
PREDICTED: similar to LIM homeobox 1b
PREDICTED: similar to FAS-associated factor 1, putative
transcription factor deformed
effector caspase; Sl-caspase-1
PREDICTED: similar to caspase
PREDICTED: similar to E74
PREDICTED: similar to NAD-dependent deacetylase sirtuin-1
PREDICTED: similar to Darkener of apricot CG33553-PG
PREDICTED: similar to brahma CG5942-PA, isoform A, partial
PREDICTED: similar to Headcase protein
PREDICTED: similar to histone deacetylase
PREDICTED: similar to fringe CG10580-PA
PREDICTED: similar to suppressor of fused
PREDICTED: similar to supernumerary limbs CG3412-PA
homeodomain transcription factor Prothoraxless
PREDICTED: similar to rotated abdomen CG6097-PA
PREDICTED: similar to fruitless
PREDICTED: similar to BmDSX-F
PREDICTED: similar to iroquois-class homeodomain protein irx
PREDICTED: similar to cadherin
ecdysone receptor isoform A
nuclear hormone receptor FTZ-F1 beta
PREDICTED: similar to ecdysone inducible protein 75
retinoid X receptor
glass bottom boat protein
PREDICTED: similar to mothers against dpp protein
PREDICTED: similar to frizzled
PREDICTED: similar to frizzled 7
PREDICTED: similar to jnk
PREDICTED: similar to legless CG2041-PA
PREDICTED: similar to Wnt11 protein
PREDICTED: similar to Wnt6
PREDICTED: similar to wntless CG6210-PB
Our libraries sampled a total of 64 haploid genomes (32 diploid individuals) from a laboratory culture established from wild caught animals, which allowed us to begin to explore sequence variation in O. taurus. After merging contigs in BCCs, all contigs were used for sequence variant analysis. GigaBayes  identified 164,537 SNPs and 344,632 indels. After removing indel calls in homopolymer regions, 80,732 indels remained. Additional file 8 shows histograms of our confidence in identified sequence variants (based on the "probability" value calculated for each SNP by GigaBayes). We focused subsequent analyses on only high quality sequence variants: those with at least 5 reads and a GigaBayes probability value of at least 0.9. Additional file 9 shows sequence changes of 92,979 and 25,496 final SNPs and indels. Transitions - A-G and T-C mutations - were more common than transversions. Furthermore, insertions and deletions more commonly affected A and T than C and G.
Effect of contig length and coverage on detection of SNPs and Indels
Total detected SNPs
Total Detected Indels
Correlations between patterns of gene expression and estimated levels of variation
Residual SNP frequency
Residual Indel Frequency
Number of Tissues
We used Blast2Go to identify GO terms enriched in subsets of genes with more or less variation than expected. The most variable genes (the top 5% of genes in terms of residual SNP frequency) were enriched for three GO terms, including actin binding and cytoskeletal protein binding (Additional file 10, Sheet "top 5% SNP variation"). The least variable genes (the lowest 5% of genes in terms of residual SNP frequency) were enriched for 61 GO terms, including many processes related to metabolism, development, intracellular functions, cell cycle, nucleic acid binding, anatomical structures, and life history (Additional file 10, Sheet "lowest 5% SNP variation").
Our results build on a growing literature that shows that 454 pyrosequencing can be a powerful tool for expanding genomic resources for emerging model systems (e.g. [59, 60]. The major shortcoming of such high-throughput sequencing - short reads that can be difficult to assemble - are being overcome by advances in the sequencing process itself and novel bioinformatic approaches explored in studies such as this. Here we review our findings, the novelty of our approach, and highlight some of the specific tools this sequencing effort brings to the Onthophagus system.
As in most high-throughput sequencing projects, in one 454 run, we generated a massive body of sequence information: almost 600 millions base pairs spread over 1.4 million cleaned reads (Table 1). This has vastly expanded our existing set of Onthopagus ESTs that was previously generated by Sanger sequencing , with 93% of the 454 sequences (46,891) failing to match the Sanger sequences (E-value cut-off = 1 × 10-50). Nevertheless the Sanger ESTs are not entirely redundant, as 9.6% of the Sanger sequences failed to match the 454 sequences (E-value cut-off = 1 × 10-50). Using a series of analyses, these reads were assembled into contigs and clusters that represent approximately 8,500 known genes (4,000 - 14,000 depending on the database used and the stringency for the match; see Table 2, 3). Our analyses provide several lines of evidence that suggest we can be confident in our assembly.
First, the majority of our sequences significantly align with known protein sequences, in particular those of Tribolium, the only beetle genome currently sequenced. Specifically, 52% of contigs and singletons, and 59% of contigs align with known sequences from one of three databases (E-value < 1 × 10-5; Figure 1). The majority of these alignments (80%) agree between all three databases queried (see Table 2). A small number of genes (<4% of sequences) matched a protein in only one of the databases; these genes likely represent instances where the Tribolium genome is incompletely annotated or assembled or where genes are transcribed in Onthophagus, but not in Tribolium (see Figure 1).
While the majority of our sequences aligned with known sequences, 41-48% of our sequences did not match known proteins or the Tribolium genome. Our analyses suggest that many of these sequences represent contigs with short or very few reads, or reads that cover non-coding sequence (Additional file 3). However, at least 20% of these genes are of similar or greater quality (average read length and proportion coding sequence) of sequences with significant alignments, and may include genes that are novel or highly divergent between O. taurus and Tribolium. This is not surprising given that Onthophagus and Tribolium shared a common ancestor over 150 million years ago . These divergent genes may be fruitful for future research given that they may represent novel genes, or genes under strong selection for new functions.
A second line of evidence that generates confidence in our results is a survey of the taxonomic distribution of best sequence matches. Specifically, of contigs with significant matches to the NCBI NR database, over 8,000 (40%) were classified to Tribolium, and 18,500 (87%) to Arthropoda. A relatively small proportion was classified to taxa outside of insects. For instance, less than one percent of genes were classified to bacteria. Given these beetles feed on dung and have a diverse associated gut flora [62, 63], the non-arthropod-classified sequences could represent gut contamination of symbionts or partially digested plant products. We sought to minimize such contamination by sampling only the head and thorax of individual beetles, but presumably, contaminants could be present in the foregut or structures specialized to house symbionts. The fact that non-arthropod classified sequences are dominated by singletons, while the arthropod-classified sequences are dominated by contigs (Figure 2), supports the interpretation of minor contamination by naturally associated plant parts and bacteria.
Our Gene Ontology classification also lends significant support to our assembly of the Onthophagus taurus transcriptome. Comparing Onthophagus to Tribolium, our genes sample roughly the same proportion of gene classes for classifications of molecular function, cellular component, and biological process (Figure 4). For instance, in Tribolium, 15, 9 and 8 percent of genes are involved in "biological regulation," "developmental process" and "establishment of localization," respectively, while in Onthophagus, the corresponding percentages are 14, 8, and 7. The largest differences are still modest and fall within the "metabolic process" category (39% of Onthophagus genes fall into this category, versus 32% of Tribolium genes); which could prove to be an interesting consequence of the differences in diet between the two species.
By comparing our assembled sequences to existing databases, our analyses suggest that we have sampled about half of the Onthophagus transcriptome. For instance, around 20,000 contigs and singletons match to approximately 10,000 separate genes (depending on the database used; at an E-value of 1 × 10-5; see Table 2, 3). Based on the Drosophila genome, we can estimate that Onthophagus may possess around 20,000 genes; thus, we have sampled about half of the genes present in the genome. If we assume that the 4,000 higher quality, but highly diverged genes discussed above, which did not align to known genes, match to about 2,000 additional, independent genes (as in Table 2, 3), then we may have sampled about 60% of the transcriptome. Regardless, this dataset represents a rich resource for future work on the system.
Next generation sequencing and the de novo assembly of transcribed sequences is increasingly being used to characterize the transcriptomes of non-model organisms for which a whole-genome sequence is not yet available . The absence of a reference genome sequence makes the assembly of these sequences particularly challenging. Because 454 sequencing results in sequence reads that are generally shorter than a given gene, assembly relies on generating a series of overlapping reads. However, any sequence variants among the reads, for instance due to genetic variation between individuals, sequencing of paralogs of a gene, or alternative splicing, can make assembly difficult. Furthermore, overlapping reads may be assembled for separate components of a gene, resulting in multiple contigs representing one gene. We used several complementary bioinformatic approaches to overcome the limitations of short read lengths.
We used a clustering approach against the HomoloGene database to determine whether our non-redundant sequences (contigs and singletons) derived from the same gene within the genome. Similar approaches have been demonstrated to be effective in clustering contigs of transcribed sequences in the absence of a reference genome [65–67]. This method allowed us to identify the highest confidence set of non-redundant protein-coding sequences in our dataset. For instance, with an E value < 1 × 10-5, 18,976 Onthophagus sequences matched sequences from 12,464 HomoloGene groups (Table 3). This method was concordant with our more general approach to identify the number of genes sampled, where we tested for sequence matches across several databases, including the Tribolium genome (Table 2).
In the absence of a reference genome sequence it is very difficult to identify sequences that derive from the same gene but fail to assemble due to sequence polymorphisms or alternative splicing. This is particularly true in the absence of informative similarities to sequences from related species. One approach to this problem has been to remove a sub-set of redundant contigs without resolving the relationship between the sequences . Here we used a novel analysis of "broken reads," individual sequence reads that were placed in two or more contigs during assembly. This approach identifies groups of sequences connected through multiple broken reads ("bi-connected components") and are thus biologically linked, but have failed to assemble due to sequence polymorphisms, alternative splicing or gene duplication (see Figure 3). To our knowledge this approach is unique in being able to cluster such sequences in the absence of a reference genome. We then used sequence similarity amongst the sequences within bi-connected components to infer the most likely physical origin of the connected sequences. For instance, when linked contigs shared at least 95% sequence identity, they were assumed to be divergent alleles and the sequences were merged. Of the sets of linked contigs that fit this criterion, 99% (1,547/1,565) matched to the same protein in the NCBI NR database, suggesting our assumptions were correct. Linked contigs with less sequence similarity were classified as either putative duplicates (85-95% similarity) or alternatively spliced transcripts (<85% sequence similarity), both of which were supported by comparing our classifications to existing gene models. It is important to emphasize that while the clustering of bi-connected components provides very strong evidence that the sequences derive from either alleles, alternatively spliced transcripts, or gene families, the classification based on sequence similarity is not definitive. The definitive resolution of gene structures must ultimately rely on genome and/or full length cDNA sequencing. Despite this caveat the clustering based on broken reads is an effective means of grouping related sequences in the absence of a reference genome.
Our preliminary SNP analyses lend further confidence in the classification of genes as alternative alleles. First, we found that transitions were more common than transversions (Additional file 9), as commonly reported in studies that consider patterns of genetic variation [69–72]. Second, we used previously reported microarray data [30, 46] to test whether the SNPs we identified were related to patterns of gene expression in manners consistent with other studies. We found significant negative relationships between SNP frequency and overall levels of gene expression, consistent with the commonly reported relationship between gene expression and gene conservation [73, 74]. We also found a negative relationship between SNP frequency and the number tissues in which a gene was significantly expressed, which is reminiscent of the positive relationship between tissue-specific gene expression and sequence divergence due to pleiotropic constraints [75, 76].
Overall, our use of multiple, complementary bioinformatic approaches allowed us to glean a large amount of information from one 454 run. We are now primed for a range of studies on the Onthophagus system, some of which we highlight below. Furthermore, this analysis pipeline will allow similar resources to be developed for a range of emerging model systems.
The resources generated in the present study provide an expansive toolbox for advancing current, and enabling future research efforts in horned beetles. Here, we highlight three particularly interesting avenues for future inquiry.
Horned beetles in general, and Onthophagus beetles in particular, are becoming attractive models for studying the origin and diversification of novel traits, specifically horns . Horns lack obvious homology to other traits in insects or non-insect arthropods, yet develop at least in part similar to more traditional appendages such as legs and mouthparts. Thus, beetle horns offer an interesting opportunity to study how evolutionary changes in the interactions between ancestral developmental-genetic mechanisms may enable the origin of novel features. Earlier studies have begun to implicate several developmental pathways in the regulation of horn development using immunohistochemical analysis of candidate genes (e.g. limb patterning [28, 29], programmed cell death ), quantitative PCR (e.g. insulin signaling ), hormone manipulations (e.g. juvenile hormone metabolism ) and most recently RNAinterference . In each case, analysis of candidate genes was limited to very few or one gene candidate. In the present study we substantially increase the number of candidate genes now available to investigate the role of these pathways in horn development and evolution (Table 4). Furthermore, we provide a substantial number of candidate genes for the investigation of developmental pathways previously inaccessible for study, yet hypothesized to play a potentially significant role in the origin and diversification of beetle horns and horned beetles, such as the notch, Wnt, and EGFR signaling pathways . In so doing the present study also contributes important resources for moving beyond the examination of single candidate genes and toward examining interactions between developmental pathways and within and between gene networks.
The biology of horned beetles is characterized by a remarkable degree of phenotypic plasticity - a single genotype's ability to adjust phenotype expression to changes in environmental conditions. In horned beetles, such plastic responses involve behavioral traits (e.g. fighting vs. sneaking reproductive tactics in males ), parental investment [25, 26], physiological changes (timing of metamorphosis , thermoregulation ) and morphology (horns , testes investment ). However, the developmental genetic basis of this plasticity is largely unknown. Here, our 454 run has generated a rich set of tools for future inquiry into the genetic underpinnings of phenotypic plasticity. For instance, the insulin signaling pathway has been suggested to play an important role in mediating the plastic switch between fighter and sneaker tactics in horned beetles . Our sequencing effort identified many important genes in this pathway, including chico, FOXO, insulin receptor, and melted.
Similarly, DNA methylation is another important candidate pathway for nutrition-induced phenotypic plasticity . The importance of DNA methylation in insects has been of interest in recent years, following the discovery that a complete methylation machinery - while absent in Drosophila - is present in the Hymenoptera , shows significant variation across species , and appears to play an important role in nutrition-induced caste determination in honeybees [56, 85]. The present study has identified the complete set of all three DNA methyltransferases (dnmt 1-3) in Onthophagus taurus, including the de novo methyltransferase (dnmt 3) and the maintenance methyltransferase (dnmt 1), setting the stage for future studies into the role of DNA methylation in horned beetle plasticity
Our SNP analyses identified overall levels of genetic variation comparable to 454 studies of other animals sampled from wild populations [70, 71], but considerably more relative to domestic or lab strains of animals [59, 86] and considerably less relative to plants, even domestic varieties [69, 87]. This SNP dataset will serve as a powerful resource in future studies of genetic variation. We can now easily genotype individuals and sample standing levels of genetic variation. Having SNPs associated with this assembly also primes us for more powerful analyses of how gene expression patterns affect genetic variation, for instance due to relaxed selection on morph- or environment-specific genes .
Our SNP dataset also brings us closer to identify patterns of natural selection on the Onthophagus genome, and determining which genes are under strong positive or purifying selection. Such analyses will be facilitated by sequencing other species in the genus (and the calculation of dN/dS). Until then, we can get hints at classes of genes under strong or weakened selection, based on our gene ontology enrichment analysis of more or less variable genes. For instance, very few GO categories were associated with highly variable genes (actin binding, cytoskeletal protein binding; Additional file 10). This could be because these variable genes are more divergent and have no or incomplete annotation. Indeed, of these 1900 genes (5% of 38,000 contigs), 47.4% were annotated in the low variation group and 33.2% were annotated in the high variation group. In contrast, the least variable genes were enriched for over 60 GO terms (Additional file 10). Scanning this list reveals many processes that are likely under purifying or positive selection, including metabolism (GO terms: metabolic process, primary metabolic process, macromolecule metabolism, etc.), development (developmental process, multicellular organismal development), cell cycle (cell cycle, cell death), morphology (anatomical structure development and morphogenesis, nucleic acid binding (nucleotide binding, RNA binding), and life history (death, reproduction). While these processes may be biologically relevant and possibly indicative of the origin and rapid diversification of novel traits in this lineage (horns), it is important to treat these lists with caution as such enrichment analyses can be confounded by nested gene ontology categories . Regardless, this analysis yields genes and developmental processes that may prove interesting for future study in this system.
This study sampled sequences from approximately half of genes expressed in the horned beetle Onthophagus taurus. This greatly advances our knowledge of the Onthophagus transcriptome and paves the way for future molecular genetic studies of horn evolution and development.
Beetles used in this study were reared as described previously . To avoid possible contamination of the gut fauna, we used head and three thoracic segments from our larval samples. Instead of dissecting head and thoracic segments from pupae, we used the whole body. Late pupae were transferred to a clean, humid chamber before emergence and adults were not fed with any cow dung to avoid possible contamination from the food source. To enrich the pool of expressed genes with various classes of transcripts, we included all the major stages of postembryonic dung beetle development: mid 3rd instar larva (5 days after molt), late 3rd instar larva (11 days after molt), early and late prepupa, pupa within 24 hrs after pupation, between 36 and 48 hrs after pupation, mid pupa (7 and 9 days after pupation), late pupa (12 days after pupation), and adults 4 days after molt. Each stage includes two males and two females except for pupa day 7 (one female) and day 9 (one male). In total, we used 32 animals to extract total RNA. Total RNA was isolated as described in . RNA quantity and quality were tested using an Agilent Bioanalyzer 2100. Total RNA was then sent to the Center for Genomics and Bioinformatics at Indiana University, Bloomington (IU CGB), which prepared a normalized transcriptome (cDNA) library optimized for Roche 454 GS-FLX Titanium sequencing using custom methods (K. Mockaitis, unpublished, available upon request). This library was sequenced using the GS-FLX Titanium process on a full PicoTitre plate, according to the manufacturer.
Repeats, transposons, small RNAs and low complexity regions were identified using RepeatMasker  in conjunction with the RepBase  database for Drosophila. O. taurus sequences were aligned to the following public sequence databases using BlastX and tBlastX (-F F -e 1e-5): NCBI T. castaneum UniGene #12 , T. castaneum genome v3.0 , T. castaneum annotated proteins v3.0 , and NCBI HomoloGene build 64 . Open reading frames (ORFs) were predicted using ORF finder  and ORFpredictor . The taxonomical analysis of sequence matches was performed using MEGAN . Gene Ontology analysis was performed using Blast2Go . The analysis of metabolic pathways was performed by first using Blast sequence matches against HomoloGene  to assign Enzyme Commission (EC) numbers, and then using iPath  to visualize metabolic pathways.
Two different approaches were used to cluster non-redundant sequences. In the first approach broken reads - reads that were split and assembled in two or more contigs by Newbler - were used to group contigs. Using a custom script (script available at: https://www.sourceforge.net/projects/snp454), broken reads were used to construct graphs; where nodes represented contigs and edges represented broken reads, and contigs forming connected components (CC) and bi-connected components (BCC, subgraphs that are not split if any one edge is removed) were identified. Briefly, the script was fed Newbler assemblies in ACE format and performed the following steps: (i) identify broken reads based on read names, (ii) construct graph using Pearl Graph Module , (iii) identify CC and BCC using Pearl Graph Module function, (iv) flag each BCC as putatively allelic, duplicated, or alternative splicing according to the criteria described below.
The contigs connected in BCCs are likely to represent allelic variants, duplicated genes, duplicated exons, alternative splicing, or a combination of the aforementioned. It is to be expected that contigs that failed to assemble due to allelic variation will share higher sequence similarity than duplicated genes/exons, and that alternatively spliced exons will share the lowest sequence similarity. In order to explore whether we could distinguish between these classes based on sequence similarity alone, we analyzed the sequence similarity between recently duplicated genes , and between alternatively spliced exons in the D. melanogaster genome (FlyBase FB2010_07). Recently duplicated genes (e.g. Nuclear transport factor-2 and Sperm-specific dynein intermediate chain) can share regions of up to 95-100% sequence identity. Given that this exceeds the sequence similarity cut-off commonly used for the assembly of 454 sequences [48, 49], it is not possible to definitively distinguish between BCCs representing allelic variants from those representing duplicated genes. However, an earlier scan for segments duplicated in the D. melanogaster genome identified 82 groups of duplicated sequence (average length 3.7 kb) that share 95% or greater sequence identity , indicating that such cases are rare. On the other hand 97% of alternatively spliced genes in Drosophila do not have pairs of exons with Blastn matches with E-value < 1 × 10-5 (excluding overlapping exons). Thus, while most alternatively spliced exons fall below this cut-off, there are exceptions and it is not possible to definitively distinguish between BCCs representing alternative splicing from those representing duplicated genes/exons or allelic variation, based on sequence similarity alone. Having established that sequence similarity performs reasonably well in classifying alleles, duplicated genes and alternatively spiced exons; we performed inter se Blastn sequence alignments among the contigs within BCCs, and classified them as follows. First, highly similar sequences were flagged as putative allelic sequences if they met the following criteria: at least 50 bp of at least 95% sequence identity, flanked on either side by no more than 10 nt of less than 95% sequence identity (single stranded overhangs of >10 nt were permitted). Second, pairs of contigs not meeting the first criteria but having Blastn matches of E-value < 1 × 10-5 were flagged as putatively representing duplicated genes. Finally, BCCs with contigs failing to meet the first two criteria were flagged as putatively representing alternatively spliced transcripts.
The second approach to cluster non-redundant sequences utilized sequence matches to the HomoloGene database of groups of homologous sequences from sequenced genomes . The non-redundant sequences were assigned HomoloGene IDs based on the best Blast matches to the database (minimum cut off = E-value < 1 × 10-5). Clusters of O. taurus non-redundant sequences that all shared the best Blast match to the same HomoloGene group were defined as "major clusters".
Sequence variant call programs suffer from the fact that when assembling contigs, the Newbler algorithm introduces gaps instead of substitutions in alignments between reads. To overcome this problem, we realigned sequences within contigs using a custom script (Script available at: https://www.sourceforge.net/projects/snp454). This script was fed Newbler alignments in ACE format and performed following steps: (i) extracted pairwise alignments, (ii) removed homopolymeric gaps from pairwise alignments, and (iii) ran MosaikAssemble  to generate multiple sequence alignments. These multiple sequence alignments were then fed to GigaBayes  to predict sequence variants. The sequence variants predicted by GigaBayes were filtered out for high confident sites if indel variants occur in homopolymer regions, read coverage is less than 5 or greater than 100 and the probability of sequence variants is less than 0.9.
To analyze patterns of SNP variation, we first calculated a "residual" SNP number from a standard least squares linear model that controlled for read length and number of reads (both factors were first log transformed and treated as fixed effects in the model). Previous studies have acknowledged the importance of controlling for both factors when estimating genetic variation from 454 data , but we feel this analysis improves on previous metrics. By using a residual calculated from a predicted value, we can control for the fact that a SNP reading of "0," could be due to low genetic variation, or low power due to short read lengths. We performed several exploratory analyzes of SNP variation. First, we used past microarray data to test if genetic variation was related to gene expression patterns. Our microarray data  were based on the past cDNA EST assembly  that was incorporated into the current assembly. For any microarray construct that matched more than one contig in the 454 assembly we averaged the residual SNP frequency. Expression data are described in detail in . Briefly, gene expression was measured in first day pupae of females, large, horned male and small, sneaker males in the head horn epidermis, thoracic horn epidermis and legs relative to abdominal epidermis and in the central brain relative to ganglionic neural tissue (N = 48 total arrays). We focused on four measures of gene expression: total expression level ("A"), total tissue types (out of 4) in which differential expression was detected, bias in gene expression between male morphs and between males and females (averaged over all tissues). We recognize that the relationship between tissue specificity and SNP frequency could be confounded by our method of detecting SNPs. Specifically, a highly expressed, tissue-specific gene may be detected only in one or two individuals (that are currently expressing this gene), thus decreasing the probability of detecting SNPs in that gene. However, this prediction is opposite that predicted (and found) in our data; that tissue-specific genes are more variable. In our second set of analyses, we were interested in whether genes with the greatest or least amount of variation were enriched for any GO terms. We performed an enrichment analysis using Fisher's Exact Test implemented in Blast2Go .
We thank the Center for Genomics and Bioinformatics at Indiana University and its staff, especially Keithanne Mockaitis and John Colborne for their help and expertise in executing this study. Funding for this study was provided by National Science Foundation grants IOS 0820411 to JA and APM and IOS 4824311 to APM. ESR was supported by NIH NRSA F32GM083830. Additional funding for work in the Center for Genomics and Bioinformatics, was provided in part by the METACyt Initiative of Indiana University, funded in part through a major grant from the Lilly Endowment, as well as the National Research Foundation of Korea [NRF-2009-352-D00275] to HS.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.