Genome sequencing efforts focused on the phylum Arthropoda have grown enormously with advances in genomics and bioinformatics. As of May 2012, the National Center for Biotechnology Information (NCBI) reported 222 arthropod genomes as assembled or in progress. Importantly, 82% of these projects are for true insects (Hexapoda: Insecta). Genomic resources for the remainder of the arthropod phylum are far more limited: 24 projects are reported for crustaceans, one for myriapods, and 16 for chelicerates. The subphyla have divergent evolutionary histories exceeding 500 million years. Broadening the genome survey across this phylogenetic distance contributes to the discovery of lineage-specific genes [2–4], making the sources of orphan genes a particularly relevant question for these taxa [5, 6].
Lack of homology to sequences in genomic databases prevents the putative assignment of function to orphans, which typically represent at least 10% of an organism’s gene set [3, 7]. Orphans are commonly attributed to adaptations associated with a taxon’s unique biology. This argument was most recently advocated by researchers of the Daphnia pulex genome, in which a remarkable 36% of genes showed no homology to other datasets. There are, however, numerous potential sources of orphan gene sequences, and these must be thoroughly investigated in light of the high representation of orphans in sequenced genomes and transcriptomes. Poor-quality sequence and poor assembly are the least interesting and arguably the most likely sources of unassigned sequences. Moreover, genes without homology to other datasets may be non-functional [3, 8–13]. Taxonomic isolation among representative lineages in genome databases can also contribute to lack of homology. For instance, D. pulex and Tetranychus urticae were the first crustacean and chelicerate, respectively, with annotated genomes, and the proportion of orphans in each far exceeds that for the closely related but more heavily sampled insect genomes [5–7].
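The operational definition above (a gene is an orphan when it shows no homology to other datasets) can be sketched as a simple filter over similarity-search results. This is a minimal illustration, not the procedure used in this study: the E-value cutoff, the function names, and the toy hit lists are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Assumed cutoff for illustration only; the text does not state a threshold.
EVALUE_CUTOFF = 1e-5


def find_orphans(
    hits: Dict[str, List[Tuple[str, float]]],
    cutoff: float = EVALUE_CUTOFF,
) -> List[str]:
    """Return IDs of genes with no database hit at or below the E-value cutoff.

    `hits` maps a gene ID to (database name, E-value) pairs, e.g. parsed
    from a tabular similarity-search report.
    """
    orphans = []
    for gene_id, gene_hits in hits.items():
        has_homolog = any(evalue <= cutoff for _db, evalue in gene_hits)
        if not has_homolog:
            orphans.append(gene_id)
    return sorted(orphans)


def orphan_fraction(
    hits: Dict[str, List[Tuple[str, float]]],
    cutoff: float = EVALUE_CUTOFF,
) -> float:
    """Fraction of the gene set classified as orphans (0.0 for an empty set)."""
    if not hits:
        return 0.0
    return len(find_orphans(hits, cutoff)) / len(hits)


# Hypothetical toy data: gene IDs and database names are invented.
example_hits = {
    "gene_a": [("insect_db", 1e-40)],                 # strong hit
    "gene_b": [("insect_db", 0.5), ("tick_db", 2.0)], # only weak hits
    "gene_c": [],                                     # no hits at all
    "gene_d": [("tick_db", 1e-8)],                    # hit below cutoff
}

print(find_orphans(example_hits))
print(orphan_fraction(example_hits))
```

A filter of this kind only identifies candidates. It cannot by itself separate true lineage-specific genes from poor-quality sequence, non-functional genes, or sparse taxonomic sampling, which is why the sources of non-homology need separate evaluation.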
For the study of orphan genes, transcriptome projects serve as an important complement to whole-genome sequencing. They provide a more rapid and less expensive approach to obtaining gene sequences. In addition, transcriptome sequencing projects typically focus exclusively upon protein-coding regions. These are translated to amino acid sequences, which are more likely to be conserved than the underlying nucleotide sequences [14–16]. Focusing upon conserved sequences favors identification of true orphan genes. Finally, transcriptomes are an effective proxy for estimating gene diversity and sampling orphan genes when other genomic data are limited. This is contingent upon having a sufficient number of expressed sequence tags (ESTs) that are enriched for full-length transcripts, normalized to sample rare mRNA, and sampled from biologically variable pools of RNA to obtain transcripts associated with diverse tissue types and biological processes [5, 15, 17, 18].
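The point that translated sequences tend to be better conserved can be illustrated with a toy example: two coding sequences that differ only by synonymous substitutions diverge at the nucleotide level yet encode the same peptide. The sketch below assumes the standard genetic code; the two sequences and the function names are invented for illustration and are not drawn from the A. americanum data.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1; trailing partial codons are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences differing only by synonymous substitutions.
seq_1 = "ATGGCTCTGAAA"
seq_2 = "ATGGCCTTAAAG"

print(identity(seq_1, seq_2))                        # nucleotide identity
print(identity(translate(seq_1), translate(seq_2)))  # protein identity
```

Here the nucleotide sequences differ at four of twelve positions while the translated peptides are identical, which is the reason homology searches are usually more sensitive at the protein level.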
The arthropod subphylum Chelicerata includes scorpions, horseshoe crabs, spiders, mites, and ticks. These lineages are more diverse than Crustacea and equally understudied. The chelicerate subclass Acari comprises the tick and mite lineages. Within the Acari, draft genomes of Tetranychus urticae, the two-spotted spider mite, and Ixodes scapularis, the black-legged deer tick [19, 20], are available, with that of Rhipicephalus microplus, the southern cattle tick, in progress [21, 22]. The need for more comprehensive genomic and transcriptomic data within the Acari is pressing given that many species are obligate blood-feeders that vector human and animal pathogens, including the agents of typhus, Lyme disease, Rocky Mountain spotted fever, and ehrlichiosis. More data from blood-feeding chelicerates would allow comparisons among ticks and with blood-feeding insects to identify shared pathways that could be exploited in control efforts. Indeed, numerous transcriptome projects targeting the salivary glands of at least 12 tick species have implicated several gene families as central to blood-feeding [24–35].
The lone star tick, Amblyomma americanum, is one of the most abundant vectors of zoonotic pathogens in the United States [36–38]. White-tailed deer are key hosts of A. americanum, and their ongoing expansion into suburban areas has increased tick-human interactions [36–38]. Lone star tick bites are associated with many diseases including human monocytic ehrlichiosis, southern tick-associated rash illness [40, 41], tularemia, several pathogenic Rickettsia [36–38, 43], and perhaps the recently discovered Heartland Virus. As of September 2012, only 6,502 ESTs were available for A. americanum on GenBank, derived primarily from specific analysis of gene expression associated with tick salivary glands and blood-feeding [32, 46]. A whole-organism transcriptome would complement these previously available sequences by increasing gene number and diversity.
Here, we present a comprehensive study of a normalized EST library for A. americanum enriched for unique, non-redundant transcripts. This library more than doubles the number of sequences previously available for this species. It represents a compilation of sequences from five life stages from a laboratory colony (i.e., larva, nymph, adult male, adult female, engorged female) and from a cohort of ticks collected from the wild. This approach reveals a large number of genes lacking homology to existing tick and other arthropod genomic datasets. We also outline a framework for evaluating orphan genes, with the aim of distinguishing the primary sources of non-homology. Our results argue for greater recognition and critical assessment of lineage-specific genes, notably in ticks and other understudied taxa.