The utility of PacBio circular consensus sequencing for characterizing complex gene families in non-model organisms

Larsen, Peter A; Heilman, Amy M; Yoder, Anne D

doi:10.1186/1471-2164-15-720

Research article
Open access
Published: 26 August 2014

The utility of PacBio circular consensus sequencing for characterizing complex gene families in non-model organisms

Peter A Larsen¹,
Amy M Heilman¹^nAff2 &
Anne D Yoder¹

BMC Genomics volume 15, Article number: 720 (2014) Cite this article

7843 Accesses
36 Citations
16 Altmetric
Metrics details

Abstract

Background

Molecular characterization of highly diverse gene families can be time consuming, expensive, and difficult, especially when considering the potential for relatively large numbers of paralogs and/or pseudogenes. Here we investigate the utility of Pacific Biosciences single molecule real-time (SMRT) circular consensus sequencing (CCS) as an alternative to traditional cloning and Sanger sequencing PCR amplicons for gene family characterization. We target vomeronasal gene receptors, one of the most diverse gene families in mammals, with the goal of better understanding intra-specific V1R diversity of the gray mouse lemur (Microcebus murinus). Our study compares intragenomic variation for two V1R subfamilies found in the mouse lemur. Specifically, we compare gene copy variation within and between two individuals of M. murinus as characterized by different methods for nucleotide sequencing. By including the same individual animal from which the M. murinus draft genome was derived, we are able to cross-validate gene copy estimates from Sanger sequencing versus CCS methods.

Results

We generated 34,088 high quality circular consensus sequences of two diverse V1R subfamilies (here referred to as V1RI and V1RIX) from two individuals of Microcebus murinus. Using a minimum threshold of 7× coverage, we recovered approximately 90% of V1RI sequences previously identified in the draft M. murinus genome (59% being identical at all nucleotide positions). When low coverage sequences were considered (i.e. < 7× coverage) 100% of V1RI sequences identified in the draft genome were recovered. At least 13 putatively novel V1R loci were also identified using CCS technology.

Conclusions

Recent upgrades to the Pacific Biosciences RS instrument have improved the CCS technology and offer an alternative to traditional sequencing approaches. Our results suggest that the Microcebus murinus V1R repertoire has been underestimated in the draft genome. In addition to providing an improved understanding of V1R diversity in the mouse lemur, this study demonstrates the utility of CCS technology for characterizing complex regions of the genome. We anticipate that long-read sequencing technologies such as PacBio SMRT will allow for the assembly of multigene family clusters and serve to more accurately characterize patterns of gene copy variation in large gene families, thus revealing novel micro-evolutionary patterns within non-model organisms.

Background

Multigene families have played a fundamental role in the evolution of metazoan genomes [1–5]. Mechanisms such as gene duplication, gene conversion, and lineage diversification underlie multigene family complexity and contribute to genetic patterns that can be extremely difficult to molecularly characterize [6, 7]. Whereas processes such as positive selection and lineage diversification can yield gene copies of increasing nucleotide divergence, opposing processes such as gene duplication and gene conversion can yield copies that are so similar that they are virtually impossible to distinguish from sequencing error [8, 9]. The accurate characterization of gene copy number is fundamental to the differentiation of paralogy and orthology, and by extension, to the identification of heterozygotes versus homozygotes. This latter distinction is in turn central to determining the effects of genotype on phenotype, with the MHC (Major Histocompatibility Complex) gene family offering a classic example [10–13].

Given the intrinsic interest of accurate gene copy representation, it follows that methods of molecular characterization should be highly sensitive both to levels of low nucleotide diversity and to regions of high complexity. Unfortunately, such is not presently the case for organisms that lack a well-characterized genome: i.e., non-model organisms. Although low-coverage “draft” genomes are increasingly available for non-model organisms, these draft genomes are notoriously unreliable for accurate gene calling, particularly for regions of high genomic complexity [6, 14]. Thus, until such time that high-coverage, fully-assembled and annotated genomes are available for all species of interest, alternative molecular methods are desirable.

Long-read next generation sequencing technologies provide unique opportunities for genome-based research of non-model organisms [15]. Recent upgrades to the Pacific Biosciences RS instrument have improved sequence accuracy of circular consensus sequencing (CCS) yielding high-quality sequences about 500 to ~2,500 base pairs in length. CCS allows for the repeated sequencing of individual inserts and, depending on template length, stochastic sequencing errors are reduced with each CCS pass [16, 17]. Here, we investigate the utility of SMRT CCS as an alternative to traditional approaches to sequencing large closely related gene families (e.g. Sanger sequencing of cloned products). We used SMRT CCS technology, in combination with clustering algorithms and phylogenetic analyses, to measure subfamily gene diversity of vomeronasal G protein-linked receptors within a non-model primate, the gray mouse lemur (Microcebus murinus; Figure 1). This species was selected for analysis for several reasons. First, recent work has shown that the V1R complex is extremely diverse with a very high proportion of intact gene copies in this primate, perhaps having the highest proportion of functional V1R copies of any mammal [18]. Second, a draft genome is available for comparison against which we can validate the data generated by our study. And third, a deep-coverage fully-assembled genome is expected for release within the coming year (J. Rogers, pers. com.) allowing for final validation of gene copies generated from more approximate methods (i.e., bioinformatic mining of the draft genome).

The vomeronasal organ (VNO) is the primary olfactory organ responsible for pheromone detection in mammals and two families of VNO G protein-linked receptors (V1R and V2R) allow for the recognition of different classes of chemosensory cues. V1Rs are typically encoded by a single coding exon (~900-1000 base pairs) and are distantly related to bitter taste receptors whereas V2Rs are encoded by multiple coding exons and are closely related to Ca²⁺-sensing receptors [19–21]. Much research has been directed to distinguishing V1R receptors, owing to their single coding exon and relatively short sequence length characteristics [18, 20, 22]. However, despite recent advancements in the understanding of V1R sequence diversity in mammals [18, 23, 24], relatively little is known about the intra-specific or intra-individual (i.e., intragenomic) V1R diversity of many non-model species. This is because V1R repertories are hypothesized to evolve rapidly and are likely lineage-specific, resulting in relatively few one-to-one orthologs between species [18, 25]. These factors make characterization of the V1R repertoire difficult for DNA sequence based studies because i) traditional approaches to sequencing large, closely related gene families can be time consuming and expensive (e.g. cloning and subsequent Sanger sequencing of PCR amplicons), and ii) short-read high-throughput sequence data of such gene families are difficult to assemble given high sequence similarity and the potential for multiple paralogs to exist throughout the genome.

Many prosimian species (e.g. bushbabies, lemurs, lorises, and tarsiers) have evolved highly specialized pheromone communication mechanisms and are hypothesized to have a large number of functional V1R genes [18, 26–29]. In particular, mouse lemurs practice complex chemosensory communication (e.g. urine washing, scent marking, etc.) and accordingly, species of mouse lemurs have diverse V1R repertoires [18, 29, 30]. We present SMRT CCS data from two V1R subfamilies of two individuals of the gray mouse lemur (Microcebus murinus). Our study is designed to make comparisons of a biological nature (i.e., intragenomic versus intraspecific sequence variation) and of a methodological nature (i.e., Sanger sequencing of PCR clones versus CCS versus bioinformatic mining of a low-coverage draft genome; Table 1). In order to make these comparisons, we generated sequences via CCS of the same individual M. murinus from which the draft genome was generated/assembled (hereafter referred to as M. murinus 1). A second individual was included to enable comparisons within a single genome versus comparisons between genomes within a single species. This second individual was also used in a previous study of strepsirrhine V1R variation that utilized Sanger sequencing of cloned amplicons [29] (hereafter referred to as M. murinus 2). In summary, our study design allows for comparison of sequence characterization of a complex multigene family derived from distinct methods and sequencing technologies (Table 1).

Table 1 Intra- and intergenomic comparisons of Microcebus murinus V1R subfamily diversity examined herein

Full size table

Results

CCS quality and clustering

We generated 62,159 CCS reads (minimum of 2 CCS passes; SMRT cell 1 = 29,556; SMRT cell 2 = 32,603). A bimodal distribution of sequence lengths was observed for each SMRT cell, corresponding to the V1RI (~725 bp) and V1RIX (~800 bp) amplicon sizes (Additional file 1: Figure S1). Minimum and maximum read lengths of raw data from SMRT cell 1 were 48 bp and 2,316 bp and for SMRT cell 2 min and max read lengths were 318 bp and 2,364 bp. Average quality score per read increased per CCS polymerase pass (Figure 2) and, for both libraries, we selected all reads having a minimum of 4 CCS passes and an average Phred quality score of 20 for 90% of bases per read for downstream analyses. This quality filtering approach resulted in 34,088 reads available for analysis (16,914 and 17,174 reads for M. murinus 1 and M. murinus 2, respectively; Table 2). Based on sequence length 12,625 and 11,814 reads were classified as V1RI and 4,289 and 5,360 reads were classified as V1RIX for M. murinus 1 and M. murinus 2, respectively (Table 2).

Table 2 Results of CCS and cluster analyses of the V1R I and V1R IX repertoires of two individuals of M. murinus

Full size table

Clustering analyses resulted in 8,545 and 8,694 V1RI clusters and 2,936 and 3,673 V1RIX clusters for M. murinus 1 and 2, respectively (Additional file 1: Figure S2). Of these, approximately 18% and 17% of V1RI clusters and 5.4% and 4.6% of V1RIX clusters were identified as putative chimeras, M. murinus 1 and 2 respectively (Additional file 1: Figure S2). The majority of the chimeras (85%; n = 2,861) consisted of singleton clusters and only 13 chimeras had cluster sizes greater than a 7× threshold. Clusters consisting of putative chimeras were removed prior to all downstream analyses. Results of cluster analyses, including de novo chimera detection results, are presented in Table 2 and Additional file 1: Figure S2. We identified 15 clusters as consisting of putative pseudogenes and these were also excluded from downstream analyses. Final analyses were performed on consensus sequences from 106 and 114 V1RI clusters and 61 and 85 V1RIX clusters for M. murinus 1 and 2, respectively (≥7× coverage; Table 2). These consensus sequences were aligned using 98% and 99% similarity thresholds (see Methods) in order to determine the minimum and maximum number of V1R genes obtained using PacBio CCS technology (see Table 2). For M. murinus 1, 13 CCS V1R genes were not identified in sequences mined from the draft genome [18] and are considered novel.

Comparisons between CCS and draft genome V1R sequences

Young et al.[18] identified ~214 functional V1R genes within the draft M. murinus genome. Our M. murinus 1 is the same individual from which the draft M. murinus genome was sequenced, thus allowing for validation of our newly generated V1R CCS data. Using a minimum of 7× coverage, and a 98% similarity threshold, we recovered 27 of 30 functional V1RI subfamily sequences present within the M. murinus draft genome and 16 of these sequences were identical across all base pairs (Figures 3 and 4). When low coverage CCS reads were considered (i.e. < 7× cluster size) 100% of draft genome V1RI sequences were recovered (Figure 3). Of the CCS V1RI gene copies that we recovered, five were unique to the CCS data; i.e., they were not identified within the draft genome assembly in a previous study [18]. Collectively, the V1RI data (CCS sequences and those mined from the draft genome) suggest a conservative estimate of that subfamily repertoire to be 34 loci for the subfamily repertoire, a 12% increase from sequences recovered from the draft genome (Table 2). Nucleotide diversity, estimated mutation rate, and average number of nucleotide differences between V1RI sequences originating from CCS and those gathered from the M. murinus draft genome were similar (Table 3), with no significant difference being observed in the nucleotide variation between sequences generated by CCS technology and the draft genome V1RI repertoire (Additional file 1: Table S2).

Table 3 Polymorphism statistics of DNA sequence data for the V1R subfamilies examined herein

Full size table

We recovered approximately 58% of V1RIX draft genome sequences and this value increased to 79% with the inclusion of low coverage CCS reads (Figure 5). Seven V1RIX sequences were identical across all base pairs with sequences mined from the draft genome (Figure 5). Eight putatively novel V1RIX loci (i.e. absent from the draft genome assembly) were identified yielding an estimate of 32 loci for the M. murinus V1RIX subfamily repertoire, a 68% increase from those previously identified in the draft genome. Patterns of nucleotide variation were similar between the CCS V1RIX data and the draft genome sequences (Table 3) and no statistical difference was observed in the nucleotide variation between the two datasets (Additional file 1: Table S2).

Molecular variation of the strepsirrhine V1RI subfamily has been explored using a traditional Sanger sequencing of cloned inserts approach [29]. We used the same PCR primers as Yoder et al.[29] as well as DNA from one individual of M. murinus used in that study (herein M. murinus 2). We recovered approximately 82% of the sequences reported by Yoder et al.[29] for M. murinus 2, with 12 sequences being identical (Figures 3 and 6). We identified a maximum of 27 putative V1RI loci within the M. murinus 2 CCS data. When these data were combined with the Sanger sequence data [29] the V1RI repertoire size from M. murinus 2 is estimated to consist of ~36 loci (98% sequence similarity threshold). When compared with V1RI CCS sequence data, the nucleotide diversity and average number of nucleotide differences were slightly higher in sequences originating using the Sanger sequencing of clones approach [29] (Table 3), however no statistical difference was observed in nucleotide variation between the two datasets (Additional file 1: Table S2).

M. murinusV1R repertoire comparisons

Microcebus murinus 1 and 2 shared approximately 70% of their CCS V1RI repertoires and 15 sequences sharing greater than 99% sequence similarity were identified (Figure 3). Zero V1RI sequences were identified as matching across all base pairs between the two individuals (Figure 3). Alternatively, the V1RIX repertoires of M. murinus 1 and 2 overlapped by ~90% and 16 sequences were identified as sharing greater than 99% percent sequence similarity, with three of these matching at all base pairs (Figures 5 and 6). Intragenomic nucleotide distances within the V1RI and V1RIX subfamilies were 9.8% and 6.3%, respectively, for M. murinus 1 and values were similar for M. murinus 2 (Additional file 1: Table S3). All genetic distance calculations were similar for both inter- and intra-genomic comparisons and the V1RI subfamily exhibited greater variability relative to V1RIX (Additional file 1: Table S3). Phylogenetic analyses of amino acid variation of the V1RI and V1RIX repertoires for both individuals are presented in Figures 7 and 8. No significant differences in the magnitude of nucleotide variation were observed between the V1RI and V1RIX repertoires of M. murinus 1 and 2, respectively (Table 3).

Discussion

Insights into the Microcebus murinusV1R repertoire

Chemosensory communication has played a critical role in mammalian evolution from both physiological and behavioral perspectives. In particular, many species rely heavily on pheromones for intraspecific communication, especially with respect to sexual and social behaviors [31]. Of all mammalian orders, Primates exhibits what is perhaps the greatest variation of functional versus pseudogenized V1R diversity [18, 22, 23]. For example, no functional V1R genes have been identified in the macaque (Macaca mulatta) genome whereas ~214 intact V1R genes have been identified in M. murinus[18]. This proportion of functional versus pseudogenized V1R genes is likely correlated with a reduction of pheromone communication in some primate lineages (e.g. Old World catarrhines; [23, 32]), whereas lineages that have maintained or evolved enhanced chemical communication typically exhibit diverse V1R repertories (e.g. strepsirrhine primates [18, 29]). In particular, mouse lemurs practice complex chemosensory communications and M. murinus has one of the highest proportions of functional vs. pseudogenized V1R repertories of all mammalian species characterized thus far [18]. Our results reinforce this finding and suggest that the functional V1R repertoire of M. murinus has likely been underestimated, perhaps by as much as 25% in the V1RI and V1RIX subfamilies, collectively. This observation, coupled with recent documentation of strong positive selection throughout the mouse lemur V1R repertoire [30], strengthens hypotheses regarding the highly specialized pheromone communication mechanisms used by species of lemurs [26–29].

While the putative function of the V1RI subfamily is unknown [29, 30], the available data indicate that this subfamily binds a diverse variety of ligands [29]. Alternatively, the genetic variation within the V1RIX subfamily suggests this subfamily is more conserved, perhaps binding to a reduced number of ligand classes (Table 3; Figure 8). This hypothesis is reinforced by the observation that the two individuals examined herein share ~84% of their V1RIX repertoires, compared to ~60% shared V1RI loci, at the 98% sequence similarity threshold. Moreover, we identified three V1RIX loci that were identical between M. murinus 1 and 2, whereas zero V1RI loci were identical between the two individuals. These results support previous studies that have hypothesized differing rates of evolution within lemurid V1R subfamilies [29, 30]. Based on comparisons with putative mouse orthologs, Hohenbrink et al. [30] hypothesized that the V1RIX subfamily was closely related to the mouse V1Rc subfamily, a subfamily that has been shown to detect female, heterospecific, and predator cues in mice [33]. Future studies focused on the identification of the ligands associated with the nine known V1R subfamilies present in M. murinus will be an important advance for understanding the functional roles of these gene families and whether or not genetic variation underlying V1R repertoires contribute to the maintenance of species boundaries within the genus.

Utility of CCS for gene family characterization and discovery in non-model species

Sanger sequencing of cloned inserts is a well-established and common approach for characterizing multigene family diversity [29, 34–36]. Although effective, this method can be time-consuming, labor-intensive, and expensive. A growing number of studies utilize next generation sequencing technologies for targeted approaches to gene validation and discovery [37, 38]. These technologies have limitations however, and issues such as systematic and stochastic error rates as well as average read lengths must be considered when developing experimental designs [6]. We selected Pacific Biosciences SMRT sequencing technology because the long read lengths eliminated the necessity for downstream assembly of highly similar fragments and because CCS reduces stochastic error rates. Moreover, the option of filtering reads by number of CCS pass provides greater flexibility to quality control of PacBio sequence data. Recent advancements to the PacBio RS sequencing instrument and sequencing chemistry have improved read length and accuracy [15] and, as of this writing, Pacific Biosciences has released the RS II upgrade which allows for higher throughput and even greater read accuracy.

Relatively few studies have utilized PacBio CCS for targeted sequencing in non-model species [39, 40]. In the absence of high quality genome assemblies, the long read lengths provided by SMRT CCS offer new opportunities for characterizing complex multigene families (e.g. immunoglobulin, MHC, olfactory receptors, V1R, etc.). The observation that our clustering approach of CCS reads resulted in capturing 100% of V1RI sequences mined from the draft genome assembly (including low coverage clusters), coupled with the identification of 13 putatively novel genes (Figures 3, 4, 5, 6, 7 and 8), documents that the methods reported herein are useful for gene discovery and for describing the diversity of large gene families. Moreover, no bias was detected in the nucleotide variation of sequences originating from CCS clusters with respect to those mined from the draft genome [18] or generated via Sanger sequencing of cloned inserts [29] (Table 3; Additional file 1: Table S2). Our results concerning the reduced coverage of the V1RIX subfamily (when compared to V1RI) likely stem from PCR amplification bias and/or preferential sequencing of the shorter V1RI sequences in pooled sequencing libraries. This finding, in addition to the identification of putative PCR chimeras by de novo chimera detection software (Additional file 1: Figure S2), generally agrees with other studies that have identified PCR bias and PCR artifacts within data originating from high-throughput sequencing of PCR amplicons [41].

Although our experimental design is useful for identifying potentially unrecognized gene diversity, a major drawback is the inability to distinguish closely related paralogs and to reliably identify orthologs among individuals. This problem is compounded by the observation that V1Rs are encoded by a single exon [21, 23] and therefore lack intronic sequences that may help to identify orthologs and/or paralogs. Thus, future studies aimed at characterizing V1R gene diversity in non-model species may benefit from other methods such as the targeted capture and sequencing of genomic regions harboring V1R genes (e.g., using biotinylated probes in combination with PacBio sequencing of long templates). Such an approach would be useful for identifying orthologous and paralogous genes and for characterizing allelic variation in multigene families of non-model species.

Conclusions

Our findings suggest that the V1R repertoire of M. murinus is larger than previously hypothesized and underscore previous observations that low coverage genome assemblies provide a limited view of multigene-family diversity [14, 18]. Even so, it is probable that we have still underestimated V1R diversity given the potential for the clustering of closely related paralogs (i.e. < 2% sequence divergence). Importantly, the forthcoming availability of a high coverage (~150×) M. murinus genome (Human Genome Sequencing Center, Baylor College of Medicine) will allow our hypotheses regarding the V1R repertoire size to be more definitively tested.

Pacific Biosciences SMRT CCS provides an alternative to traditional Sanger sequencing of cloned inserts. We anticipate that the methods described herein will be useful for the characterization of diverse gene families in other non-model species where genome sequences are unavailable or consist of low coverage draft assemblies. Our results concerning the presence of putative PCR artifacts agree with previous observations [41] and necessitate the implementation of strict quality control measures when high-throughput sequencing is performed on libraries constructed from PCR amplicons. Modifications to our approach, such as barcoding and advanced targeted capture methodologies will be useful for increasing sample size and gene discovery. These methods will greatly advance genome assembly and annotation of multigene families in non-model species.

Methods

Molecular methods

We examined V1R sequences mined from the draft Microcebus murinus genome by Young et al.[18] and selected two diverse subfamilies (V1RI and V1RIX sensu Hohenbrink et al.[30]) for circular consensus sequencing (Additional file 1: Figure S3). These subfamilies were amplified from whole genomic DNA, isolated from two non-related individuals of M. murinus, using primers targeting conserved transmembrane regions 2 and 7 (V1RI) and 1 and 7 (V1RIX; Additional file 1: Table S1). We refer to the individual from which the draft genome was derived as M. murinus 1, and the second individual, included in the study by Yoder et al.[29], is referred to as M. murinus 2 (Duke Lemur Center voucher number 7013). Animal procedures were reviewed and approved by the Duke University Institutional Animal Care and Use Committee under protocol number A250-12-09.

Amplicons were obtained using a high fidelity Taq DNA polymerase (Platinum Taq; Invitrogen) and PCRs were conducted in 50 ul reactions with the following final concentrations: 1× high fidelity buffer, 2 mM MgCl2, 200 uM dNTPs, 0.8uM primers, 0.625 units Taq, and ~15 ng DNA template. The following touchdown thermal profile was used for all amplifications: initial denaturation 95°C for 3 min followed by 15 cycles of 95°C for 1 min, 60°C (1°C decrease per cycle) for 1 min, 72°C for 1 min 30 sec, then another 20 cycles of 95°C for 1 min, 45°C for 1 min, 72°C for 1 min 30 sec, and a final extension of 72°C for 10 min. PCR products were visualized on a 2% agarose gel using SYBR Green I (Lonza Rockland, Inc.) and bands within the expected size ranges (V1RI = ~725 bp and V1RIX = ~800 bp) were excised and extracted using the Mo Bio gel purification kit (Mo Bio Laboratories, Inc.).

Three PCR reactions per individual per locus were pooled separately and quantified using a NanoDrop spectrophotometer (Thermo Scientific). V1RI and V1RIX amplicons were then pooled (1.5 μg V1RI and 1.0 μg V1RIX) resulting in two 2.5 μg samples for the construction of two sequencing libraries. V1RI amplicons were enriched to ensure sequence coverage given the increased variation observed within the V1RI subfamily when compared to V1RIX[18, 30]. Samples were submitted to the Duke IGSP Genome Sequencing & Analysis Core Resource for real-time circular consensus sequencing using a Pacific Biosciences RS instrument and C2 chemistry. Two small-insert libraries (one per individual) were prepared following manufacturers protocols and were sequenced using two SMRT cells (one SMRT cell per library) with 2 × 55 min movie run times. The resulting bas.h5 files were used for downstream analyses.

Quality filtering and sequence clustering

CCS sequences were quality filtered using pbh5tools (https://github.com/ PacificBiosciences/pbh5tools) and the Galaxy platform [42–44]. The pbh5tools package was used to extract CCS fastq sequences from bas.h5 files according to minimum number of CCS pass, thus allowing for inspection of average read quality as a function of CCS pass (see Figure 2 and Results). We used FastQC software (v0.10.1; http://www.bioinformatics.babraham.ac.uk/projects) to summarize average Phred score per CCS pass category (Figure 2). Cluster analyses were performed on sequences that originated from a minimum of 4 CCS passes and within which 90% of the bases averaged a quality score ≥ Phred 20 (1% error rate). Our pooled samples consisted of amplicons separated by ~65 bp in length, thus allowing for demultiplexing V1RI and V1RIX sequences according to length.

The USEARCH software package (v6.0) [45] was used for clustering, de novo PCR chimera detection, and preliminary cluster alignment. The UCHIME algorithm (as implemented within USEARCH) was used to detect putative chimeric sequences with the de novo mode and an -abskew parameter of 2.0. Clusters containing putative chimeras were not included in downstream analyses. Quality filtered CCS sequences were clustered based on a 98% similarity threshold with the -cluster_fast option and resulting alignments of clusters containing ≥ 7 sequences (i.e., a 7× threshold) were imported into the Geneious software package (v6.1; http://www.geneious.com) re-aligned using the MAFFT (v7.017) alignment plugin and then manually edited for accuracy. We selected the 7× coverage threshold based on our chimera detection results (i.e., 0.4% of all clusters comprised of putative chimeras contained 7 or more sequences; see Results and Additional file 1: Figure S2). Cluster consensus sequences were identified as V1R using NCBI BLAST (http://blast.ncbi.nlm.nih.gov/) and V1R subfamily membership was confirmed by phylogenetic comparisons with Hohenbrink et al.[30].

The minimum number of distinct V1R genes for each subfamily was estimated following Rodriguez et al.[46] whereby cluster consensus sequences sharing greater than 98% nucleotide homology were considered identical. This approach reduced concerns of spurious results due to sequencing error and/or repertoire inflation due to paralogous loci, but at the same time it is likely to underestimate the total number of genuine V1R paralogous copies. To overcome this limitation, to some extent, we also used a 99% minimum genetic similarity threshold to estimate maximum V1R repertoire size. Moreover, 99% is the minimum genetic similarity separating V1R sequences mined from distinct regions of the draft M. murinus genome [18]. Cluster consensus sequences were translated into amino acids and were checked for complete open reading frame to identify putatively functional and pseudogenized loci. Final alignments for all consensus sequences are provided in Additional file 2: Dataset 1. Given the clustering approaches described above, in combination with the observation that we used primers that bound to conserved regions within the V1R exon, we anticipated that closely related paralogs would be clustered together and thus we refrained from attempting to identify allelic variation within potentially non-homologous loci.

Phylogenetic and statistical analyses

Alignments of PacBio CCS cluster consensus sequences with V1R data from Young et al.[18] (Additional file 2: Dataset 1) and Yoder et al.[29]; [GenBank:KF272289–KF272350] were performed using MAFFT v7.017 (gap open penalty 1.53; offset value 0.123) as implemented within the Geneious software package. Sequences originating from Yoder et al. [29] were also clustered based on the 98% threshold described above in order to avoid the incorporation of potentially paralogous loci in the analyses presented herein (Additional file 2: Dataset 2). Phylogenetic analyses were performed using MrBayes v.3.2 [47] and RAxML v.7.7 [48]. The GTR + gamma model of substitution was used for all Bayesian and Maximum Likelihood analyses. Statistical support for nodes was evaluated using Bayesian posterior probabilities (resulting from 5 million iterations, 4 heated chains, 25%% burn-in length) and maximum likelihood bootstrap support values (percentage of 1,000 iterations). Resulting trees were edited using FigTree v1.4 software (http://tree.bio.ed.ac.uk/software/figtree/). Pairwise sequence similarity was measured using custom BLAST searches with the percent identity output option. Sequence similarity was visualized using hive plots (jHive v0.0.18; [49]) and the arcdiagram R package (https://github.com/gastonstat/arcdiagram). The software package DNAsp v5.1 [50] was used to calculate basic polymorphism statistics for each V1R subfamily including number of segregating sites (S), average number of nucleotide differences between alleles (k), nucleotide diversity (π), Watterson’s estimator of population mutation rate (θ_W), and number of synonymous and nonsynonymous mutations. Genetic distances were calculated using the Kimura-2 parameter (nucleotide) and p-distance (amino acid) algorithms as implemented within Mega v5.2 software [51]. Genetic divergence among V1R repertoires was assessed using Chi-square statistical tests as implemented in DNAsp.

Availability of supporting data

Final V1R consensus sequences generated by this study have been deposited in GenBank and have the following accession numbers [KF721294 - KF721403]. CCS sequence data generated from M. murinus 1 (origin of the M. murinus draft genome) are identified with a specimen-voucher number of DGM01. Additional file 2: Dataset 1 contains alignments of all CCS data used in the final analyses as well as V1R data mined from Young et al.[18]. Additional file 2: Dataset 2 contains filtered V1RI sequences originating from Yoder et al. [29]. Both Additional file 2: Datasets are located at http://www.labarchives.com with the following doi:10.6070/H4G73BN0.

Abbreviations

CCS:: Circular consensus sequencing
V1R:: Vomeronasal type 1 receptor
V2R:: Vomeronasal type 2 receptor
SMRT:: Single molecule real-time.

References

Nei M, Rooney AP: Concerted and birth-and-death evolution of multigene families. Annu Rev Genet. 2005, 39: 121-10.1146/annurev.genet.39.073003.112240.
Article CAS PubMed Central PubMed Google Scholar
Holland PW, Garcia-Fernàndez J, Williams NA, Sidow A: Gene duplications and the origins of vertebrate development. Development. 1994, 1994: 125-133.
Google Scholar
Amores A, Force A, Yan Y-L, Joly L, Amemiya C, Fritz A, Ho RK, Langeland J, Prince V, Wang Y-L: Zebrafish hox clusters and vertebrate genome evolution. Science. 1998, 282: 1711-1714.
Article CAS PubMed Google Scholar
Iwabe N, Kuma K-i, Miyata T: Evolution of gene families and relationship with organismal evolution: rapid divergence of tissue-specific genes in the early evolution of chordates. Mol Biol Evol. 1996, 13: 483-493. 10.1093/oxfordjournals.molbev.a025609.
Article CAS PubMed Google Scholar
Lundin LG: Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse. Genomics. 1993, 16: 1-19. 10.1006/geno.1993.1133.
Article CAS PubMed Google Scholar
Alkan C, Sajjadian S, Eichler EE: Limitations of next-generation genome sequence assembly. Nat Methods. 2010, 8: 61-65.
Article PubMed Central PubMed Google Scholar
Hirsch CN, Robin Buell C: Tapping the Promise of Genomics in Species with Complex, Nonmodel Genomes. Annu Rev Plant Biol. 2013, 64: 89-110. 10.1146/annurev-arplant-050312-120237.
Article CAS PubMed Google Scholar
Cheung J, Estivill X, Khaja R, MacDonald JR, Lau K, Tsui L-C, Scherer SW: Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 2003, 4: R25-10.1186/gb-2003-4-4-r25.
Article PubMed Central PubMed Google Scholar
Strand T, Wang B, Meyer-Lucht Y, Höglund J: Evolutionary history of black grouse major histocompatibility complex class IIB genes revealed through single locus sequence-based genotyping. BMC Genet. 2013, 14: 29-
Article CAS PubMed Central PubMed Google Scholar
Klein J: Natural History of the Major Histocompatibility Complex. 1986, New York: Wiley
Google Scholar
Garrigan D, Hedrick PW: Perspective: detecting adaptive molecular polymorphism: lessons from the MHC. Evolution. 2003, 57: 1707-1722. 10.1111/j.0014-3820.2003.tb00580.x.
Article CAS PubMed Google Scholar
Ilmonen P, Penn DJ, Damjanovich K, Morrison L, Ghotbi L, Potts WK: Major histocompatibility complex heterozygosity reduces fitness in experimentally infected mice. Genetics. 2007, 176: 2501-2508. 10.1534/genetics.107.074815.
Article PubMed Central PubMed Google Scholar
Brouwer L, Barr I, Van De POL M, Burke T, Komdeur J, Richardson DS: MHC-dependent survival in a wild population: evidence for hidden genetic benefits gained through extra-pair fertilizations. Mol Ecol. 2010, 19: 3444-3455. 10.1111/j.1365-294X.2010.04750.x.
Article PubMed Google Scholar
Zhang X, Goodsell J, Norgren RB: Limitations of the rhesus macaque draft genome assembly and annotation. BMC Genomics. 2012, 13: 206-10.1186/1471-2164-13-206.
Article CAS PubMed Central PubMed Google Scholar
Roberts RJ, Carneiro MO, Schatz MC: The advantages of SMRT sequencing. Genome Biol. 2013, 14: 405-10.1186/gb-2013-14-6-405.
Article PubMed Google Scholar
Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B: Real-time DNA sequencing from single polymerase molecules. Science. 2009, 323: 133-138. 10.1126/science.1162986.
Article CAS PubMed Google Scholar
Travers KJ, Chin C-S, Rank DR, Eid JS, Turner SW: A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 2010, 38: e159-e159. 10.1093/nar/gkq543.
Article PubMed Central PubMed Google Scholar
Young JM, Massa HF, Hsu L, Trask BJ: Extreme variability among mammalian V1R gene families. Genome Res. 2010, 20: 10-18. 10.1101/gr.098913.109.
Article CAS PubMed Central PubMed Google Scholar
Yang H, Shi P, Zhang Y-p, Zhang J: Composition and evolution of the V2r vomeronasal receptor gene repertoire in mice and rats. Genomics. 2005, 86: 306-315. 10.1016/j.ygeno.2005.05.012.
Article CAS PubMed Google Scholar
Grus WE, Zhang J: Origin and evolution of the vertebrate vomeronasal system viewed through system-specific genes. Bioessays. 2006, 28: 709-718. 10.1002/bies.20432.
Article CAS PubMed Google Scholar
Dulac C, Axel R: A novel family of genes encoding putative pheromone receptors in mammals. Cell. 1995, 83: 195-206. 10.1016/0092-8674(95)90161-2.
Article CAS PubMed Google Scholar
Young JM, Kambere M, Trask BJ, Lane RP: Divergent V1R repertoires in five species: Amplification in rodents, decimation in primates, and a surprisingly small repertoire in dogs. Genome Res. 2005, 15: 231-240. 10.1101/gr.3339905.
Article CAS PubMed Central PubMed Google Scholar
Grus WE, Shi P, Zhang Y-p, Zhang J: Dramatic variation of the vomeronasal pheromone receptor gene repertoire among five orders of placental and marsupial mammals. Proc Natl Acad Sci USA. 2005, 102: 5767-5772. 10.1073/pnas.0501589102.
Article CAS PubMed Central PubMed Google Scholar
Shi P, Bielawski JP, Yang H, Zhang Y: Adaptive diversification of vomeronasal receptor 1 genes in rodents. J Mol Evol. 2005, 60: 566-576. 10.1007/s00239-004-0172-y.
Article CAS PubMed Google Scholar
Grus WE, Zhang J: Rapid turnover and species-specificity of vomeronasal pheromone receptor genes in mice and rats. Gene. 2004, 340: 303-312. 10.1016/j.gene.2004.07.037.
Article CAS PubMed Google Scholar
Delbarco-Trillo J, Burkert B, Goodwin T, Drea C: Night and day: the comparative study of strepsirrhine primates reveals socioecological and phylogenetic patterns in olfactory signals. J Evol Biol. 2011, 24: 82-98. 10.1111/j.1420-9101.2010.02145.x.
Article CAS PubMed Google Scholar
Irwin MT, Samonds KE, Raharison J-L, Wright PC: Lemur latrines: observations of latrine behavior in wild primates and possible ecological significance. J Mammal. 2004, 85: 420-427. 10.1644/1545-1542(2004)085<0420:LLOOLB>2.0.CO;2.
Article Google Scholar
Andrew RJ, Klopman RB: Urine washing: comparative notes. Prosimian Biology. Edited by: Martin RD, Doyle GA, Walker AC. 1974, London: Duckworth, 303-312.
Google Scholar
Yoder A, Chan L, dos Reis M, Larsen P, Campbell C, Rasolarison R, Barrett M, Roos C, Kappeler P, Bielawski J, Yang Z: Molecular evolutionary characterization of a novel V1R subfamily in strepsirrhine primates. Genome Biol Evol. 2014, 6: 213-227. 10.1093/gbe/evu006.
Article PubMed Central PubMed Google Scholar
Hohenbrink P, Radespiel U, Mundy NI: Pervasive and ongoing positive selection in the vomeronasal-1 receptor (V1R) repertoire of mouse lemurs. Mol Biol Evol. 2012, 29: 3807-3816. 10.1093/molbev/mss188.
Article CAS PubMed Google Scholar
Dulac C, Torello AT: Molecular detection of pheromone signals in mammals: from genes to behaviour. Nat Rev Neurosci. 2003, 4: 551-562. 10.1038/nrn1140.
Article CAS PubMed Google Scholar
Zhang J, Webb DM: Evolutionary deterioration of the vomeronasal pheromone transduction pathway in catarrhine primates. Proc Natl Acad Sci USA. 2003, 100: 8337-8341. 10.1073/pnas.1331721100.
Article CAS PubMed Central PubMed Google Scholar
Isogai Y, Si S, Pont-Lezica L, Tan T, Kapoor V, Murthy VN, Dulac C: Molecular organization of vomeronasal chemoreception. Nature. 2011, 478: 241-245. 10.1038/nature10437.
Article CAS PubMed Central PubMed Google Scholar
Steiger SS, Fidler AE, Valcu M, Kempenaers B: Avian olfactory receptor gene repertoires: evidence for a well-developed sense of smell in birds?. Proc R Soc Lond B Biol Sci. 2008, 275: 2309-2317. 10.1098/rspb.2008.0607.
Article CAS Google Scholar
Del Punta K, Rothman A, Rodriguez I, Mombaerts P: Sequence diversity and genomic organization of vomeronasal receptor genes in the mouse. Genome Res. 2000, 10: 1958-1967. 10.1101/gr.10.12.1958.
Article CAS PubMed Central PubMed Google Scholar
Go Y, Satta Y, Kawamoto Y, Rakotoarisoa G, Randrianjafy A, Koyama N, Hirai H: Frequent segmental sequence exchanges and rapid gene duplication characterize the MHC class I genes in lemurs. Immunogenetics. 2003, 55: 450-461. 10.1007/s00251-003-0613-6.
Article CAS PubMed Google Scholar
Liang B, Luo M, Scott-Herridge J, Semeniuk C, Mendoza M, Capina R, Sheardown B, Ji H, Kimani J, Ball BT: A comparison of parallel pyrosequencing and sanger clone-based sequencing and its impact on the characterization of the genetic diversity of HIV-1. PLoS ONE. 2011, 6: e26745-10.1371/journal.pone.0026745.
Article CAS PubMed Central PubMed Google Scholar
Hughes GM, Gang L, Murphy WJ, Higgins DG, Teeling EC: Using Illumina Next Generation Sequencing technologies to sequence multigene families in de novo species. Mol Ecol Res. 2013, 13: 510-521. 10.1111/1755-0998.12087.
Article CAS Google Scholar
Larsen P, Smith T: Application of circular consensus sequencing and network analysis to characterize the bovine IgG repertoire. BMC Immunol. 2012, 13: 52-10.1186/1471-2172-13-52.
Article CAS PubMed Central PubMed Google Scholar
Neves LG, Davis JM, Barbazuk WB, Kirst M: Whole-exome targeted sequencing of the uncharacterized pine genome. Plant J. 2013, 75: 146-156. 10.1111/tpj.12193.
Article CAS PubMed Google Scholar
Sommer S, Courtiol A, Mazzoni CJ: MHC genotyping of non-model organisms using next-generation sequencing: a new methodology to deal with artefacts and allelic dropout. BMC Genomics. 2013, 14: 1-17. 10.1186/1471-2164-14-1.
Article Google Scholar
Goecks J, Nekrutenko A, Taylor J, Team TG: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11: R86-10.1186/gb-2010-11-8-r86.
Article PubMed Central PubMed Google Scholar
Blankenberg D, Kuster GV, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: A Web-Based Genome Analysis Tool for Experimentalists. Curr Protoc Mol Biol. 2010, 19 (10): 11–19.10. 21-
Google Scholar
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J: Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005, 15: 1451-1455. 10.1101/gr.4086505.
Article CAS PubMed Central PubMed Google Scholar
Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010, 26: 2460-2461. 10.1093/bioinformatics/btq461.
Article CAS PubMed Google Scholar
Rodriguez I, Del Punta K, Rothman A, Ishii T, Mombaerts P: Multiple new and isolated families within the mouse superfamily of V1r vomeronasal receptors. Nat Neurosci. 2002, 5: 134-140. 10.1038/nn795.
Article CAS PubMed Google Scholar
Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP: MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012, 61: 539-542. 10.1093/sysbio/sys029.
Article PubMed Central PubMed Google Scholar
Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22: 2688-2690. 10.1093/bioinformatics/btl446.
Article CAS PubMed Google Scholar
Krzywinski M, Birol I, Jones SJ, Marra MA: Hive plots—rational approach to visualizing networks. Brief Bioinform. 2012, 13: 627-644. 10.1093/bib/bbr069.
Article PubMed Google Scholar
Librado P, Rozas J: DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics. 2009, 25: 1451-1452. 10.1093/bioinformatics/btp187.
Article CAS PubMed Google Scholar
Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011, 28: 2731-2739. 10.1093/molbev/msr121.
Article CAS PubMed Central PubMed Google Scholar

Download references

Acknowledgments

We greatly appreciate the assistance of the Duke GCB Genome Sequencing Shared Resource staff, especially Graham Alexander and Olivier Fedrigo. We thank the anonymous reviewers for their comments and suggestions that helped to improve this manuscript. PAL thanks the American Society of Mammalogists for financial support. This project was funded by Duke University start-up funds to ADY. This is Duke Lemur Center publication number 1266.

Author information

Amy M Heilman
Present address: Primatology Department, Max Planck Institute for Evolutionary Anthropology, Leipzig, 04103, Germany

Authors and Affiliations

Department of Biology, Duke University, Durham, NC, 27708, USA
Peter A Larsen, Amy M Heilman & Anne D Yoder

Authors

Peter A Larsen
View author publications
You can also search for this author in PubMed Google Scholar
Amy M Heilman
View author publications
You can also search for this author in PubMed Google Scholar
Anne D Yoder
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Peter A Larsen.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AMH prepared samples for sequencing and PAL analyzed the molecular data reported herein. ADY and PAL conceived of the study and drafted the manuscript. AMH assisted with project design and coordination. All authors read and approved the final manuscript.

Electronic supplementary material

12864_2013_6387_MOESM1_ESM.pdf

Additional file 1: Figure S1: Read lengths of raw PacBio CCS sequences for two SMRT cells. Pooled amplicons for V1R subfamilies I and IX were sequenced from two individual of Microcebus murinus (SMRT cells 1 [A] and 2 [B]; see Methods and Results). Figure S2. Results of de novo chimera detection analysis of CCS sequence data. Number of CCS clusters and putative chimeras for V1RI (A) and V1RIX (B) subfamily sequences from M. murinus 1 and 2, respectively. Figure S3. Neighbor-joining phylogeny of functional V1R sequences mined from the draft M. murinus genome. Highlighted clades were selected for CCS sequencing. Subfamily nomenclature follows Hohenbrink et al. [30]; I–IX. Table S1. Primer sequences used to amplify Microcebus murinus V1RI and V1RIX subfamiles. Table S2. Results of Chi-squared tests of genetic differentiation. Table S3. Intra- and inter- V1R subfamily genetic distances (percentages) for M. murinus. (PDF 1 MB)

Additional file 2: Supplementary Data.(TXT 141 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Larsen, P.A., Heilman, A.M. & Yoder, A.D. The utility of PacBio circular consensus sequencing for characterizing complex gene families in non-model organisms. BMC Genomics 15, 720 (2014). https://doi.org/10.1186/1471-2164-15-720

Download citation

Received: 09 October 2013
Accepted: 08 August 2014
Published: 26 August 2014
DOI: https://doi.org/10.1186/1471-2164-15-720

The utility of PacBio circular consensus sequencing for characterizing complex gene families in non-model organisms

Abstract

Background

Results

Conclusions

Background

Results

CCS quality and clustering

Comparisons between CCS and draft genome V1R sequences

M. murinusV1R repertoire comparisons

Discussion

Insights into the Microcebus murinusV1R repertoire

Utility of CCS for gene family characterization and discovery in non-model species

Conclusions

Methods

Molecular methods

Quality filtering and sequence clustering

Phylogenetic and statistical analyses

Availability of supporting data

Abbreviations

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us