Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales
© Bi et al.; licensee BioMed Central Ltd. 2012
Received: 14 May 2012
Accepted: 28 June 2012
Published: 17 August 2012
To date, exon capture has largely been restricted to species with fully sequenced genomes, which has precluded its application to lineages that lack high quality genomic resources. We developed a novel strategy for designing array-based exon capture in chipmunks (Tamias) based on de novo transcriptome assemblies. We evaluated the performance of our approach across specimens from four chipmunk species.
We selectively targeted 11,975 exons (~4 Mb) on custom capture arrays, and enriched over 99% of the targets in all libraries. The percentage of aligned reads was highly consistent (24.4-29.1%) across all specimens, including in multiplexing up to 20 barcoded individuals on a single array. Base coverage among specimens and within targets in each species library was uniform, and the performance of targets among independent exon captures was highly reproducible. There was no decrease in coverage among chipmunk species, which showed up to 1.5% sequence divergence in coding regions. We did observe a decline in capture performance of a subset of targets designed from a much more divergent ground squirrel genome (30 My), however, over 90% of the targets were also recovered. Final assemblies yielded over ten thousand orthologous loci (~3.6 Mb) with thousands of fixed and polymorphic SNPs among species identified.
Our study demonstrates the potential of a transcriptome-enabled, multiplexed, exon capture method to create thousands of informative markers for population genomic and phylogenetic studies in non-model species across the tree of life.
KeywordsMicroarray-based exon capture Phylogenetics Population genomics SNP identification Tamias Target enrichment
High-throughput, next generation sequencing (NGS) technologies and associated bioinformatics tools have fundamentally changed the scale at which DNA sequence data can be gathered and analyzed . NGS allows for a massive amount of sequence data to be affordably and quickly obtained. In principle, these approaches can be implemented without prior genomic knowledge of the focus species, thus offering tremendous potential for addressing various novel and long-standing evolutionary questions previously hampered by technology and cost .
NGS allows researchers to investigate genome-wide molecular, structural, and regulatory mechanisms underlying adaptation, diversification, and speciation . NGS also enables comparative genome scans for polymorphism which can then be used to infer demography and selection . Molecular phylogenetics also benefits from the increasing accessibility of NGS. Large-scale, multi-locus data (i.e., hundreds to thousands of loci) combined with improved analytical tools for inferring gene trees, provides unprecedented opportunities for resolving species phylogenies . Toward this end, a core challenge of population genomic and phylogenetic studies is obtaining a reliable set of orthologous loci from a sufficient number of individuals across populations or species spanning a range of divergences . Even though the cost of NGS continues to fall, most evolutionary labs cannot sequence whole genomes or a large portion of genomic regions from samples spanning divergent clades. Moreover, whole genome data simply is not necessary to answer many research questions. In this context, genome partitioning and targeted re-sequencing of a consistent subset of genomic regions will remain the most cost-effective and analytically straightforward approach for most evolutionary applications. Genome partitioning with targeted DNA capture allows for the selective NGS of thousands of genomic regions , facilitating rapid assays of genetic variation. Compared to partitioning methods that search for anonymous markers (i.e. restriction site associated DNA tags, or RADtags , DNA capture is expected to be more efficient for finding orthologous markers among divergent genomes [6, 9, 10]. When applied to exonic regions, DNA capture can also provide information on gene function and evolution. Exon capture involves the hybridization of genomic libraries to short oligonucleotide baits complementary to complete or partial exomes printed on a microarray  or attached to magnetic beads in solution . The captured exon-containing DNA fragments of individual or pooled genomic libraries are then eluted from the array and the target-enriched elute is sequenced using an NGS platform. To date, the design of exon capture relies heavily on existing high quality genomic resources (e.g. ). However, the genomes of most organisms of ecological and evolutionary interest are yet to be sequenced, which has largely impeded the expansion of DNA capture across the tree of life.
Results and discussion
Tamias alpinus transcriptome
The T. alpinus multi-tissue cDNA library generated 30,233,530 raw sequence reads with a total length of 3.02 Gb (NCBI SRA ID: SRA053502). After filtering, the total length of cleaned reads was 2.69 Gb. The de novo assembly information is summarized in Additional file 2. Instead of subjectively selecting a single kmer size for transcriptome assembly, we explored multiple-kmer assemblies followed by merging raw assemblies and removing redundancies, as recommended by previous studies [15–17]. Previous work has demonstrated that a multiple k-mer strategy increases both contig length and transcript diversity when compared to a single k-mer assembly method . The mean contig length among the 20 raw assemblies generated by ABySS ranged from 771 to 1,074 bp with an overall average of 902 bp. The final merged consensus assemblies contained 37,563 contigs (36.5 Mb) with a mean length of 972 bp. Among these contigs, 21,994 (28.1 Mb) showed strong orthology (BLASTX E-values < 1e-10) with known proteins from the combined human (Homo sapiens), mouse, rat and thirteen-lined ground squirrel protein dataset. The annotated transcripts matched 11,320 (49%) of the unique protein-coding genes in the Ensembl mouse protein database. Among these contigs, 55.5% covered the full coding sequences of a gene including partial sequences of 5’ and 3’ untranslated regions (UTRs), 26.2% partially covered coding sequences with partial 3’ UTR, 8.1% covered partial coding sequences with partial 5’ UTR, and 10.2% only contained partial coding sequences. As expected, the annotated contigs are 3’ UTR biased because the mRNA were enriched through oligo-dT selection . The mean length of the annotated transcripts was 1,297 bp and the average base coverage was 54X (median 11X). There was a positive relationship between transcript length and the number of reads mapped, but we found no strong correlation between coverage and the length of contigs ( Additional file 3: Figure S1). Errors during the transcriptome assembly appear to be trivial. The percentage of open reading frames (ORFs) that contained premature stop codons in the annotated contigs, which could be derived either from pseudogenes or by assembly errors, was only 1%. We also found that 1.2% of the annotated contigs comprised more than one distinct transcript, which were spuriously combined during the assembly. The array probes were designed according to the exons identified by comparing mouse protein to mouse genomic DNA, not the entire transcriptome. Thus, the low level of chimeria present in our dataset is unlikely to introduce errors during the array design. We were able to identify 127,456 putative exons across 21,262 annotated transcripts.
Exon capture array design
We used all identified exons that were longer than 200 bp (average size 332 bp) in length to design tiling probes for the capture arrays. Note that shorter targets (≥ 60 bp) are possible, but targeted capture performance is expected to decline for relatively short regions. In addition to the T. alpinus exons, we also targeted regions from the Tamias mitochondrial genome, 162 anonymous I. tridecemlineatus genomic intervals, the Y-linked SRY gene, and seven previously sequenced nuclear genes (See M&M for details). After filtering probes that likely contained repetitive elements, 962,438 probes were synthesized onto the Agilent 1M arrays.
Probe design is critical for successfully enriching orthologous loci with variable rates of evolution from cross-species DNA hybridization. Ideally, probes should reflect the full range of evolutionary rates of loci. While highly conserved genes are expected to provide more efficient capture across species, they will often be less useful for resolving phylogenetic relationships. At the other extreme, rapidly evolving loci will reduce the efficiency of cross-species hybridization. Furthermore, they might tend to reflect signatures of natural selection to a greater extent, and are thus more likely to introduce bias into the inference of phylogenetic relationships among species. Our goal was to design an array that targeted the full genomic range of evolutionary rates in an unbiased manner. We compared T. alpinus annotated transcripts against the transcriptome of Belding’s ground squirrel (Urocitellus beldingi), which was sequenced and assembled with the same method described above (data not shown). We found that the average sequence divergence between all identified orthologous regions from the two species was 5.2 ± 3.4%. The divergence between the targeted T. alpinus exons and the corresponding U.beldingi orthologs was 5.0 ± 2.9%, and followed a distribution similar to that of the overall divergence between the two species ( Additional file 3: Figure S2). These results indicate that our array design is unlikely to be biased towards highly conserved genic regions.
Exon capture data filtration and assemblies
The libraries for all individuals within a single species were pooled into a species library. Four exon capture arrays were used to capture each of the species libraries. After capture, the T. alpinus library (n=20) was sequenced on one lane, and the T. amoenus (n=4), T. ruficaudus (n=4), and T. striatus (n=3) libraries were equally combined and sequenced on two lanes. Summary information for the de-multiplexed libraries is shown in Additional file 4. In total, 11.2 and 29.1 Gb of raw sequence data were respectively obtained for the T. alpinus (SRA053501) species library and for the combined library of T. amoenus (SRA053503), T. ruficaudus (SRA053504), and T. striatus (SRA053505) species libraries. The sequence quality among individual libraries on each lane was consistent. For example, across the 20 T. alpinus libraries, the overall mean base quality score was 33.7 ± 0.1, while 84.7 ± 0.3% of the bases in the raw sequence reads had quality scores greater than 30.
The results for data filtration are shown in Additional file 5. The raw reads of each individual library were filtered to remove exact duplicates, adapters, bacteria and human contamination, as well as low complexity and quality reads. On average, 90.7% of the raw T. alpinus reads, and 71.2-72.6% of the raw reads for the other Tamias species passed all of these filters. Compared to the percentage of read duplicates present in T. alpinus libraries (1.5%), there was a substantial portion of raw reads identified as duplicates in T. amoenus (20.7%), T. ruficaudus (19.8%), and T. striatus (20%). Increased data yield makes repeated sequencing of the same molecule, and thus duplicates, more likely in PCR amplified libraries . These results indicate that additional sequencing of the captured T. alpinus libraries would result in higher unique coverage, while additional sequencing effort in the other species would present increasingly diminishing returns. Duplicate reads were removed in all subsequent analyses to avoid inappropriate pseudo-replication.
The cleaned reads of each species library assembled by ABySS and SOAPdenovo produced 24 and 25 raw assemblies, respectively ( Additional file 6). The mean and median lengths, and N50 of assemblies generated by various combinations of k-mer and k-cov were similar among species ( Additional file 3: Figure S3). The raw assemblies were merged by species to produce consensus assemblies, which were then compared to the T. alpinus exons and other targeted regions to identify the portion of consensus assemblies that are associated with various targets (in-target assemblies). As expected, the length of a contig in the assemblies was much greater than the length of the corresponding targeted T. alpinus exon ( Additional file 7), illustrating the potential for exon capture to obtain sequence information at flanking intronic regions even when probes are restricted to exons.
Specificity levels are lower in our study compared to most studies in humans (50-75%, [7, 21]). This decline likely has two principle causes. First, amplified, multiplexed libraries tend to show much lower specificity . We see the same range of specificity across differing levels of multiplexing (Figure 2), indicating that this is not an issue of multiplexing per se, but rather specific to enriching amplified libraries with long complementary adapters. The use of a different barcoding system that enables pooling and hybridization prior to amplification would likely help reduce these effects. A second and likely larger issue is the fundamental limitation of not having an available reference genome. We calculated specificity following previous studies as the proportion of reads that map to the genome that fell within target regions. One important caveat in our data is that we observe a much larger proportion of un-mapped reads. We suspect that the observed low specificity is a reflection of reduced mapping efficiency in the absence of a complete reference genome. For example, among the reads mapped to assemblies corresponding to T. alpinus target exons, only ~4% of them were exclusively within flanking regions, compared to 32.6% in a recent human study . Although high specificity is often obtained in single sample capture experiments, this approach is simply not cost-effective for population-level sampling  and the low overall specificity is more than compensated for by the ability to capture many individuals in parallel.
As expected, average coverage for the mitochondrial genome was at least one order of magnitude greater than for nuclear exons due to the higher per cell copy number of mtDNA. We also observed greater variance associated with sequence coverage among samples for the mitochondrial genome, which could reflect a difference in the amount of mitochondrial template present in each DNA sample ( Additional file 3: Figure S4; Additional file 5). Nevertheless, the distribution of coverage for aligned reads (valleys and peaks) along the entire mitochondrial genome showed great concordance across all four species, indicating that the performance of each capture experiment is consistent.
The comparison of sequence coverage among intended targets within each library was influenced by the base composition of the targets. We found that targets with exceptionally high or low G/C ratios led to low coverage ( Additional file 3: Figure S5), which may be due to poor annealing and secondary structure formation during the hybridization . We found a strong correlation in terms of capture efficiency between the same targets among species libraries ( Additional file 3: Figure S6), as well as among samples within each species (data not shown), negating the existence of a species-specific capture bias. These results suggest that the performance of each target in different capture experiments is highly reproducible.
Estimate of empirical error rates
Two important sources of errors in NGS analyses are sequencing (or base calling) error and contamination. Sequencing error, particularly on the Illumina platform, can be as high as ~1% . Multiplexing samples can lead to cross sequence contamination caused by DNA contamination among specimens during lab work, swapping sample-specific barcodes during library preparation or bulk amplification of libraries (i.e., recombinant PCR), and/or mis-assigning the reads during the library de-multiplexing. Potential errors can also be derived from de novo assembly and alignment. These sources of errors can introduce bias into the variant calling based on exon capture data and result in false SNPs and hence, unreliable genotypes.
To address these related issues of exon capture and NGS sequencing, we targeted the complete mitochondrial genome, putative X-linked genes, and the Y-linked SRY gene on our capture arrays, while using both male and female specimens in the pooled genomic libraries. For males, these loci are haploid, and therefore, all variation within individuals should stem from sequencing or assembly errors. The Spermophilus SRY gene was captured in all male Tamias samples and was absent in all females except for one T. striatus female that had 6 reads that mapped to SRY. We identified a set of 26 putatively X-linked genes in T. alpinus for which we calculated overall error rate (the number of mismatched bases over the total number of aligned bases) in males to be 0.037%. The error rate for the T. alpinus mitochondrial genome was 0.045%, which is consistent with the X-linked loci. The results show that the empirical error rate for cleaned reads was comparable to that observed in DNA libraries of other species sequenced using the same platform (Singhal et al. unpublished), and corresponds to a Phred quality score between 33.5-34.3. We also calculated the average un-calibrated quality score of cleaned reads to be 36.3, indicating an error rate of 0.023%. This result suggests that the un-calibrated quality scores may under-estimate the true error rates of the sequence reads, which supports the notion that error rates estimated solely based on raw quality scores produced by NGS base-calling algorithms can be inaccurate . The empirical error rate can be used to recalibrate raw quality scores and improve the accuracy of genotype calling . Since genotype calling is not a focus of the present paper, we did not perform quality score re-calibration but related information can be found in  and .
Divergence vs. capture efficiency
One hundred and sixty-two 1 kb genomic intervals of I. tridecemlineatus were targeted on the arrays. The sensitivity of this set of divergent targets dropped to ~90% for all Tamias species libraries, compared with a sensitivity of 99% when using T. alpinus exons as probes for target enrichment. Owing to the fact that we had 20 T. alpinus libraries sequenced on the same Illumina lane while the other 11 Tamias libraries were sequenced on two other lanes, the total data yield of each T. alpinus library was only about 1/4 that of other Tamias libraries. To eliminate this bias we randomly sub-sampled 5 million reads from each individual library and aligned them to the assemblies associated with targeted T. alpinus exons and the I. tridecemlineatus genomic intervals, respectively. The normalized average coverage for T. alpinus libraries was 10X. The average sequence divergence between the targeted T. alpinus exons and the other two western chipmunks, T. amoenus and T. ruficaudus, was 0.58%. The normalized coverage for T. amoenus and T. ruficaudus (12X) was slightly higher but still consistent with that of T. alpinus. The eastern chipmunk, T. striatus, has a slightly more divergent genome (1.5%) from T. alpinus yet the coverage for the two species was nearly identical ( Additional file 3: Figure S7). Vallender  applied human exome capture to non-human primates and suggested that the divergence cutoff for unbiased capture is around 4%. Our study clearly demonstrates that there is no decrease in capture efficiency within 1.5% divergence (such is the level of T. striatus to T. alpinus). According to these results, it is reasonable to postulate that unbiased sequence coverage could be obtained for all 23 chipmunk species across the western Tamias clade using the array designed from a single species transcriptome.
Orthologous markers and detection of candidate SNPs
Summary of identified SNPs in orthologous loci
Our goal for this study was to develop genomic resources for the application of NGS to a lineage lacking reference genomes, develop targets for exon capture, and test the performance of these new resources across phylogenetically divergent species. In summary, our strategy as it pertains to our study species, involved generating a de novo transcriptome for the chipmunk (T. alpinus), designing an array based on identified exons, pooling individually barcoded genomic libraries of a few moderately divergent species (T. alpinus, T. amoenus, T. ruficaudus and T. striatus) before exon capture, sequencing using multiplexed NGS, and applying bioinformatics approaches to analyze exon capture data without a reference genome.
The strategy developed here proved successful and will benefit future studies on multiple fronts: 1) Our capture array design is only based on transcriptome sequences from one individual specimen. Compared to whole genome sequencing, this is a much faster and economically viable approach for generating tens of thousands of genetic markers; 2) Multiplexing prior to capture is key to cost-effective population-level sequencing. The level of specificity, sensitivity, and read coverage of target regions is highly consistent among samples, indicating that pooling indexed libraries before exon-capture is a highly effective method; 3) The performance of each target is highly reproducible among independent capture experiments and the specificity is independent of the number of libraries that can be multiplexed on an array; 4) Base coverage is fairly even within exons, except at the edges. This edge effect could be further mitigated by extending the range of high density tiling baits at the ends of exonic regions, or by subsequent extension of probes into intronic regions of the assembled contigs recovered by initial captures; 5) Without having a prior reference genome, de novo assemblies of the captured reads can effectively provide a reference for mapping sequence reads, which allows for variant calling using existing tools; 6) Coverage does not decrease for species within 1.5% divergence in coding regions, but significantly declines when using more divergent targets (>8.7%) for array design. However, even then, a sensitivity of ~90% can be expected based on our results; 7) Over ten thousand orthologous loci and thousands of high quality candidate SNPs were identified that can be used for future population genomic and phylogenetic inference.
There are now diverse DNA capture methods available for non-model organisms, including PCR-generated bait capture to enrich specific loci from tens of samples [26, 27] and DNA-hybridization based capture to target hundreds of loci from hundreds of samples. We suggest that studies should select the methods that best suit the questions at hand. For example, approaches that use ultraconserved element anchors (UCEs) [9, 28] and anchored enrichment  have demonstrated great utility for tackling deeper phylogenetic nodes. Our study showcases the potential of a de novo transcriptome-enabled, multiplexed, exon capture method for sequencing thousands of orthologous loci over a vast number of non-model samples. Our approach can target exons spanning a wide range of evolutionary rates, thus it can be applied to population genomics for detecting selection and demography, species delimitation, and resolving phylogenies at low-moderate phylogenetic distance. Overall, adopting this cost-effective approach and associated analytical methods by ecological and evolutionary labs will expand the realm of possibility for addressing various evolutionary questions, ranging from populations to large assemblages of related species.
Three species of chipmunks from the western Tamias clade, T. alpinus (n=20), T. amoenus (n=4) and T. ruficaudus (n=4), and an eastern chipmunk, T. striatus (n=3), were collected for sequencing (Figure 1; Additional file 8). Tissues for DNA extraction were preserved in 90% ethanol or frozen. The specimen used for transcriptome sequencing was a male T. alpinus (MVZ224483) collected from Bullfrog Lake, Kings Canyon National Park, California in November 2009. Liver, kidney, spleen, and heart tissues were preserved in RNAlater immediately following euthanasia and then archived at −80°C.
We used multiple tissues for RNA preparation in order to retrieve as many expressed transcripts as possible. Total RNA was extracted from each tissue using Qiagen RNeasy kits. The quantity and quality of total RNA was assayed using an Agilent Bioanalyzer 2100 and Nanodrop. Only RNA with a total amount ranging from 5–10 μg and an RNA integrity number (RIN) greater than 8 was used. Poly (A+) RNA was isolated from the total RNA by two cycles of purification using Sera-Mag® Magnetic Oligo(dT) magnetic beads. Resulting mRNA was quality verified as above and combined in equal molar ratios across tissues. Synthesis of cDNA and library preparation was performed as outlined in the Illumina mRNA sequencing sample preparation guide (rev. D) with slight modifications: we used 70°C instead of the recommended 94°C in order to avoid over-fragmentation of the mRNA and we size-selected the final library from 350–400 bp using an agarose gel extraction. We then sequenced the pooled library using one lane of 100 bp Illumina (GAIIx) paired-end sequencing at the QB3 research facilities at the University of California, Berkeley.
In order to clean the raw sequence reads, we first removed identical forward and reverse reads. Duplicated reads are removed to avoid inflating coverage estimates and to decrease the computational burden for de novo assembly. We further trimmed unique reads by removing adapters, low complexity, and low quality (Q-score < 20) sequences using Blat , Trimmomatic (http://www.usadellab.org/cms/index.php?page=trimmomatic) and custom Perl scripts. We identified and removed any reads derived from contaminants by screening against the human and Escherichia coli genomes using Bowtie .
De novo assembly, annotation, and exon identification
Cleaned reads were de novo assembled using ABySS  on the Texas Advanced Computing Center Ranger cluster (http://www.tacc.utexas.edu/). We generated individual assemblies under a wide range of k-mers (21, 31, 41, 51, and 61) and coverage values (2, 3, 5, and 10), obtained 20 raw assemblies, and then used cd-hit-est (http://weizhong-lab.ucsd.edu/cd-hit/), Blat, and CAP3 (http://seq.cs.iastate.edu/) to cluster and merge all raw assemblies into final, less-redundant assemblies.
The resulting T. alpinus transcriptome was then annotated using the BLASTX algorithm  and a database of human, mouse, rat, and thirteen-lined ground squirrel proteins with a minimum E-value of 1e-10. The exon-intron structure of each annotated T. alpinus transcript was identified using a 3-step strategy: i) BLASTX was used to search for correspondence between T. alpinus annotated transcripts and mouse proteins; ii) matched mouse proteins from step i were compared to mouse genome nucleotide sequence using exonerate (http://www.genome.iastate.edu/bioinfo/resources/manuals/exonerate/exonerate.man.html) -protein2genome in order to locate the exon-intron boundaries of each protein, and to determine the nucleotide sequence of each exon; iii) Exonerate -est2genome was then used to compare the sequences of mouse exons to each T. alpinus transcript to identify exon boundary coordinates.
We used the Agilent SureSelect custom 1M-feature microarrays to target 11,975 nuclear exons (200 bp or greater) identified from the annotated T. alpinus transcriptome. These exons were derived from 6,249 annotated proteins with an overall target size of around 4 Mb. Probes were designed using customized scripts originally developed by Hernán Burbano and following the recommendations of Hodges et al. (; see also ). Briefly, 60 bp probes were tiled at 4 bp intervals across individual exon targets. Typically in exome capture, probes are designed to extend beyond exon-intron boundaries to promote uniform read depth at both ends of the targets . However, because we did not have genomic sequence for the introns, we used a 1 bp tiling strategy near exon ends to mitigate the effect of reduced coverage at the exon edge.
In addition to targeting the T. alpinus transcriptome, we targeted probes from four other sources. First, we used 2 bp tiling probes to capture 162 1 kb anonymous genomic intervals from the thirteen-lined ground squirrel (I. tridecemlineatus) draft genome (2X, Sanger sequencing). These regions were selected to test the robustness of probes designed from a divergent reference. Second, we used 764 20 bp tiling probes spanning non-repetitive regions of the ~16 kb Tamias complete mitochondrial genome, which enabled us to determine sequence contamination and empirical sequencing error rate. Third, we used 1 bp tiling probes targeting 350 bp of the consensus Spermophilus (S. fulvus S. major S. pygmaeus) SRY gene. This Y-linked locus can be used as an additional control for cross-sample contamination in females. Finally, we targeted seven nuclear genes previously sequenced in Tamias (6,031 bp total), with 4 bp tiling: acrosin (1,558 bp), acp5 (361 bp), cmyc (896 bp), rag-1 (764 bp), anon (720 bp), zan (853 bp), and zp2 (879 bp). These genes represent the entirety of genomic data available in this group prior to our project [13, 14] and were used as positive controls in a post-capture qPCR assay to determine the initial enrichment quality.
In order to avoid nonspecific hybridization of genomic DNA to the capture arrays, we employed a soft-masking approach  to exclude probes that likely contain highly repetitive elements. Specifically, we combined the T. alpinus transcriptome and the I. tridecemlineatus genome to create a single genome set as a reference. We then calculated the frequency of all 15-mers on both strands of the reference and excluded probes containing 15-mers found at a frequency of 50 or greater. This cut-off is arbitrary and intentionally more stringent than commonly used given the inherent limitations expected when relying upon a divergent and poor quality reference as the basis for soft-masking.
Genomic DNA library preparation and multiplexing
Genomic DNA was extracted from liver or muscle tissues of T. alpinus T. amoenus T. ruficaudus, and T. striatus specimens using Qiagen DNeasy Blood & Tissue kits. DNA was sheared using a Bioruptor with 4–6 rounds of sonication (7 minutes per round on high, 30s on 30s off). The ideal DNA fragment size distribution should range from approximately 100–500 bp, with a mean of 200–250 bp. Individual genomic libraries were prepared following the protocol outlined by  with slight modifications. This protocol describes a fast, simple, and cost-effective method for preparing indexed genomic libraries for Illumina. Briefly, universal adapters are ligated to each library, and then the libraries are indexed via PCR using 7-nt barcoded primers prior to exon capture. Each PCR indexed library was pooled by species in equimolar ratios to obtain 20 μg total DNA for each hybridization experiment.
We used one microarray for each pooled set of species-specific libraries. For example, 20 pooled T. alpinus libraries were hybridized on one array. We followed the detailed procedure for array-based exon capture that is described in  with slight modifications by . Each genomic library was denatured in a solution containing excess blocking oligos and Cot-1 DNA. Tamias Cot-1 DNA was isolated following .
The pooled libraries were then hybridized in a SciGene 777 microarray hybridization oven at 65°C for 65 hours, rinsed, and eluted at 95°C. The eluted DNA was then PCR amplified for 14 cycles, reflecting the number of cycles expected to produce the optimal yield as empirically estimated from the starting template concentration. This precaution was taken to ensure that we did not over-amplify libraries and increase the likelihood of barcode swapping between fragments and the formation of hetero-duplexes . We verified target enrichment with qPCR analysis of amplified elute, following . Three target and non-target regions were evaluated independently.
We then diluted each library to a final concentration of 10 nM in 10 μl. Sequencing was performed using the Illumina HiSeq 2000 platform (100 bp paired end) provided by the QB3 at UC Berkeley. The library consisting of 20 T. alpinus was sequenced on one lane, while the 4 T. amoenus, 4 T. ruficaudus, and 3 T. striatus libraries were equally combined and sequenced on two lanes.
Genomic sequence data cleanup and de novo assemblies
Sequences were assigned to individuals based on their unique barcodes and filtered as described above. ABySS and SOAPdenovo  were used to assemble cleaned reads of each of the four species libraries. For example, 20 T. alpinus libraries were assembled together to make species-specific consensus assemblies. Assembly using multiple kmer sizes was performed. Twenty raw assemblies were obtained with ABySS by setting the k-mers to 21, 31, 41, 51, and 61, and k-cov to 6, 10, 15, and 20. SOAPdenovo did not allow adjustment of k-cov so we only set the k-mers to 21, 31, 41, 51, and 61 (5 raw assemblies). Twenty-five raw assemblies from each species library were further merged using the same methods described above. This produced one consensus assembly for each of the four species.
Evaluation of capture efficiency
The following parameters were used to evaluate the performance of the exon capture results: i) sensitivity, or the percentage of the intended targets covered by sequence reads; ii) specificity, or the percentage of sequence reads aligned to selected targets; iii) sequence or base coverage, or the number of reads aligned to a base within a target sequence or a particular base; iv) uniformity, or the variance of the average read depth among different targets and within targets; and v) reproducibility, or the consistency of results obtained from replicated capture experiments.
We used reciprocal BLAST to associate contigs in the consensus assemblies with the target exons and estimate the exon capture sensitivity. To improve mapping efficiency, we did not use the T. alpinus transcriptome as a reference for alignment. Instead, we used the contigs in the consensus assemblies that were best matches to targets, or in-target assemblies, as our reference. These contigs were annotated for exon-intron structure. The sequence reads of each individual library were then mapped to its own species consensus assemblies as well as in-target assemblies using Novoalign (http://www.novocraft.com). The exon capture specificity was calculated by dividing the number of reads aligned to targets by the total number of reads aligned to the consensus assemblies. Only reads that mapped to unique locations in the in-target assemblies were used to calculate specificity. Novoalign output was parsed using SAMtools , and the ‘mpileup’ function in SAMtools was used to estimate coverage. Uniformity was measured by comparing average sequence coverage among targets, as well as the average base coverage within each target. The level of uniformity could also be estimated by comparing enrichment efficiency for the same targets across different species libraries. Since each species library was captured on independent arrays, the level of uniformity is reflective of the reproducibility. If the sequence coverage of a given target is correlated among libraries and capture arrays then we believe that the enrichment is uniform and reproducible across regions.
Estimating empirical error rates
We compared each of the captured Tamias mitochondrial consensus genomes to the mouse complete mitochondrial genome obtained from NCBI (JQ003190.1), and arranged the sequences in the order that genes/regions occur in the mitochondrial genome. We aligned reads to the captured mitochondrial consensus, parsed the alignments and identified the mismatches. The error rates were calculated by dividing the observed total number of mismatches by the total number of aligned bases. Likewise, we also determined the empirical level of errors for putatively X-linked genes in the libraries of Tamias males, for which we should observe no polymorphism.
Putatively X-linked genes were identified by: i) using the Ensembl BioMart tools to search non-redundant, conserved orthologs on X chromosomes of human, dog (Canis familiaris), rat and mouse; ii) these conserved orthologs were searched against T.alpinus exons with BLAST; iii) matched exons with female to male average sequence coverage ratios ranging from 1.9-2.1 were selected as putatively X-linked. The level of errors was also assessed based on the proportion of reads mapped to the Y-linked SRY gene in females.
Capture efficiency estimate of array designed from a divergent genome
We anonymously selected 162 genomic intervals of the thirteen-lined ground squirrel (I. tridecemlineatus) to hybridize with genomes of the four Tamias species. The methods used for searching in-target assemblies, and measuring sensitivity and sequence coverage of the targeted regions were identical to the ones described above. Sequence divergence was calculated by dividing the number of observed mismatches by the number of alignable bases in the alignment between in-target contigs with the corresponding I. tridecemlineatus genomic intervals. The divergence between the targeted T. alpinus exons and in-target contigs of each species was measured using the alignment mismatches within exonic regions. We focused on the relationship between average sequence coverage and sequence divergence in different chipmunk species libraries when using I. tridecemlineatus (divergent) and T. alpinus (closely related) genomes to design arrays. All analyses and graphing was conducted in R .
Identification of orthologous loci and preliminary SNP detection
We used reciprocal BLAST searches to compare in-target assemblies of each species to search for orthologous markers that were enriched among the four species. We aligned the reads of each species to the same reference (in-target assemblies of T. alpinus) to estimate the level of genetic variability of these markers. We did not align reads directly to the T. alpinus transcriptome in order to maximize mapping efficiency. Furthermore, mapping reads to species-specific de novo assemblies not only helps identify SNPs located within exons, but also identifies those located in flanking intronic and UTR regions. Candidate homozygous SNPs (fixed differences between species) and heterozygous SNPs (polymorphic within species) were determined from the reference sequences. Variant calling was conducted using SAMtools and SNPs were filtered via samtools.pl (varFilter -d 20 -D 5000). Reads that aligned to multiple reference regions were discarded, and sites containing mismatches were kept only if the coverage was 20X or greater. We also retained only high quality variable sites by filtering out sites with a Phred-scaled quality score of less than 30. The types of SNPs (synonymous, non-synonymous, intronic, 5’ and 3’ UTRs) were determined by comparison with the combined human, mouse, rat and thirteen-lined ground squirrel protein reading frames using bl2seq and BLASTX.
The authors would like to thank Chris Conroy, John Demboski, Hopi Hoekstra, Eileen Lacey, Hillery Metz, James Patton, Karen Rowe, Kevin Rowe and Jack Sullivan for providing specimens, Hernán Burbano for sharing scripts for array design, and Brice Sarver for providing Tamias mitochondrial sequence. We also thank the Texas Advanced Computing Center (TACC) at the University of Texas at Austin for providing computational support. We would also like to acknowledge the Moritz lab group for their insightful comments on this manuscript. This work was supported by an NSERC postdoctoral fellowship (KB), University of Montana start-up funds (JG), and University of California Berkeley VCR-BiGCB and the Gordon and Betty Moore Foundation (CM).
- Shendure J, Hanlee J: Next-generation DNA sequencing. Nat Biotech. 2008, 26: 1135-1145. 10.1038/nbt1486.View ArticleGoogle Scholar
- Good JM: Reduced representation methods for subgenomic enrichment and next-generation sequencing. Methods Mol Biol. 2011, 772: 85-103.View ArticlePubMedGoogle Scholar
- Gilad Y, Pritchard JK, Thornton K: Characterizing natural variation using next- generation sequencing technologies. Trends Genet. 2009, 25: 463-471. 10.1016/j.tig.2009.09.003.PubMed CentralView ArticlePubMedGoogle Scholar
- Yi X, Liang Y, Huerta-Sanchez E, Jin X, Cuo ZX, Pool JE, Xu X, Jiang H, Vinckenbosch N, Korneliussen TS, Zheng H, Liu T, He W, Li K, Luo R, Nie X, Wu H, Zhao M, Cao H, Zou J, Shan Y, Li S, Yang Q, Asan , Ni P, Tian G, Xu J, Liu X, Jiang T, Wu R, Zhou G, Tang M, Qin J, Wang T, Feng S, Li G, Huasang , Luosang J, Wang W, Chen F, Wang Y, Zheng X, Li Z, Bianba Z, Yang G, Wang X, Tang S, Gao G, Chen Y, Luo Z, Gusang L, Cao Z, Zhang Q, Ouyang W, Ren X, Liang H, Zheng H, Huang Y, Li J, Bolund L, Kristiansen K, Li Y, Zhang Y, Zhang X, Li R, Li S, Yang H, Nielsen R, Wang J, Wang J: Sequencing of 50 human exomes reveals adaptation to high altitude. Science. 2010, 329: 75-78. 10.1126/science.1190371.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith SA, Wilson NG, Goetz FE, Feehery C, Andrade SCS, Rouse GW, Giribet G, Dunn CW: Resolving the evolutionary relationships of molluscs with phylogenomic tools. Nature. 2011, 480: 364-367. 10.1038/nature10526.View ArticlePubMedGoogle Scholar
- Lemmon A, Emme S, Lemmon E: Anchored hybrid enrichment for massively high-throughput phylogenomics. Syst Biol. 2012, 10.1093/sysbio/sys049.Google Scholar
- Hodges E, Rooks M, Xuan Z, Bhattacharjee A, Benjamin Gordon D, Brizuela L, Richard McCombie W, Hannon GJ: Hybrid selection of discrete genomic intervals on custom-designed microarrays for massively parallel sequencing. Nat Protoc. 2009, 4: 960-974. 10.1038/nprot.2009.68.PubMed CentralView ArticlePubMedGoogle Scholar
- Hohenlohe PA, Amish SJ, Catchen JM, Allendorf FW, Luikart G: Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout. Mol Ecol Res. 2011, 11: 117-122.View ArticleGoogle Scholar
- Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC: Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Syst Biol. 2012, 10.1093/sysbio/sys004.Google Scholar
- Cosart T, Beja-Pereira A, Chen S, Ng SB, Shendure J, Luikart G: Exome-wide DNA capture and next generation sequencing in domestic and wild species. BMC Genomics. 2011, 12: 347-10.1186/1471-2164-12-347.PubMed CentralView ArticlePubMedGoogle Scholar
- Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009, 27: 182-189. 10.1038/nbt.1523.PubMed CentralView ArticlePubMedGoogle Scholar
- George RD, McVicker G, Diederich R, Ng SB, MacKenzie AP, Swanson WJ, Shendure J, Thomas JH: Trans genomic capture and sequencing of primate exomes reveals new targets of positive selection. Genome Res. 2011, 21: 1686-1694. 10.1101/gr.121327.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Good JM, Hird S, Reid N, Demboski JR, Steppan SJ, Martin-Nims TR, Sullivan J: Ancient hybridization and mitochondrial capture between two species of chipmunks. Mol Ecol. 2008, 17: 1313-1327. 10.1111/j.1365-294X.2007.03640.x.View ArticlePubMedGoogle Scholar
- Reid N, Demboski JR, Sullivan J: Phylogeny estimation of the radiation western American chipmunk (Tamias) in the face of introgression using reproductive protein genes. Syst Biol. 2012, 61: 44-62. 10.1093/sysbio/syr094.PubMed CentralView ArticlePubMedGoogle Scholar
- Surget-Groba Y, Montoya-Burgos JI: Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res. 2010, 20: 1432-1440. 10.1101/gr.103846.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu AL, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJ, Hoodless PA, Birol I: De novo assembly and analysis of RNA-seq data. Nat Methods. 2010, 7: 909-912. 10.1038/nmeth.1517.View ArticlePubMedGoogle Scholar
- Zhao QY, Wang Y, Kong YM, Luo D, Li X, Hao P: Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study. BMC Bioinformatics. 2011, 12: S2-View ArticleGoogle Scholar
- Wall PK, Leebens-Mack J, Chanderbali AS, Barakat A, Wolcott E, Liang H, Landherr L, Tomsho LP, Hu Y, Carlson JE, Ma H, Schuster SC, Soltis DE, Soltis PS, Altman N, dePamphilis CW: Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics. 2009, 10: 347-10.1186/1471-2164-10-347.PubMed CentralView ArticlePubMedGoogle Scholar
- Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ, D'Ascenzo M, Kitzman J, Wu YQ, Newsham I, Richmond TA, Jeddeloh JA, Muzny D, Albert TJ, Gibbs RA: Whole exome capture in solution with 3 Gbp of data. Genome Biol. 2010, 11: R62-10.1186/gb-2010-11-6-r62.PubMed CentralView ArticlePubMedGoogle Scholar
- Li Y, Vinckenbosch N, Tian G, Huerta-Sanchez E, Jiang T, Jiang H, Albrechtsen A, Andersen G, Cao H, Korneliussen T, Grarup N, Guo Y, Hellman I, Jin X, Li Q, Liu J, Liu X, Sparsø T, Tang M, Wu H, Wu R, Yu C, Zheng H, Astrup A, Bolund L, Holmkvist J, Jørgensen T, Kristiansen K, Schmitz O, Schwartz TW, Zhang X, Li R, Yang H, Wang J, Hansen T, Pedersen O, Nielsen R, Wang J: Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat Genet. 2010, 42: 969-972. 10.1038/ng.680.View ArticlePubMedGoogle Scholar
- Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ: Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010, 7: 111-118. 10.1038/nmeth.1419.View ArticlePubMedGoogle Scholar
- Burbano HA, Hodges E, Green RE, Briggs AW, Krause J, Meyer M, Good JM, Maricic M, Johnson PLF, Xuan Z, Rooks M, Bhattacharjee A, Brizuela L, Albert FW, de la Rasilla M, Fortea J, Rosas A, Lachmann M, Hannon GJ, Pääbo S: Targeted investigation of the Neandertal genome by array-based sequence capture. Science. 2010, 328: 723-725. 10.1126/science.1188046.PubMed CentralView ArticlePubMedGoogle Scholar
- Nielsen R, Paul JS, Albrechtsen A, Song YS: Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011, 12: 443-451. 10.1038/nrg2986.PubMed CentralView ArticlePubMedGoogle Scholar
- Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, Wang J: SNP detection for massively parallel whole-genome resequencing. Genome Res. 2009, 19: 1124-1132. 10.1101/gr.088013.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Vallender EJ: Expanding whole exome resequencing into non-human primates. Genome Biol. 2011, 12: R87-10.1186/gb-2011-12-9-r87.PubMed CentralView ArticlePubMedGoogle Scholar
- Maricic T, Whitten M, Pääbo S: Multiplexed DNA sequence capture of mitochondrial genomes using PCR products. PLoS One. 2010, 5: e14004-10.1371/journal.pone.0014004.PubMed CentralView ArticlePubMedGoogle Scholar
- Mason VC, Li G, Helgen KM, Murphy WJ: Efficient cross-species capture hybridization and next-generation sequencing of mitochondrial genomes from noninvasively sampled museum specimens. Genome Res. 2011, 21: 1695-1704. 10.1101/gr.120196.111.PubMed CentralView ArticlePubMedGoogle Scholar
- McCormack JE, Faircloth BC, Crawford NG, Gowaty PA, Brumfield RT, Glenn TC: Ultraconserved elements are novel phylogenomic markers that resolve placental mammal phylogeny when combined with species-tree analysis. Genome Res. 2012, 22: 746-754. 10.1101/gr.125864.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Kent WJ: BLAT-the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Birol İ, Jackman SD, Nielsen CB, Qian JQ, Varhol R, Stazyk G, Morin RD, Zhao Y, Hirst M, Schein JE, Horsman DE, Connors JM, Gascoyne RD, Marra MA, Jones SJ: De novo transcriptome assembly with ABySS. Bioinformatics. 2009, 25: 2872-2877. 10.1093/bioinformatics/btp367.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucl Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Meyer M, Kircher M: Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc. 2010, 10.1101/pdb.prot5448.Google Scholar
- Trifonov VA, Vorobieva NN, Rens W: FISH With and Without COT1 DNA. Fluorescence In Situ Hybridization (FISH): Application Guide. Edited by: Liehr T. 2009, Springer, 99-109.View ArticleGoogle Scholar
- Ruano G, Kidd KK: Modeling of heteroduplex formation during PCR from mixtures of DNA templates. PCR Methods Appl. 1992, 2: 112-116.View ArticlePubMedGoogle Scholar
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20: 265-272. 10.1101/gr.097261.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: 1000 Genome Project Data Processing Subgroup: The sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- R Development Core Team: R: A language and environment for statistical computing. R foundation for statistical computing, Vienna, Austria. 2011,http://www.R-project.org/,Google Scholar