The wolf reference genome sequence (Canis lupus lupus) and its implications for Canis spp. population genomics

Gopalakrishnan, Shyam; Samaniego Castruita, Jose A.; Sinding, Mikkel-Holger S.; Kuderna, Lukas F. K.; Räikkönen, Jannikke; Petersen, Bent; Sicheritz-Ponten, Thomas; Larson, Greger; Orlando, Ludovic; Marques-Bonet, Tomas; Hansen, Anders J.; Dalén, Love; Gilbert, M. Thomas P.

doi:10.1186/s12864-017-3883-3

Research article
Open access
Published: 29 June 2017

Shyam Gopalakrishnan ORCID: orcid.org/0000-0002-2004-6810¹,
Jose A. Samaniego Castruita¹,
Mikkel-Holger S. Sinding^1,2,
Lukas F. K. Kuderna^3,4,
Jannikke Räikkönen⁵,
Bent Petersen⁶,
Thomas Sicheritz-Ponten⁶,
Greger Larson⁷,
Ludovic Orlando¹,
Tomas Marques-Bonet^3,4,8,
Anders J. Hansen¹,
Love Dalén⁹ &
…
M. Thomas P. Gilbert^1,10,11

BMC Genomics volume 18, Article number: 495 (2017)

Abstract

Background

An increasing number of studies are addressing the evolutionary genomics of dog domestication, principally through resequencing dog, wolf and related canid genomes. There is, however, only one de novo assembled canid genome currently available against which to map such data - that of a boxer dog (Canis lupus familiaris). We generated the first de novo wolf genome (Canis lupus lupus) as an additional choice of reference, and explored what implications may arise when previously published dog and wolf resequencing data are remapped to this reference.

Results

Reassuringly, we find that regardless of the reference genome choice, most evolutionary genomic analyses yield qualitatively similar results, including those exploring the structure between the wolves and dogs using admixture and principal component analysis. However, we do observe differences in the genomic coverage of re-mapped samples, the number of variants discovered, and heterozygosity estimates of the samples.

Conclusion

In conclusion, the choice of reference is dictated by the aims of the study being undertaken; if the study focuses on the differences between the different dog breeds or the fine structure among dogs, then using the boxer reference genome is appropriate, but if the aim of the study is to look at the variation within wolves and their relationships to dogs, then there are clear benefits to using the de novo assembled wolf reference genome.

Background

In light of the ever-decreasing cost of high-throughput DNA sequencing, it is now possible to undertake large-scale genomic studies at not only the population level, e.g. [1, 2], but also the population paleogenomic level, e.g. [3,4,5,6,7,8]. While these datasets are being exploited across a growing range of applied questions, a number of research groups are beginning to also focus on how to interpret and treat this data in a way that minimizes biases, and thus yields robust inferences from the data.

Several human population genomic studies have noted the existence of biases that arise when mapping the resequenced genomes of diverse individuals to a reference genome based on a single individual. Alignment against a single reference genome can make samples appear more similar to the reference genome than they are, and can lead to underestimation of the variation present in samples that come from a different population or species than the reference genome [9,10,11]. New mapping techniques are being developed to overcome these biases by allowing mapping to multiple genomes [12]. These methods rely on a high number of sequenced and de novo assembled samples, or a catalogue of polymorphisms for all the populations in the study. For species other than humans, such resources are scarce. Ultimately, these biases imply that thorough annotation of all variation in a genomics data set requires every individual to be represented by a de novo assembly [13,14,15]. Though this ideal is not feasible for a variety of economic reasons, there is a need to broaden the pool of reference genomes to ensure that we can minimize the effects of these biases on downstream analyses.

A research discipline where population genomics is rapidly making significant contributions is the study of domestication – a topic that has long held academic interest due to both its applied relevance and its broad general public appeal. Genomic and paleogenomic resources have previously been used to address major questions in domestication, including deciphering the population structure and admixture patterns in modern and wild lineages [16,17,18], discovering structure among ancient pre-domestic lineages [6, 19,20,21,22], and estimating levels of introgression from wild lineages into domesticated stocks [17, 23], applied to a multitude of species, such as maize [6, 16, 22], silkworms [24], chickens [25,26,27], and pigs [28, 29].

Although these analyses can offer powerful insights into the domestication process, they come with their own sets of challenges. While the major challenge is the need to account for genetic diversity that has been lost as a result of full or partial extinctions of original wild lineages, mapping biases arising from experimental design, such as choice of reference genome, also pose a hurdle to robust analyses. At least one domestication-related study has demonstrated that these effects can be considerable. In Orlando and colleagues’ [19] study of the genomic sequences of six horses (one from a pre-domestication Pleistocene sample), they showed how a variety of analyses, such as D statistics, population divergence and heterozygosity estimation, led to different results depending on whether their resequenced genomes were mapped to the EquCab2.0 [30] reference genome or to a de novo assembly of the donkey genome. They attribute many of these biases to differences in how closely related the samples are to the horse reference genome. This problem is exacerbated in studies that include ancient, pre-domestication samples, since the reference genomes are predominantly constructed using modern samples. Another difference between reference genomes that might lead to different results in downstream analyses is the technology used to generate them. Many older reference genomes were generated using Sanger sequencing, while the newer reference genomes and resequenced genomes in studies have been generated using Illumina short read sequencing technology. Although the underlying causes for the biases remain unresolved, one powerful approach is to perform the analyses using several different closely related reference genomes, thus accounting for biases introduced by the mapping procedures and ensuring that the results are consistent across the choice of reference genomes.

With regard to the need for multiple reference genomes, while a number of genomics studies have recently been published that relate to the relationship between dogs and wolves, the sequence data from genome resequencing studies [21, 31,32,33,34] have either been mapped to the only currently available reference genome, that of the Boxer dog (CanFam3.1) [35], or compared to data drawn from SNP (Single Nucleotide Polymorphism) chip arrays developed to target variation in dog genomes [36, 37]. The results of such studies show that dogs are monophyletic with respect to wolves, and indicate the existence of a deep split between the modern wolf and dog lineages, and a deep split within the dogs as well [21].

There are still several questions regarding wolf and dog phylogeny, population history and domestication that remain unanswered. Although the results of these studies are largely consistent, there are some inconsistencies in the findings regarding the location and the time of the domestication event [21, 36, 38, 39]. It has also been suggested that the population of wolves that are ancestral to the modern dogs may be extinct [21, 32, 34].

It is possible that one explanation for discrepancies between studies is that important structural variation in the wolf genome is missed or misplaced by mapping to a dog reference, or targeting SNPs developed for dog variation. To test this hypothesis, we de novo generated the first wolf reference genome, then remapped the genomic datasets previously published by Wang et al., Freedman et al. and Zhang et al. [31,32,33]. We subsequently re-analysed the published and remapped data in the context of divergence, admixture and systematics, in order to explore whether any reference genome-specific biases occur.

Results

De novo reference genome assembly

In order to construct a de novo reference genome using a wolf, we generated a combination of 5–8 kilobase and 3 kilobase mate pair libraries, as well as 650 basepair and 180 basepair insert libraries. These were sequenced with 101 basepair paired end reads using 5 lanes of an Illumina HiSeq 2500, where one lane was allocated to the multiplexed mate-pair libraries, one lane to the 650 basepair insert library and the remaining three lanes to the 180 basepair insert libraries. Overall, this generated 30× coverage of the genome. The de novo reference genome was assembled using the ALLPATHS-LG assembler [40]. The final assembly consisted of 8747 scaffolds, of which 8569 scaffolds were longer than 1 kilobase. The longest scaffold was 12.88 megabases. The scaffold N50 of the assembly was 1.56 megabases and the scaffold N80 was 512 kilobases; the contig N50 was 94 kilobases and the contig N80 was 34 kilobases. The total length of the assembly was 2.34 gigabases, and the scaffolds longer than 1 kilobase covered more than 99.99% of the assembly.
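The N50 and N80 values quoted above follow the usual definition: the length L such that sequences of length L or longer together contain at least 50% (or 80%) of the assembled bases. A minimal Python sketch of that calculation is given here for orientation only. The function name `nx` and the toy lengths are illustrative assumptions and are not taken from the assembly pipeline used in this study.

```python
def nx(lengths, percent=50):
    """Return the Nx statistic (N50 for percent=50, N80 for percent=80).

    Sequences are sorted from longest to shortest and accumulated until
    they cover at least `percent` of the total length; the length of the
    sequence that crosses that threshold is returned.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return ordered[-1]


# Hypothetical scaffold lengths (arbitrary units), not data from the paper.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(toy_scaffolds, 50)
n80 = nx(toy_scaffolds, 80)
print(n50, n80)
```

Applied to the scaffold lengths of an assembly this yields the scaffold N50 and N80; applied to contig lengths it yields the contig equivalents.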

Landscape of common repeats

To compare abundances of repetitive elements between the wolf assembly and canFam3, we sought to detect common interspersed repeats in both of them. We identify 902 megabases of repetitive elements along the wolf assembly, corresponding to 39.8% of the non-gapped assembly. We detect a similar, albeit slightly higher, amount of repeats in canFam3 (1009 megabases, or 42.1% of the non-gapped assembly). When stratifying repetitive elements by their respective superfamilies, we observe similar abundances in the wolf and the dog assembly (see Additional file 1: Figure S2), with the exception of satellite sequences, a family of repetitive elements most commonly found in the telomeric and centromeric regions of the chromosomes. To investigate the patterns underlying the differences in repeat annotations, we calculated the evolutionary distance of each annotation to its consensus sequence. Overall, the divergence landscapes are very similar; however, we observe a depletion of young and highly identical long interspersed nuclear element (LINE) and short interspersed nuclear element (SINE) insertions in the wolf assembly, most likely as an artifact of the sequencing and assembly strategy (see Additional file 1: Figure S3).

Mapping, coverage statistics

Since the choice of reference genome directly affects the mapping process, we compared the efficiency of mapping previously published short reads to the reference genome when using one of the two genomes used in this study, viz., the dog reference genome [35] and the de novo assembled wolf reference genome. We compared the proportion of uniquely mapped reads for each sample and the depth of coverage across the genome. As shown in Table S1 (Additional file 1: Table S1), we find that the samples that come from the same sub-species as the reference genome, i.e. dogs when using the dog reference genome and wolves when using the wolf reference genome, have a higher proportion of reads that map uniquely to the genome. As a result, they also have a slightly higher coverage across the genome. Note that we do not find a large difference in coverages or proportions of reads that map uniquely, and the effect is consistent across all samples.

PCA

We performed a principal components analysis (PCA) to identify the major axes of variation in the genotype data. Fig. 1 shows the results of the PCA using data mapped to either the reference dog or the de novo wolf genome assembly. For this analysis, we used only common variants with minor allele frequency greater than 0.05. Irrespective of the reference genome used for the alignment, the first two principal components separate dogs from wolves. The proportions of variance explained by the first and second principal components are also very similar across the choice of the two reference genomes (see Fig. 1). Changing the missingness or allele frequency threshold leads to qualitatively similar results (Additional file 1: Figure S4).
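To make the minor allele frequency threshold concrete, the following Python sketch filters biallelic sites on hard-called genotypes. It is a simplified illustration under stated assumptions: genotypes are coded as 0, 1 or 2 copies of the alternate allele with `None` for missing calls, and the function names are hypothetical. The study's own filtering may have operated on different inputs (for example genotype likelihoods), which this section does not specify.

```python
def minor_allele_frequency(site_genotypes):
    """MAF of one biallelic site from 0/1/2 alternate-allele counts.

    Missing calls (None) are ignored; a site with no calls gets MAF 0.
    """
    called = [g for g in site_genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))  # diploid: 2 alleles per sample
    return min(alt_freq, 1.0 - alt_freq)


def common_sites(sites, maf_min=0.05):
    """Indices of sites whose MAF is strictly greater than maf_min."""
    return [i for i, site in enumerate(sites)
            if minor_allele_frequency(site) > maf_min]


# Toy data for 23 diploid samples (hypothetical genotypes, one list per site).
toy_sites = [
    [1] + [0] * 22,     # 1 alternate copy  -> MAF 1/46, filtered out
    [2] + [0] * 22,     # 2 alternate copies -> MAF 2/46, filtered out
    [2, 1] + [0] * 21,  # 3 alternate copies -> MAF 3/46, kept
    [2] * 23,           # fixed for the alternate allele -> MAF 0, filtered out
]
kept = common_sites(toy_sites)
print(kept)
```

Under these assumptions, and with 23 fully genotyped diploid samples, an allele must be observed at least three times to pass a 0.05 threshold, so singletons and doubletons do not contribute to the PCA.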

Heterozygosity

We compared the estimates of per-sample heterozygosity using alignments to the two different reference genomes. Table S1 (Additional file 1: Table S1) shows that the estimated heterozygosity of the samples depends upon the reference genome used for mapping. The heterozygosity estimates for dogs are consistently higher, by up to 10%, when using the dog reference genome compared to the de novo wolf genome assembly.
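As an illustration of the kind of comparison described here, the sketch below computes per-sample heterozygosity as the fraction of called sites that are heterozygous, then expresses the difference between two call sets as a relative change. This is an assumed, simplified estimator on hard-called genotypes; the estimator actually used for Table S1 is not described in this section, and the genotype vectors are toy values, not results from the study.

```python
def heterozygosity(genotypes):
    """Fraction of called sites that are heterozygous.

    `genotypes` holds 0/1/2 alternate-allele counts for one sample,
    with None marking sites that could not be called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called sites")
    return sum(1 for g in called if g == 1) / len(called)


def relative_difference(value, baseline):
    """Relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


# Toy call sets for one hypothetical dog sample mapped to each reference.
calls_dog_ref = [1] * 11 + [0] * 989   # 11 heterozygous sites out of 1000
calls_wolf_ref = [1] * 10 + [0] * 990  # 10 heterozygous sites out of 1000

het_dog_ref = heterozygosity(calls_dog_ref)
het_wolf_ref = heterozygosity(calls_wolf_ref)
rel = relative_difference(het_dog_ref, het_wolf_ref)
print(het_dog_ref, het_wolf_ref, rel)
```

In this toy case the estimate from the dog-reference call set is about 10% higher than the estimate from the wolf-reference call set, which is the scale of the largest difference reported above for dogs.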

Population size

We additionally used the pairwise sequentially Markovian coalescent (PSMC) [41] to explore the effect of reference genome on the estimated population size history of the populations that the resequenced individuals were obtained from. Figure 2 shows the reconstructed population size history for a subset of the samples in our study. The comparison of the population sizes shows that the estimates obtained are largely consistent. For the dogs in this study, the population size trajectories estimated using the two different reference genomes coincide beyond 10 kya. However, the effective population sizes for the wolves are slightly lower when using the wolf reference genome than when using the dog reference genome. We observed reference genome specific differences in the recent histories, which can be attributed to the difference in the rare/private variants discovered in the two species when using the different reference genomes. If the primary effect of changing the reference genome is in the number of rare variants discovered, the effect on analyses such as PSMC will be greatest in the recent population size estimates. As PSMC does not have the power to estimate these parameters well, the effect of this bias is not expected to be high in this analysis.

Phylogeny

We used RAxML and ExaML [42, 43] to estimate the phylogenetic relationships between samples using the variants identified by aligning to the wolf or the dog reference genome. Since our analysis only uses variant sites, we accounted for the ascertainment scheme of the variants using the ascertained version of the GTRGAMMA model of sequence evolution. As shown in Additional file 1: Figure S1, the overall topology of the resulting phylogenies differs depending on the choice of the reference genome. Specifically, when using the dog reference genome, the dogs and wolves are reciprocally monophyletic, whereas when using the de novo assembled wolf reference genome, the dogs were monophyletic with respect to the wolves but the wolves were not monophyletic with respect to the dogs. Note that the nodes that differ between the two topologies have very low bootstrap support values. Additionally, using a neighbour joining approach to estimate the phylogenetic relationships led to qualitatively similar results (data not shown).

Admixture

We estimated the ancestry proportions in the 23 samples using ngsAdmix [44]. When using two ancestry components for estimating admixture proportions, dogs and wolves are split into two different clusters for both choices of reference genome. In both cases, all the wolves, except for the high altitude wolves from the Zhang study [33], show up to 20% of the estimated dog ancestral component (Fig. 3). Increasing the number of estimated ancestral components from two to three leads to similar results, with the dogs and the wolves being separated into two clusters. Additionally, the wolves split into two clusters where the high altitude wolves are separated from the rest of the wolves. Further, the contribution of the estimated dog ancestry components in the wolves becomes negligible.

When estimating admixture with four ancestry clusters, the choice of the reference genome has an impact on the qualitative outcome of the admixture analyses. When using the de novo wolf reference genome, the newly added ancestry component separates the golden jackal (Canis aureus) from the other samples, whereas using the boxer dog reference genome reveals additional structure in the wolves, with the golden jackal assigned to one of the clusters containing the wolves. When estimating a higher number of ancestry components, the additional ancestry components explain variance in dogs if the dog reference genome was used and conversely, the use of the de novo wolf reference genome leads to additional structure in the wolves.

Discussion

Previous studies have speculated that the choice of reference genome has wide ranging effects, especially on the identification of population structure and the timing of demographic events in studies using multiple related species. This problem is expected to be exacerbated when the reference genome is closer to some species in the study than others. Given that there is currently a considerable amount of effort being applied to the sequencing and analysis of dog and wolf genomes, we decided to both explore the impact of the phenomenon in general, and specifically explore whether it holds implications for the results of several relevant previously published dog and wolf genome studies. In this regard, because the time of divergence between dogs and wolves is relatively recent (a conservative estimate of the divergence time is around 35,000 years ago [31, 34]) and the genetic divergence between the extant wolves and modern dogs is low, we did not, a priori, expect the choice of the reference genome to have a big impact on the qualitative inferences in the standard population genetics analyses. Overall, our findings bear this expectation out - the analyses that are primarily driven by common variation, such as principal components analysis and admixture analysis with a low number of clusters, result in very similar findings across the two reference genomes.

Nevertheless, since these two species are genetically very similar, the rare and/or private variation is informative for the differences between the two species. Regarding these variants, the choice of reference genome is clearly more important than for the common variants. As shown in both the table of heterozygosity (Additional file 1: Table S1) and the results from admixture analyses with a higher number of estimated ancestry clusters (Fig. 3), the rare variation in the two datasets can lead to qualitatively different results. This is especially evident in the admixture analyses with four or more clusters, where the structure that is revealed is dependent on the choice of the reference genome. Using the data aligned to the dog genome results in earlier identification of structure in dogs, and vice versa.

One main concern when interpreting these results is the differences in the quality of the two reference assemblies. Clearly, the dog reference genome is in a much more mature state than our de novo assembly of the wolf reference genome. This difference in quality could lead to biases in the analyses, especially analyses that require large continuous regions with variant calls, e.g., effective population size estimation using PSMC as well as characterization of inbreeding levels using runs of homozygosity. Although the effective population size estimates are consistent for the two reference genomes, the difference in quality of assembly could result in different estimates in the most recent time periods, where the methods are typically underpowered.

The effect of the choice of the reference genome seems to be limited to analyses that rely on low-frequency and private variants. When comparing the effect of mapping against wolf and dog reference genomes, we found the largest effect in the higher order structure identified in the wolves or dogs when estimating ancestry components. At lower numbers of ancestry components, the choice of reference genome had no effect on the identification of clusters.

In this study, neither of the two reference genomes used was equally distant from the wolf and dog samples analysed. Ideally, one could use the genome of a relatively close outgroup – the golden jackal in our case – to ensure that there are no biases introduced due to the choice of the reference genome. Although this would avoid the pitfalls of choosing a reference genome that is closer to some of the samples than others, it may not be feasible in many cases, e.g. due to the relatively high economic and computational costs of generating outgroup genomes, or the absence of an appropriate outgroup. Since the reference genomes for most studies tend to not be equally distant from all samples, it is important to account for the biases while interpreting the findings from population and phylogenetics analyses.

Conclusions

We have generated the first de novo assembled wolf reference genome, which will be a useful resource for future studies exploring the genomic structure and relationship between dogs, wolves and other canids. Since the two species that are the focus of this paper are so closely related, the effect of the reference genome was minimal on many of the downstream analyses such as PCA and estimating the phylogeny of the samples. However, some analyses, such as admixture, showed the effects of the reference genome at higher numbers of clusters. Since the use of the wolf reference genome results in identification of population structure that is hidden when using the dog reference genome, we recommend the use of the de novo wolf reference genome for any studies where the focus is on identifying the relationships between wolves and dogs or teasing apart the relationship between the various wolves of the world.