This study was designed to expand genomic resources for the red (silver) fox, an emerging animal model for genetic research whose genome has not yet been sequenced. Using the 454 GS FLX Titanium platform we sequenced non-normalized cDNA libraries from brain samples of two farm-bred foxes with markedly different genetically determined behavioral phenotypes.
Samples for sequencing were selected from prefrontal cortex because this region is easily recognisable in the brain; can thus be collected in a consistent manner; and is related to behavior. The role of different subdivisions of prefrontal cortex in regulating various aspects of cognition, memory, emotions, and social behavior is reviewed in Clark et al., 2010 . In humans, a link between brain damage of prefrontal cortex and aggressive behavior has been reported .
A total of 0.96 Gb of transcriptome sequencing data for the tame fox and 1.21 Gb for the aggressive fox were obtained. The depth of sequencing chosen for this study was based on previous experience sequencing non-normalized cDNA libraries . The fox reads were assembled in three different combinations: i) only reads from the tame fox, ii) only reads from the aggressive fox; iii) all reads together. The percent of reads assigned into contigs in each of the three assemblies was very similar (Table 1).
The assembly software used, the 454 GS De Novo assembler ("Newbler"), takes into account the existence of gene splice variants. Newbler first assembles sets of self-consistent reads into contigs, which are the rough equivalent of exons. It then builds isotigs (the rough equivalent of transcripts) from contigs that are consistently connected by subpopulations of reads. When there are splice variants, contigs can be connected via different paths that are supported by read data, leading to multiple isotigs being generated. Isotigs are grouped into isogroups, which are the rough equivalent of genes. In the context of a transcriptome assembly by Newbler, contigs are the sections of isogroups where corresponding isotig sequences are the same. Consequently, the assembly produces two sets of sequences: a theoretically non-redundant set of contigs, and a set of isotigs which are candidate long- or full-length transcripts from multiple splice variants of genes, and contains some partially redundant sequence segments. Newbler is currently the only transcripome assembler that attempts this task, and is one of the mostly widely used assemblers for de novo assembly of organisms with few genomic recourses .
Comparison of transcriptome assemblies from different studies is inherently problematic because different assembly algorithms handle the complexities of the transciptome in a variety of ways. CAP3, which was used in the assembly of transcriptomes of two mammalian species -- the bank vole (Myodes glareolus) and the domestic ferret (Mustela putorious furo) [14, 10] -- does not split reads into multiple contigs as Newbler does . This makes the number of contigs non-comparable. One measure of comparison is the span of contigs, the total number of bases in the all contigs in an assembly. This number is available for the bank vole, as the number of contigs reported, 63,581, multiplied by the average length of the contigs, 418, for an assembly span of 30.6 Mb. Assembly of the fox transcriptome spans 84.1 Mb. Relative to the bank vole, the fox transcriptome was sequenced to approximately 6.2 times greater depth and the assembly span was higher by 2.7 times.
The number of isogroups, isotigs, and contigs identified in the assembly of all fox reads together was significantly higher than the number of isogroups, isotigs, and contigs identified in the separate assemblies of tame and aggressive reads (Table 1). Singleton reads in the individual assemblies, when assembled together, created new contigs, isotigs and isogroups. Furthermore, sequence variations between individuals increased the number of contigs and isotigs in isogroups, when these variations were sufficiently large for the assembly program to assume them to be different exons, or real differences in splice variants of some genes between the two individuals. The average isotig length and N50 isotig length were also longer in the assembly of all reads together, as would be expected from an increase in the amount of sequencing data (Table 1).
Because the fox genome has not yet been sequenced and a fox RefSeq database is thus not available, we used SwissProt and the dog RefSeq database to identify gene orthologs in the fox transcriptome. BLASTX of fox isotigs from the assembly of all reads together against SwissProt database and mapping of all fox reads against dog RefSeq database identified very similar number of genes (13,624 and 14,418 genes, respectively).
The number of identified genes in the fox transcriptome was significantly lower than the number of isogroups (59,713) found in the assembly of fox reads. The total number of isogroups was also larger than the total number of genes in mammalian genome. One hypothesis is that the amount of sequencing data used in the assembly was not sufficient to cover the full lengths of all fox gene transcripts, therefore some poorly covered genes might have been partitioned into two or more isogroups. Consistent with this hypothesis, BLASTX of fox isotigs against SwissProt identified 13,624 genes, of which 6,952 (51.03%) were found in more than one isogroup (see Results section). In addition, a subset of fox reads (Table 5) was mapped to non-gene regions in the dog genome. These regions are expected not to be present in the dog RefSeq database. We expect that in the fox transcriptome these regions will each be represented by an isogroup with only one isotig assigned to it.
Also, not all genes that are transcribed are described in the databases we used. The genes not in public databases are expected to be genes which would have the more rare, and less covered sequences, and subsequently less complex isogroups. This low level of expression would imply that they would have been less likely to be found and described previously. This may explain why only 28.31% (14,347) of all isogroups with a single isotig were successfully aligned using BLASTX. At the same time, 77.08% (6,960) of isogroups containing more than one isotig (9,030) yielded a hit to the SwissProt database.
The completeness of the fox transcriptome was tested by mapping fox reads combined variously into datasets of different sizes against the dog RefSeq database (Figure 2). One striking result of this experiment was the identification of a very similar number of dog gene orthologs in transcriptomes from tame and aggressive samples (13,618 and 13,855 genes, respectively). The total number of genes identified in both transcriptomes was only approximately 5% larger than a number of genes identified in individual samples (Figure 2). These data demonstrate that, without increasing the depth of sequencing by at least an order of magnitude, it is unlikely that a significantly higher number of rare transcripts would be identified in these transcriptome libraries. The number of genes identified, in this study, from fox pre-frontal brain transcriptome is similar to that identified from mouse cortex (14,787 genes at day E18 and 15,423 genes at day P7, respectively) .
For each accession number in the dog RefSeq database to which at least one fox read mapped, we estimated the breadth of coverage of dog transcripts by fox reads. Figure 3 demonstrates the bivariate coverage of identified transcripts: roughly 75% of dog transcripts had less than 90% coverage and ~25% of dog transcripts had coverage over 90%. We examined the depth of coverage of dog transcripts with relatively low coverage. The slight rise in the middle of the graph, from around 40-60% of coverage contains, as expected, mainly transcripts with a low depth of coverage, but it also contains a significant portion of the transcripts, responsible for the rise above a simple fit curve, that have a large number of fox reads mapped to them. It suggests that in many cases it is not the depth of coverage that is limiting the breadth of coverage. The transcripts in the canine RefSeq database can originate from different tissues and it is likely that for some genes, brain transcripts only partially overlap with transcripts from other tissues. Possible sequence differences between canine and fox transcripts would also reduce the coverage. The breadth of coverage of dog transcripts mapped by reads from two individual samples or from both samples together was almost identical. These data again suggest that a moderate further increase in the amount of sequencing data is unlikely to increase the breadth of coverage of dog transcripts.
The dog is the closest relative of the fox with a sequenced genome . The dog and fox, however, have very different karyotypes, the dog having 78 mostly acrocentric chromosomes, whereas the red fox has 34 metacentric chromosomes [42, 43]. The cytological relationship between the dog and fox genomes is well understood: cytogenetic methods [44–47] and alignment of the fox meiotic map against of the dog genome  have demonstrated that most fox chromosomes correspond to two or three canine chromosomes. The canine genome sequence assembly CanFam2 was used to localize fox sequencing reads and contigs in the dog genome. The approximate location of corresponding regions in the fox genome could be then inferred using the comparative dog-fox genetic and cytogenetic maps.
Mapping fox reads to the canine genome allowed the sequence divergence between the dog and fox genomes to be estimated. One sequence difference (single nucleotide difference, small indel, or difference in 2-3 adjacent nucleotides) between fox and dog was identified per 129 mapped bases. The number of differences between fox chromosomes was one difference per 1,309 mapped bases. The level of divergence between fox and dog coding regions is comparable with the level of divergence between sheep and cattle .
Mapping fox reads against both the canine genome and the assembled fox transcriptome sequence identified 30,491 high-confidence fox-specific SNPs. The filtering criteria were established by Sanger resequencing of SNPs with different numbers of reads per allele. The number of SNPs identified in this study significantly exceeded those identified in the bank vole transcriptome (19,114 SNPs), although filtering parameters in our study (minor allele present in at least three reads in one individual) significantly exceed the filtering parameters used in the vole transcriptome study (each allele present in at least two reads) . These comparisons demonstrate that a greater depth of transcriptome sequencing enables the discovery of a large number of SNP's with higher confidence.
Our filtering parameters yielded 100% confirmation of the polymorphisms in the validation study (Additional file 5, Table S2a,b) but differences in individual genotype (homozygosity vs heterozygosity) identified by 454 sequencing were different from Sanger sequencing in some cases. To validate SNPs by Sanger sequencing, we used cDNA prepared from exactly the same RNA samples which were used for 454 sequencing (Methods section; SNP validation). Therefore, we expect allele specific bias to be equal in the samples used in both 454 and Sanger sequencing experiments. All SNPs for which the second allele in the same sample was not identified by 454 sequencing were among SNPs with the lowest read coverage depth.
Over 90% of SNPs from those with the lowest sequence coverage included in the validation study (minor allele present in at least three reads between the individuals) were also validated by Sanger sequencing. This filter parameter is still stricter than that in the bank vole study . Therefore, for a small risk of false positives, we can also use some of these lower confidence SNP's to increase the total set of SNPs by about 50% to around 45,000 SNPs. Taking into account the significant cost difference between 454 and Sanger technologies, our data suggest that, at the small cost of obtaining a relatively small number of false polymorphisms, SNPs from this category would be worth including in a SNP array. A small number of false SNPs can be easily detected during post-processing analysis of genotyping calls and filtered out prior to further analysis. For regions in the fox genome associated with traits of interest it would also be useful to invest in validation of SNPs represented by an even lower number of reads. Although the false positive rate for these SNPs is expected to be higher, validation experiments can still provide a significant number of SNPs for these critical regions. The 454 technology has been successfully used for developing SNP resources for multiple organisms with limited genomic resources [49–51].
Initially, a large number of SNP's that mapped to the canine X chromosome were heterozygous in at least one male individual. To test whether this represented errors in our data or in the assembly of the canine X chromosome, the contigs bearing these SNPs were mapped onto the Human X chromosome assembly. All fox contigs containing SNPs that mapped to both the human and the dog X chromosome outside of the recognized pseudoautosomal region were homozygous for each individual fox. To gain a better understanding of the high heterozygosity of SNPs assigned to CFAX, we mapped approximately 100 bp region surrounding 50 randomly selected SNPs that were reported as heterozygous on the X back to the dog genome. We found that, in 34% of cases the area surrounding the SNP could not be located unambiguously to the X, or to a single area on the X; 50% of SNPs had high homology to the X but also had significant homology with another area of the genome. This casts doubt on the X chromosome SNPs showing heterozygosity in our male individuals. We subsequently filtered out all SNPs on the CFAX that were not in the pseudoautosomal region and were heterozygous in either individual.
Of all the fox high-confidence SNPs identified 97.8% were localised in the dog genome, and the inferred location in the fox genome was identified for 96.3% (Additional file 4, Figure S3; Figure 4). Heterozygosity of SNPs in the two fox samples was compared and several continuous regions along fox chromosomes where SNPs were fixed for opposite alleles between tame and aggressive sample were revealed. These regions warrant investigation in a larger number of samples from tame and aggressive fox strains for evaluation of putative selection sweeps. One extended region of reduced heterozygosity in the tame sample was observed in a region of VVU12 corresponding to part of a previously identified QTL interval . Overall, the number of high confidence SNPs and their locations on fox chromosomes indicated that now we have a set of SNPs sufficient for genome wide mapping in these populations.
Detailed evaluation of the location in the dog genome of a randomly selected set of 100 fox SNPs revealed that approximately 77% of SNPs are located in transcribed regions and distributed approximately equally between coding regions and UTRs. Approximately 9% of SNPs are located in introns and 14% in non-coding regions. Similar results were also identified in the bank vole transcriptome study . As has become apparent recently, the mammalian transcriptome is more complex than previously recognized [52–54].
Comparison of gene expression between the two fox samples revealed only a relatively small number of genes whose expression level differed by two-fold or more (355 genes p < 0.05). The biofunction groups over-represented in the set of differentially expressed genes differed between the two samples. Genes associated with neurological diseases were over represented in the tame sample. Interestingly, the second biofunction group over-represented in the aggressive sample represents genes associated with cardiovascular diseases. The links between the cardiovascular system and behavior have been increasingly recognized recently [55–58].
To validate differences in gene expression that were identified by transcriptome sequencing, we selected a subset of nine genes for expression analysis by RT-qPCR. Exactly the same two RNA samples (one from a tame and one from an aggressive fox) which had been used for transcriptome sequencing were used for RT-qPCR validation experiments. Eight of nine differentially expressed genes selected for the validation study showed differences in expression in the same direction by both methods. All genes that were validated by RT-qPCR showed differential expression between the two samples in transcriptome analysis with p < 0.04. Although the consistent validation by RT-qPCR of results obtained by transcriptome analysis increases confidence in the observed differences in expression between the two samples, it will clearly be necessary to test these observed differences in a larger set of samples before coming to conclusions regarding biological relevance. Among the genes that were validated by RT-qPCR, HTR2C is particularly noteworthy. HTR2C plays an important role in serotonergic and dopaminergic signalling [59–61]; and is differentially expressed in specific brain regions of two strains of rats - one of which was selected for and the other against aggressive behavior . Increased level of HTR2C mRNA in the frontal cortex and hippocampus of tame rats was observed . Notably, the direction of difference in HTR2C expression between the tame and aggressive fox individuals in the present study was consistent with that between rat strains in the Popova et al. study. None of the genes reported in other studies [63–65] investigating comparative gene expression in canid brain samples were significantly different between the two fox samples in the present study.
Further resequencing and deeper analysis of brain transcriptome, including identification of non-synonymous mutations, comparison of splice variants between tame and aggressive samples, and evaluation of differentially expressed genes in larger independent sample sets (biological replicates) will be necessary to characterize critical differences in these fox strains associated with markedly different behavioral phenotypes. In any experimental system there will be a certain amount of noise that might stem from a variety of sources. These biological replicates would let us explore the differences in the underlying biological system we are interested in and validate the differences in gene expression.