General patterns of protein-coding variation
DNA sequence variants were identified for 146 samples that were sequenced by RNA-Seq (see Supplementary Table S1 for further information about the samples). A total of 585,766 variants were found along 22,022 protein-coding sequences, representing 94.72% of the predicted protein-coding sequences of the A. mexicanum genome assembly. A total of 1229 protein-coding sequences did not contain any variants in this study. The variants were categorized into two groups: i) insertions and deletions of bases (INDELs, 128,504, 21.94%), and ii) single nucleotide variants (3289 multiallelic and 453,973 bi-allelic variants, 0.56 and 77.5%, respectively). Bi-allelic, single nucleotide variants (SNVs) consisted of 282,603 transitions (62.25%) and 171,370 transversions (37.75%; see Supplementary Table S3). SNVs were also classified using their minor allele frequency (MAF), where 251,934 variants presented a MAF > 0.01 (55.5%, for further information about number of SNVs at different MAF values, see Supplementary Table S4).
The mean frequency of protein-coding sequence variation from the identified SNVs was 7.13 SNVs per kilobase, Kb (see the sequences cumulative distribution based on their scaled SNVs frequency in the Fig. 1a). A total of 17 predicted genes had a SNV frequency > 20 per Kb (agrin, mgc83164, sbno2, cbx2, ulk4, pr [AMEXTC_0340000102199_PR], fanca, loc104496260, ube3b, loc104369368, loc108695830, ezh1, mRNA00976, mRNA00719, AMEXTC_0340000073769_hypothetical, and AMEXTC_0340000214929_hypothetical). The number of SNVs may reflect the functional importance of a protein, and the strength and mode of selection that acts to maintain or diversify the underlying coding sequence. PANTHER enrichment tests identified 33 significantly enriched annotation terms, which were supported by at least a set of 100 sequences. Those 33 set of sequences classified under the same term yielded SNV frequency value distributions that deviated significantly from the overall distribution (see Fig. 1b–f and Supplementary Table S5). Three terms were enriched by genes with higher than expected SNV frequency values; these included the Protein Class term G-protein modulator (N = 172, false discovery rate [FDR] p value = 2.70E-04) and Pathways term PDGF signalling pathway (N = 100, FDR p value = 1.03E-02). In contrast, the other 30 terms were enriched by genes with fewer SNVs than expected. Many of these terms were significantly enriched by genes that encode proteins with binding functions, including DNA-binding functions, associated with transcription and gene expression (e.g. Molecular Function gene ontology, GO, terms: nucleic acid binding, N = 838, FDR p value = 5.88E-06; and transcription regulatory activity, N = 506, FDR p value = 2.68E-02; see Supplementary Table S5). These results suggest a similar type of selection on sequences that are annotated with the same term.
Diversity and relationships among RNA-Seq samples
For each RNA-Seq sample, SNVs were counted and compared to the total number of SNVs identified across all samples. On average, 64.30% of the total number of SNVs was represented in any given RNA-Seq sample (see SNVs information for each sample in the Supplementary Table S1). SNVs with non-reference alleles were used to quantify genetic variability among samples by computing heterozygosity ratios. The mean heterozygosity ratio for all samples was 1.42, with values ranging from 0.51 to 5.3 across samples. Mean heterozygosity ratios were also calculated for each project. The project (PRJNA354434) that sequenced multiple tissue samples from a single wild-caught animal had the lowest ratio with a mean of 0.59, followed by the project PRJNA306100 with a mean ration of 0.64, while the project PRJNA378982 had the highest ratio with a mean of 2.69.
Genetic differences among samples were further explored by measuring genetic distance and by using multidimensional scaling analysis, MDS (see Fig. 2a and Supplementary Fig. S1 for a graphical representation of the MDS coordinates, and Supplementary Table S6 for percentage of variance explained by each of the coordinate/dimension). The first three coordinates explained 65.24% of the genetic variance among samples and identified three different clusters that grouped their own samples (PRJNA312389, PRJNA354434 and PRJNA427437). Samples within other projects were clustered together by coordinate six (PRJNA186654 and PRJNA306100), coordinate seven (PRJNA480225), and coordinate nine (PRJNA400170), with each of these coordinates explaining 5.75, 4.97, and 3.7% of the genetic variance, respectively. To characterize the nature of the genetic differences among samples, Admixture and relatedness analyses were performed. The maximum likelihood number of ancestral clusters was four (see Supplementary Fig. S2). The proportional representation of these clusters varied among samples (Fig. 2b). Samples from three projects (the first two already grouped by the MDS analysis, PRJNA312389, PRJNA354434, and PRJNA400170) were represented primarily by one of the four clusters. In terms of relatedness, pairwise comparisons revealed high levels of kinship among samples. The six wild-caught samples (project PRJNA354434) were identified as duplicates (samples from the same organism, technical replicates, 0.354 inferred kinship coefficient or higher). Also, duplicates, first, second and third-degree relationships (kinship coefficients ranging from 0.354 to 0.177, from 0.177 to 0.0884, and from 0.0884 to 0.0442, respectively) were identified for same samples within projects PRJNA480225, PRJNA300706, PRJNA312389, PRJNA378982, PRJNA400170, PRJNA186654. Only all the samples within projects PRJNA306100 and PRJNA427437 showed kinship coefficients lower than 0.0442, which is typical of unrelated samples. Among projects, the majority of the pairwise relationships showed a negative estimated kinship coefficient, which is suggestive of heterogeneity and genetic structure (see Fig. 2c and Supplementary Fig. S3).
The genetic diversity and relationship analyses described above were performed in an unbiased way, without prior information about the sample project. Fixation indexes (FST) as a measure of population differentiation were calculated for each pair of projects to further assess population differences among projects, and to potentially identify projects that used animals from the same population. The estimated FST values ranged from 0.087 (FST between PRJNA300706 and PRJNA378982) to 0.664 (FST between PRJNA306100 and PRJNA354434, see Supplementary Table S7). The highest FST values were estimated for all projects compared against PRJNA354434, which used wild-caught axolotl samples, followed by the values against the project PRJNA306100. This indicates that the wild-caught axolotl samples were the most genetically distinct, sharing less genetic variation with samples from all other projects. The projects PRJNA480225, PRNJA300706, PRJNA378982, PRJNA427437, and PRJNA400170 (with the exception of the pairwise value between PRJNA 400170 and PRJNA480225, FST = 0.262) shared values smaller than 0.2, which might indicate that animals used in those projects came from the same interbreeding laboratory population.
Pigment phenotypes comparison
The RNA-Seq projects isolated tissue mainly from either wild-type or white axolotls. Using a selection of samples from the same interbreeding laboratory population (see Supplementary Table S2 for detailed information of the samples and Supplementary Fig. S4 for a representation of the likelihood of the Admixture analysis for this sample subset), associations between variants of the gene predictions and the pigment phenotypes were identified by genome-wide association analysis using the threshold of adjusted p-value <1e-5 (see Fig. 3 and Supplementary Table S8). A total of 104 SNVs were identified in 63 gene predictions along with the gene edn3. Most of the genes (40%) with significantly-associated variants for pigment phenotype were located to chromosomes 3 and 1 (Chr3: edn3, bcas1, gmeb2, amextc_0340000052082_hypothetical, nsfl1c, dnm2, kcnmb1.l, loc108702636, glucocorticoid, loc102357877, mapre1.l, rad50, tmed1, and farsa; Chr1: mydgf, fbn3, naa38, loc104535489, syt11, loc100619372, nuf2.l, enpp4, lipe, rnf11.l, and top1.2). Variants in several genes across all chromosomes except for Chr14 were identified. Two variants were found in genes that have not been mapped to the chromosome assembly (sfrp2 and loc108717189). The identification of SNVs that are associated with the white pigment phenotype but not physically associated with edn3 reveals genomic differences between white and wild-type axolotls. Genes bearing SNVs associated to wild-type/white phenotypes were generally found more than 10 megabases (Mb) apart, however some genes in the short arm of Chr3 (Chr3P) and a pair of genes in Chr6 and Chr7 were found more closely. A total of 16 variants caused unequivocally a change in the amino-acid sequence of their gene predictions (bcas1, edn3, farsa, fundc2, glucocorticoid, hagh.l, ldb3, loc103760181, loc106732794, loc108702636, pxk, and tmed1; see Supplementary Table S8) resulting in a different phenotype, at least in the amino acid profile, between wild-type and white axolotls beyond their skin coloration.
Private variation of the wild-caught axolotl
Using sequence data from project PRJNA354434, which sequenced RNA from a wild-caught axolotl, 25,825 SNVs (5.69% of the total SNVs) distributed across 10,639 protein-coding genes were identified as unique to this project. A total of five gene predictions showed more than 20 wild-caught private SNVs (loc108701391, tropomyosin, ncaph2, fructose-bisphosphate, and cnp). The frequency of private, wild-specific SNVs among chromosomes was uneven, with significantly more SNVs observed for genes on Chr2 and Chr3, and significantly fewer observed for chromosomes 4, 7, 9, 12, and 13 (see Supplementary Table S9). These variants have been plausibly lost in domesticated axolotls.