Mining ancient microbiomes using selective enrichment of damaged DNA molecules

Background The identification of bona fide microbial taxa in microbiomes derived from ancient and historical samples is complicated by the unavoidable mixture between DNA from ante- and post-mortem microbial colonizers. One possibility to distinguish between these sources of microbial DNA is querying for the presence of age-associated degradation patterns typical of ancient DNA (aDNA). The presence of uracils, resulting from cytosine deamination, has been detected ubiquitously in aDNA retrieved from diverse sources, and used as an authentication criterion. Here, we employ a library preparation method that separates molecules that carry uracils from those that do not for a set of samples that includes Neandertal remains, herbarium specimens and archaeological plant remains. Results We show that sequencing DNA libraries enriched in molecules carrying uracils effectively amplifies age associated degradation patterns in microbial mixtures of ancient and historical origin. This facilitates the discovery of authentic ancient microbial taxa in cases where degradation patterns are difficult to detect due to large sequence divergence in microbial mixtures. Additionally, the relative enrichment of taxa in the uracil enriched fraction can help to identify bona fide ancient microbial taxa that could be missed using a more targeted approach. Conclusions Our experiments show, that in addition to its use in enriching authentic endogenous DNA of organisms of interest, the selective enrichment of damaged DNA molecules can be a valuable tool in the discovery of ancient microbial taxa.

Background DNA retrieved from historical or ancient samples is a complex mixture of molecules that contains not only endogenous host DNA, but also DNA from microorganisms that were present ante-mortem or that colonized the tissue post-mortem [1]. Therefore, all ancient DNA (aDNA) shotgun sequencing projects are metagenomic in nature. While earlier aDNA research has mostly focused on the evolution of animals and plants [2,3], a growing number of studies are now centering on the identification and characterization of ancient pathogens and microbiomes [4]. Ancient microbes permit the replacement of indirect inferences about the past with direct observations of microbial genomes through time. By studying ancient pathogens, it has been possible to identify causal and/or associated agents of historical plant and animal disease outbreaks, as well as their spreading patterns throughout both space and time (e.g. [5,6]). Beyond single taxa, another challenging endeavour is the characterization of shifts in composition of microbial communities over time. For example, dental calculus from hominids has been exploited as a source of ancient microbiomes and analyzed in the context of diet and lifestyle changes [7][8][9], whereas coprolites have been used to investigate ecological interactions between animals and microorganisms [10]. However, this approach is at its beginnings and the interactions of evolutionary processes in shaping microbiomes remain to be explored.
A major challenge for the study of aDNA in general, and ancient microbiomes in particular, is the presence of contaminating exogenous DNA, which makes distinction between bona fide ancient microbiome sequences and those of recent origin crucial. One of the most typical features of aDNA is the presence of uracils (Us) that originate from post-mortem deamination of cytosines (Cs), especially in single-stranded overhangs at molecule ends [11]. Uracils are read as thymines (Ts) by most DNA polymerases, which generates a characteristic increase in C-to-T substitutions at the end of aDNA sequences ( [11], Figs. 1d and 2a). This pattern makes aDNA damage distinguishable from biological sequence variation, and thus, its presence can be used as evidence for the authenticity of DNA sequences retrieved from ancient and historical material [12][13][14].
Recently, a single-stranded library preparation method (U-selection) was developed, which allows physical separation of uracil-containing molecules from nondeaminated ones [15]. In U-selection all library molecules are initially immobilized on streptavidin beads, to which molecules without uracils remain attached (U-depleted fraction), while uracil-containing molecules (originally deaminated) are released into solution (Uenriched fraction). U-selection was originally developed with the aim of increasing the amount of ancient hominid DNA (e.g. Neandertals) from a background of present-day human and microbial DNA [15]. However, the method seems to be specially suited to study microbiomes, due to the inherent difficulty to authenticate their ancient origin. This complication arises because microbes can colonize tissues at different times, resulting in different levels of deamination in ancient DNA molecules from different sources. Although sequences that carry terminal C-to-T substitutions can be selected in silico [16,17], there are two factors that could hinder this approach. Firstly, low levels of deamination will reduce the number of molecules suitable for selection in silico. Secondly, high sequence divergence between Fig. 1 Relative enrichment, taxonomic assignment and substitution profiles of Neandertal-derived U-selected libraries. a Relative enrichment (number of reads) in the U-enriched relative to the U-depleted fraction from Vindija Neandertal assigned to the phyla Actinobacteria and Proteobacteria, as well as to Homo sapiens. b Relative enrichment (number of reads) in the U-enriched relative to the U-depleted fraction from Sidrón Neandertal assigned to the phyla Actinobacteria and Proteobacteria, as well as to Homo sapiens. c Taxonomic tree of reads from Sidrón Neandertal assigned to different taxonomic levels. The size of the circle represents the amount of reads assigned to the node displayed in the tree or to any taxonomic level below it. Assignments to the phyla Actinobacteria and Proteobacteria, as well as the species Streptosporangium roseum and Homo sapiens are named in the taxonomic tree. Other taxonomic groups are either unlabeled or removed from the tree to increase clarity. d Cytosine to Thymine substitutions at the 5′ end of reads aligned to S. roseum from the Sidrón Neandertal U-selected library (U-enriched and U-depleted fractions) samples and reference genomes can mask age-associated deamination signals and thereby hinder authentication. Consequently, enriching for deaminated molecules during library preparation is fundamental to tackling these problems. As a proof-of-principle experiment, we used here Uselection in combination with taxonomic binning of Illumina sequenced reads to characterize the microbiomes of Neandertal bones (~39,000 years old), herbarium specimens (between 41 and 279 years old) and plant archaeological remains (~2000 years old) ( Table 1). Instead of exhaustively characterizing the composition of microbiomes in each sample, we aim at identifying authentic and highly abundant historical and ancient microbes through the presence of aDNA-associated damage patterns.

Results
Our experiments were motivated by the previous observation that in some Neandertal samples, e.g. from El Sidrón, Spain, the proportion of Neandertal DNA fragments remains unchanged in both the U-depleted and U-enriched fractions, whereas in others, e.g. from Vindija Cave, Croatia, this proportion increased in the Uracil-enriched fraction [15]. It was hypothesized that the latter effect could have been due to differences in deamination, and hence in age, between Neandertal-and microbial-derived DNA fragments. To explore this effect further, we re-analyzed the previously generated Neanderthal sequence data from both sites by performing taxonomic binning of reads derived from the U-depleted and U-enriched fractions, instead of aligning them only to the human reference genome, as had been done previously. Reads aligning to the two most abundant bacterial phyla (Actinobacteria and Proteobacteria) from the Vindija Neandertals were enriched in the U-depleted fraction, while hominid reads were enriched in the Uenriched fraction (Fig. 1a). This is in accordance with a previous study that reported absence of DNA damage in Actinobacteria derived from a Neandertal bone from Vindija cave [18]. In contrast, in reads obtained from the El Sidrón Neandertals, we found enrichment of both hominid and Actinobacteria reads in the U-enriched fraction, whereas Proteobacteria reads were enriched in the U-depleted fraction (Fig. 1b). Overall, bacteriaderived reads were dominated by the Actinobacterium Streptosporangium roseum (Fig. 1c), which showed almost 50% deamination at the first base in the Uenriched fraction (Fig. 1d), suggesting its ancient origin. The analysis of reads derived from Neandertal bones Fig. 2 Patterns of cytosine to thymine (C-to-T) substitutions at the 5′ end of plant-and Pseudomonas-derived reads. a C-to-T substitutions at the 5′ end of Solanum tuberosum sample KM177500 for a non-selected and U-selected library (U-enriched and U-depleted fractions). b Distributions of C-to-T substitution percentage at first base (5′ end) for non-selected and U-selected libraries (U-enriched and U-depleted fractions). Median values are denoted as black lines and points show the original value for each individual sample. c Substitution patterns at the 5′ end of Pseudomonas syringae and Pseudomonas rhizosphaerae mapped reads from a non-selected library from a Solanum tuberosum sample KM177500. d Cytosine to Thymine substitutions at the 5′ end of P. syringae and P. rhizosphaerae mapped reads from a U-selected library (U-enriched and Udepleted fractions) from a Solanum tuberosum sample KM177500 illustrates how U-selection permits distinguishing between ancient bacteria enriched in the U-enriched fraction and more recent colonizers enriched in the Udepleted fraction.
In order to further evaluate the performance of Uselection in characterizing microbial communities in different sample types and with varying levels of deamination, we selected a set of plant samples including both herbarium specimens and archaeobotanical remains. In particular, we sought to investigate how the method performs in herbarium samples, which have much lower levels of deamination than for example Neandertal remains [19]. We extracted DNA from these plant samples and generated libraries using both a regular doublestranded (ds) approach [20], and U-selection [15]. Sequences from the dsDNA libraries, the most common library preparation method used nowadays, were then used as a baseline to evaluate depletion and enrichment of uracil-containing molecules (Fig. 2a). U-selection successfully enriched for deaminated molecules in all plant samples, as shown by the much higher levels of deamination present in the U-enriched fraction compared with the dsDNA libraries and the U-depleted fraction ( Fig.  2a-b). The plant samples showed substantial variation in the content of endogenous DNA (2.8-91%), which was very similar between the U-depleted and U-enriched fractions, indicating similar levels of deamination between host-and microbe derived reads ( Figure S1A). Assuming that plant-and microbial-derived DNA deaminate at a similar rate [19], this observation indicates that microbes found in our plant samples were present at the time of collection or colonized the tissue shortly thereafter, without substantial recent colonization. The percentage of reads (including host-derived reads) that could be taxonomically binned varied depending on the sample ( Figure  S1B) and, since the host genome was included in the nucleotide database, positively correlated with the percentage of host endogenous DNA ( Figure S1E). The inability to taxonomically assign the vast majority of reads from samples with low endogenous DNA reflects the incompleteness of the reference database compared to the diversity of the microbiomes in those samples. Additionally, single stranded DNA library preparation methods as employed during Uracil enrichment generate shorter reads [21,22], which are more difficult to map to a reference genome and to assign taxonomically to a nucleotide database. This is reflected in the higher percentage of reads mapped and assigned from the dsDNA library compared with shorter reads derived from both the U-depleted and U-enriched fraction (Figures S1C,D and S2A-B). Originally, it was reported that the U-enriched fraction shows a mild increase in GC-content [15], however in the plant libraries analyzed here we did not find a significant difference in GCcontent between the U-depleted and U-enriched fractions ( Figure S2C). In theory, since Us originate from Cs, the Uenriched fraction would be enriched for GC-rich species and GC-rich genomic regions within a given genome. However, as the enrichment would depend on the diversity of taxa present and their relative age difference, and hence difference in deamination, GC-biases, if any, are expected to be highly sample-dependent.
Given the low compositional complexity of microorganisms in the samples included in our proof-ofprinciple experiment, instead of centering our analyses on the compositional assessment of microbial communities, we investigated in detail samples in which a specific microbe or group of microbes were more prevalent based on read abundance. Identifying highly abundant microbes, especially pathogens, is currently the most common use of ancient microbiomes [4]. We identified a large number of reads that were assigned to the bacterium Pantoea vagans in a potato (Solanum tuberosum) and a maize (Zea mays) sample (Fig. 3a). In both samples we found patterns of C-to-T substitutions that suggest the historical nature of the sequenced reads (Fig.  3b). Since P. vagans is a plant epiphyte [23], it is not entirely surprising to find it in two different plant species. We compared the potato and maize P. vagans with publicly available genomes using single nucleotide polymorphisms (SNPs) ascertained in these modern samples, and genotyped in the U-depleted fraction of either specimen. Our analysis linked the two historical strains to a distinct cluster of modern strains based on genetic similarity (Fig. 3c). Based on a set of 432,891 SNPs, the two historical isolates showed 95% SNP identity between them, and an average of 92% SNP identity between historical and modern strains of the same cluster. Conversely, comparisons between historical strains and any modern strain of a different cluster showed only an average of 59% identity at variable positions. In a potato sample where the pathogenic oomycete Phytophthora infestans was previously identified [6], we found a large portion of reads assigned to the bacterial genus Pseudomonas. While most reads were classified only to the genus level, some reads were assigned to either the species Pseudomonas syringae or Pseudomonas rhizosphaerae in different proportions ( Figure S3). The limited length of aDNA fragments made it challenging to gain higher resolution of the taxonomic diversity within the Pseudomonas genus. Therefore, we performed de novo assembly using reads assigned to the genus Pseudomonas, aiming to generate longer sequences (contigs) that allow more detailed taxonomic classification. We aligned the contigs to the reference genomes of P. syringae and P. rhizosphaerae, which resulted in covering about 80% of both reference genomes ( Figure S4). To reliably identify contigs that originate from either P. syringae or P. rhizosphaerae, we subsequently filtered for contigs that aligned uniquely to either reference genome ( Figure S5A). Among these species-specific contigs, we found different k-mer coverage distributions in contigs aligning uniquely to either genome ( Figure S5), an observation that reinforced our confidence in the presence of the two Pseudomonas species in this sample. Due to the high level of sequence divergence between the Pseudomonas in our sample and the reference genomes present in the database, it is difficult to assess typical deamination patterns in the dsDNA library (Fig. 2c). However, we were able to examine damage patterns in both Pseudomonas species using the U-enriched fraction (Fig.  2d), since the C-to-T signal is amplified and thus much higher than the basal level of substitutions.
In summary, we showed here that the U-selection method selectively enriches for authentic microbial aDNA molecules in samples from plant and animal tissues with a wide-distribution of ages and deamination levels.

Discussion
The utility of DNA from ancient and historical specimens is being recognized in a growing number of fields, ranging from human, animal and plant genetics, to microbiology and epidemiology of infectious diseases. Providing positive evidence for the authenticity of such ancient DNA from diverse sources is instrumental for all studies that make use of this resource, as mistaking modern contaminants or colonizers for ancient microbes can drastically influence the interpretation of results. Authentication is especially challenging when studying ancient microbes, due to their high genetic diversity and the incompleteness of reference databases. The method we employ and characterize here aids this process through the selective enrichment of molecules that carry signatures of age-associated degradation. For instance, in P. vagans, Uselection increases the fraction of molecules carrying a terminal C-to-T substitution at the 5′-end 2-3 fold over the library without enrichment, relative to the total number of molecules sequenced. We think that the application of Uselection for ancient microbiome research will be particularly useful in samples with minute levels of deamination, where the nucleotide divergence between samples and reference genomes will obscure the identification of the C-to-T pattern typical of aDNA, as well as in moderately or heavily deaminated samples which carry modern contaminants, since in those samples ancient taxa would be efficiently enriched.
Since it is extremely difficult to differentiate between antemortem and early post-mortem colonizers based only on deamination patterns, it is fundamental to also evaluate the biological relevance of detected taxa by comparing them with reference modern microbiomes.

Conclusions
We have applied the selective enrichment of uracil carrying DNA fragments to ancient DNA metagenomics. The presence of uracil is associated with age-associated degradation, and is commonly used as an authentication criterion when identifying ancient DNA sequences. We show that by assessing the relative abundance of microbial taxa in both the uracil enriched fraction and the uracil depleted fraction, it is possible to identify the presence of authentic ancient microbial taxa. Additionally, the "magnifying glass" effect of uracil enrichment on the detection of deamination patterns can help authenticating very divergent microbial taxa with moderate levels of degradation. Since there is no intersection of molecules between the U-selected and Uenriched fractions, splitting the sequencing efforts between the two fractions comes at no extra sequencing cost. All in all, we think that selective uracil enrichment is a valuable addition to ancient DNA metagenomics, both for de-novo taxon discovery as well as for cases where authenticity might be contentious.

Sequencing libraries from Neanderthal remains
We used DNA sequencing libraries from Neanderthal specimens (Table 1) prepared by [15], which were sequenced deeper for this study.

DNA extracts from historical plant samples
Previously published DNA extracts from four different plant species were used for this study. These extracts were derived either from herbarium specimens (Arabidopsis thaliana [24], Solanum tuberosum [6], Solanum lycopersicum [6]), or from archaeobotanical remains (Zea mays [25]). Ages ranged from 41 to 279 years for herbarium samples, and 1852 to 1881 years for Zea mays samples ( Table 1). The sequencing data for these samples is available on the European Nucleotide Archive under study number PRJEB30666.

Sequencing libraries from historical plant samples
Three sequencing libraries were produced for each plant derived DNA extract, one using a double-stranded library preparation [20,26] without enzymatic removal of uracils [27], and one for each fraction resulting from the single-stranded U-selection protocol [15].

Sequencing and initial data processing
Since the length of aDNA molecules is often shorter than the read length of the sequencing platform, it is possible that a fragment of the aDNA molecule is sequenced by both the forward and reverse read, and also that a part of the adapter is sequenced [28]. Therefore, it is recommended to merge sequences based on the overlapping fraction sequenced by both forward and reverse reads [28]. We remove adapters and merged sequences using the software leeHom with the "--ancientdna" option [29] 91-99% of read pairs were merged (mean: 96.9%). Using the U-selection protocol, 81-97% of read pairs were merged (mean: 93.4%). Putative chimeric sequences were flagged as failing quality.

Mapping of sequenced reads to their host genome
Merged reads were mapped as single-ended reads to their respective or most closely relative genome: Zea mays [30], Arabidopsis thaliana [31,32], Solanum tuberosum [33], Solanum lycopersicum [34], Homo sapiens (Genome Reference Consortium Human Build 37). The mapping was performed using BWA-MEM (version 0.7.10) with default parameters, which includes a minimum length cutoff of 30 bp [35].

Metagenomics assignment of sequenced reads
Reads were aligned to the full non-redundant NCBI nucleotide collection (nt) database (downloaded January 2015) using MALT (version 0.0.12, [36]) in BlastN mode. The resulting RMA files were analyzed using MEGAN (version 5.11.3, [37]). The reads were assigned to the NCBI taxonomy using a lowest common ancestor algorithm [37].

Mapping of sequenced reads to microbial genomes
Libraries were mapped to microbial reference genomes of interest, after the presence of certain taxa was detected during metagenomic assignment. Specifically, the references of Streptosporangium roseum [38], Pseudomonas syringae pv. syringae B728a [39], Pseudomonas rhizosphaerae [40] and Pantoea vagans [23] were used. Since mapping metagenomic libraries to bacterial reference genomes is very prone to false alignments, we used a different mapping strategy for these genomes. The mappings were performed with bowtie2 (version 2.2.4, [41]), with the settings "--score-min 'L,-0.3,-0.3' --sensitive --end-to-end" to increase stringency.

Assessment of nucleotide substitution patterns
All types of nucleotide substitutions relative to the reference genome were calculated per library using mapDamage 2.0 (v. 2.0.2-12, [42]). The percentages of C-to-T substitutions at the 5′ end were extracted from the output file 5pCtoT_freq.txt produced by mapDamage.

Pantoea vagans genomic variation
In order to reduce the effect of aDNA-associated C-to-T substitutions on variant discovery, we used exclusively the Udepleted fraction of libraries where P. vagans was detected in the metagenomic screening. The libraries were mapped to the P. vagans reference genome using BWA-MEM, to reduce reference bias and increase SNP discovery. False alignments from the metagenomic libraries posed a lesser problem here, as variants were ascertained based on modern material. Variants for historical samples were called for both libraries together using the bcftools (version 1.8, [43]) utilities mpileup ("bcftools mpileup -q 1 -I -Ou -f $REF $IN1 $IN2") and call ("bcftools call --ploidy 1 -m -O -z"). Additionally, 11 assemblies of different contiguity were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/genome/genomes/2 707). These assemblies were aligned to the reference genome using minimap2 (version 2.10-r764, [44]) and its "asm20" parameter preset. Only strains with at least 80% reference coverage were kept for subsequent analysis (9/11, average reference coverage: 91%). The paftools utility, which is distributed with minimap2, was used to call variants from these alignments, with the parameter set "-l 2000 -L 5000". All resulting VCF files from modern samples were merged using bcftools' merge utility with the parameter "--missing-to-ref", assuming that those positions not called by paftools in any one sample were indeed reference calls. The merged VCF from modern material was then merged with the VCF from the two historical samples using bcftools (version 1.8, [43]), and filtered to include only full information, biallelic SNPs. This approach discovers sites, which are segregating in modern material, and have read data (be it reference, alternative or segregating sites) in both historical samples. The resulting VCF file was loaded into R using vcfR (version 1.7.0, [45]), and a PCA was produced by converting the information into a genlight object using adegenet (version 2.0.1, [46]) in R (version 3.3.3, [47]).

Pseudomonas spp. assembly and evaluation
To evaluate the presence of Pseudomonas spp. strains in a Solanum tuberosum historic herbarium sample, we extracted from this library all reads that were taxonomically assigned to the Pseudomonas genus or to inferior taxonomic levels within it. These reads were then assembled using SPAdes (version 3.5.0) with default parameters [48]. The resulting contigs were filtered for a minimum length of 2Kb, which yielded 3314 contigs with a total length of 16 Mb (avg. contig: 4.8 kb; N50: 5.3 kb). We used the lastz (version 1.03.66, [49]) and Circos (version 0.64, [50]) interface of AliTV [51] to align these contigs to either the P. syringae or P. rhizosphaerae reference genome. We were able to align 72% of contigs to either one or both of these reference genomes in alignments of at least 1Kb. We then extracted all contigs which had alignments of at least 10Kb in length and were unique to one of the reference genomes. These contigs were again aligned to their corresponding reference using AliTV as described above. Using the "links" file produced by AliTV, we visualized all alignments with circlize [52]. Additionally, we used uniquely aligning contigs to assess their average kmer coverage during the assembly as reported by SPAdes.
Additional file 1: Figure S1. Mapped and taxonomically assigned reads of plant historical specimens. A. Distributions of percentage of mapped reads for non-selected and U-selected libraries (U-enriched and Udepleted fractions). B. Distributions of percentage of taxonomically assigned reads for non-selected and U-selected libraries (U-enriched and U-depleted fractions). C. Correlation of the percentage of mapped reads between the U-enriched and the non-selected library D. Correlation of the percentage of taxonomically assigned reads between the U-enriched and the non-selected library E. Relation between percentages of mapped and taxonomically assigned reads from U-selected libraries (U-enriched fraction). Additional file 3: Figure S3. Taxonomic tree of reads from a Solanum tuberosum library assigned to different taxonomic levels. The size of the circle represents the amount of reads assigned to the node displayed in the tree or to any taxonomic level below it. Reads assigned to some species Phytophthora infestans, Pseudomonas syringae and Pseudomonas rhizosphaerae, as well as the genera Pseudomonas and Solanum are named in the taxonomic tree.
Additional file 4: Figure S4. Analysis of de novo assembled Pseudomonas contigs from a Solanum tuberosum sample. Contig coverage of Pseudomonas syringae (A.) and Pseudomonas rhizosphaerae (B.). The circles represent each circular bacterial reference genome, and the black lines depict genomic regions covered by alignments of de novo assembled contigs. C. Percentage of reference genome of P. syringae and P. rhizosphaerae covered by de novo assembled contigs from A. and B., respectively.
Additional file 5: Figure S5. Analysis of uniquely mapped Pseudomonas contigs from a Solanum tuberosum sample. Uniquely mapped contig coverage of Pseudomonas syringae (A.) and Pseudomonas rhizosphaerae (B.). The circles represent each circular bacterial reference genome, and the black lines depict genomic regions covered by alignments of uniquely mapped contigs. Histograms of contig k-mer coverage from de novo assembled contigs uniquely mapping to P. syringae (C.) and P. rhizosphaerae (D.).