PolyCRACKER, a robust method for the unsupervised partitioning of polyploid subgenomes by signatures of repetitive DNA evolution

Background Our understanding of polyploid genomes is limited by our inability to definitively assign sequences to a specific subgenome without extensive prior knowledge like high resolution genetic maps or genome sequences of diploid progenitors. In theory, existing methods for assigning sequences to individual species from metagenome samples could be used to separate subgenomes in polyploid organisms, however, these methods rely on differences in coarse genome properties like GC content or sequences from related species. Thus, these approaches do not work for subgenomes where gross features are indistinguishable and related genomes are lacking. Here we describe a method that uses rapidly evolving repetitive DNA to circumvent these limitations. Results By using short, repetitive, DNA sequences as species-specific signals we separated closely related genomes from test datasets and subgenomes from two polyploid plants, tobacco and wheat, without any prior knowledge. Conclusion This approach is ideal for separating the subgenomes of polyploid species with unsequenced or unknown progenitor genomes. Electronic supplementary material The online version of this article (10.1186/s12864-019-5828-5) contains supplementary material, which is available to authorized users.


Background
Researchers studying two traditionally distinct areas of biology, metagenomics and polyploid genome evolution, face a similar technical challenge: How do you separate closely related genomes or subgenomes from a single sample? Current approaches consist of either supervised binning, requiring a database of known genomes, or unsupervised binning using general genome characteristics as a 'genome signature' to identify and separate DNA sequences from different species/subgenomes [1,2]. Genome signatures can be any kind of bias (e.g. similarity to known sequences, short nucleic acid sequences, GC content, sequence depth) that differs between genomes. In particular, di-and tetra-nucleotide frequencies along with contig co-abundance (a given species may have a distinct abundance relative to others in a metagenome sample) can be used to successfully bin sequences within metagenomes [3]. These coarse approaches work best with highly divergent taxa and there is much room for improvement in sensitivity and accuracy when grouping sequences appropriately at the species level. Furthermore, subgenomes within allopolyploids have the same sequence depth profile and similar di-and tetranucleotide frequencies, making these features uninformative. In fact, orthologous gene sequences between the subgenomes in an allopolyploid (homeologous sequences) are far more similar to each other than they are to other sequences within the same subgenome.
Polyploidization has played a large role in the evolution of all flowering plants and many extant species are recent polyploids [4]. In addition, polyploidization has played a role in the evolution of many fungal, fish and amphibian species [5,6]. A large fraction of polyploid species are derived from the hybridization of different species and are termed allopolyploids. Understanding the evolution and regulation of polyploid genomes requires the creation of accurate assemblies for each subgenome. The high level of sequence similarity between subgenomes makes this particularly challenging and even state-of-the-art genome assembly algorithms often collapse and/or incorrectly interweave chunks of homeologous chromosomes. Significantly, without high resolution genetic maps or related data these errors remain undetected. These errors are not necessarily resolved by long-read technologies, since such assemblies are still fragmented in many thousands of pieces [7]. Lack of subgenome resolution for allopolyploid genomes is a major obstacle to studying the origin, evolution, and functional properties of allopolyploid genomes [8].
For some allopolyploid species, extant diploid species similar to the true progenitors can be used to disentangle the subgenomes [7]. However, for many allopolyploids [9,10], the diploid parents are unknown or extinct. Even when extant progenitor-like species exist, they may differ substantially from the genomes of the individuals that gave rise to the allopolyploid. Thus, unguided methods for identification and extraction of subgenomes within allopolyploids would be extremely useful. Recently a phasing method was developed for the highly heterozygous sweet potato genome [11]. However, this method is limited to highly polymorphic genomes and is restricted to single-copy sequence present in each subgenome that can be accurately genotyped.
Deliberate integration of synthetic transposons with molecular barcodes is a common experimental approach to label subpopulations of DNA or cells to allow subsequent retrieval from complex mixtures [12]. Nature's equivalents of synthetic molecular barcodes include transposons, viruses, and other classes of repetitive DNA. These elements rapidly proliferate and evolve in natural populations, thus labeling the recent evolutionary history of an organism's genome ( Fig. 1) [13]. Transposon families may dramatically expand and contract over short periods of evolutionary time, during which they may also significantly change in sequence identity. For example, the size of the Oryza australiensis genome doubled in just a few million years due to the amplification of a few long terminal repeat (LTR) retrotransposon families [14]. Removal of LTR retrotransposons by illegitimate recombination can eliminate megabases of sequence over similar timescales [13,15]. Thus, two recently diverged species may quantitatively differ in the Fig. 1 Overview of the strategy for determining species/subgenome of origin for sequences based on their profile of repetitive k-mers. The algorithm takes sequences from a sample containing one or more species/subgenome. Each species has both common and divergent repetitive k-mers scattered throughout its genome. Each k-mer is indicated by a different color (for clarity, only divergent k-mers are shown). We construct a graph where sequences are nodes and shared instances of repetitive k-mers are edges. By connecting all instances of a given repetitive k-mer we are able to connect and subsequently group sequences derived from a common species/subgenome frequency of transposon families they harbor and the sequence identity of those families.
In the context of allopolyploids, repeats in the parental species change and proliferate independently after those species diverge creating species-specific genomic signatures. Once reunited within a single cell of an allopolyploid, these repeats are free to proliferate between the subgenomes and these additional copies are specific to the allopolyploid but not the progenitor species. If additional subgenomes are added via subsequent additional hybridizations, then the initial allopolyploid-specific repeat signatures may distinguish initial subgenomes from subgenomes added during later rounds of inter-specific hybridization. Fortuitously, the process of interspecific hybridization may in fact trigger virus and transposon proliferation [16]. Thus, naturally occurring repetitive DNA can be used as molecular barcodes to distinguish genomes and subgenomes.
Here we describe an algorithm that identifies natural molecular barcodes that act as species of origin labels for the DNA sequences in which they reside. Our method creates and partitions a graph in which sequences are connected by edges, corresponding to specific signatures of repetitive DNA inferred from short nucleotide sequences (k-mers). Our method allows accurate separation of the subgenomes of polyploid species without prior knowledge of the genomes or a database of known sequences. We further built a toolkit of functions, implemented as a python package called poly-CRACKER (polyploid Cluster Repeats by Ancestral Common K-mer Estimation and Retrieval), for the automatic binning of sequence scaffolds from polyploid genomes. Our method draws its inspiration from other unsupervised algorithms for binning sequences commonly applied to metagenome datasets, but by focusing on repetitive k-mers with faster rates of divergence between two species, rather than all k-mers or a random sampling of k-mers, our method is not confounded by the overall high sequence similarity between homeologous sequences in allopolyploids or sequencing datasets containing genomes from multiple closely related species.

Results
We designed an efficient unsupervised learning algorithm to automatically identify and bin DNA sequences according to subgenome based on the shared presence of short repeated DNA sequences (k-mers) that act as naturally occurring molecular barcodes (Fig. 1). The workflow is as follows: a polyploid genome is sequenced and assembled to scaffolds that are unordered with respect to their subgenome. Most scaffolds contain multiple k-mers derived from repetitive DNA elements. A graph is constructed where sequences are nodes and their profile of shared repeat-k-mers are edges. By connecting all instances of a given repetitive k-mer, we are able to connect and subsequently group sequences derived from each subgenome (Fig. 1). It is notable that while both high and low copy k-mers may be specific to one subgenome versus another, only high copy k-mers provide the multiple edges between multiple different sequence nodes necessary to group sequences together by subgenome of origin.
polyCRACKER method for unsupervised partitioning of sequence into species bins Unsupervised partitioning of sequences proceeds in two phases: 1) initial partitioning of sequences into subgenome bins via a nearest neighbors graph of repeat-kmers and 2) signal amplification from groups identified in the first phase.
1) Initial partitioning of scaffolds into subgenomes can be broken down into three main steps. a. Repeat-k-mers are selected based on the number of times they appear throughout the entire genome, all subgenomes. Typically, as the k-mer size is increased, the threshold for the lowest allowable k-mer frequency should be reduced, due to a reduction in the number of repeat k-mers found per scaffold. This balance is based on the requirement that any given contig or scaffold in the assembly must have one or more repeat-k-mers associated with it in order for it to be portioned into a subgenome bin. b. Sequences from a sparse count matrix of scaffolds versus k-mers are projected into a lower dimensional space, where the distance between points (scaffolds) is indicative of the degree of shared repetitive k-mer content. c. A nearest neighbors graph is constructed from these projected data points. Each node represents a sequence, while each edge demonstrates that two sequences share a highly similar distribution of repetitive k-mers. In this sense, an edge is a proxy for connecting the instances of shared repeat-k-mers. The scaffolds are partitioned into bins, subgenomes, by cutting this graph at regions with a low density of connections.
2) The signal amplification phase consists of two iterative steps that recruit previously unclassified scaffolds to the appropriate subgenome bin. a. Differential repeat-k-mers are identified between the initially extracted subgenome bins. The frequencies of those differential repeat-k-mers within all genome scaffolds is used to recruit additional sequences into respective subgenome bins. b. A new set of differential repeat-k-mers pertaining to the new subgenomes bins are identified, and are once again used assign sequences to a subgenome. These steps are iterated until the new subgenome bins converge towards a fixed size.

Polyploid genome analysis
The creation of polyCRACKER was motivated by prior sequence binning applications, particularly in the unsupervised metagenomics binning field. However, tools designed for unsupervised binning of metagenomes either require input that is not relevant for allopolyploid analysis, such as multiple samples from different microbial environments, or rely too heavily on large overall differences in k-mer frequency that are typically not found between highly similar allopolyploid subgenomes. However, as discussed above, it has been shown that the profile of repetitive k-mers may vary considerably between allopolyploid subgenomes. We therefore investigated whether a method specifically focusing on the tracking of repeat-k-mers in allopolyploid genomes would enable database-free subgenome binning. Initial results on simulated datasets representing mixtures of closely related microbial genomes showed promise, even when these genomes had relatively low levels of repetitive DNA (Additional file 1). These results encouraged us to turn our attention to the unsupervised separation of subgenomes from the complex allopolyploid genomes of tobacco and wheat.

Subgenome analysis of the allotetraploid plant, Nicotiana tabacum
Nicotiana tabacum is thought to have formed less than 200,000 years ago by the hybridization of two diploid species similar or identical to N. sylvestris (source of the Ssubgenome) and N. tomentosiformis (source of the Tsubgenome). The genome of N. tabacum is slightly smaller than the combined sizes of its diploid progenitors due to the preferential loss of repetitive sequence from the T-subgenome [17]. Initial studies comparing the N. tabacum genome to the genomes of N. sylvestris and N. tomentosiformis assigned about 80% of the assembly to the S-or T-subgenomes [7,18]. A recent study used optical and genetic maps to assign 64% of the assembled N. tabacum sequence to pseudomolecules [7]. A significant complicating factor in the separation of the S-and Tsubgenomes is the large number of translocations that have occurred between the S-and T-progenitor chromosomes [4]. Thus, identifying and separating the ancestral S-and T-subgenomes from the modern allopolyploid requires classification of alternating stretches of contiguous DNA within chromosomes and not simply labeling of whole chromosomes. We created an initial dataset for testing and verification of polyCRACKER by splitting the sequences within the pseudomolecule-anchored portion of the N. tabacum genome into non-overlapping 250 kb segments. This later allowed us to visualize subgenome classification with respect to position within pseudomolecules. This was particularly important for N. tabacum due to the numerous translocations that have occurred between subgenomes and the uncertainty of whether poly-CRACKER could accurately assign progenitor of origin to pseudomolecules containing sequences from different progenitors. We validated the assignment of progenitor of origin by polyCRACKER by using, in this case, the known (and also assembled) progenitors of the allopolyploid, N. sylvestris and N. tomentosiformis. We quantified agreement between polyCRACKER subgenome assignments and the subgenome assignments determined by comparison to the progenitor genome sequences. Relationships between N. tabacum sequences are depicted in principal component analysis (PCA) plots ( Fig. 2a,b) and a spectrally embedded graph (Fig.  2c). Each data point in the PCA plot corresponds to a 250 kb segment and the proximity of the points indicates the degree of shared repetitive k-mers. Points were colored by their inferred progenitor of origin via poly-CRACKER ( Fig. 2a) or by comparison to the diploid progenitor genomes (Fig. 2b). Segments were linked to their 20 nearest neighbors to create a network graph. The energy minimization of a non-repulsive forcedirected graph of the 20 nearest neighbors yields spectrally embedded data (spectral embedding/laplacian eigenmaps); polyCRACKER groups sequences by clustering the spectral embedding of the dimensionality reduced data. This is depicted in Fig. 2c, in which nodes (sequences) are colored according to their species of origin assignment based on comparison to the diploid progenitor genomes. Spectral clustering performs k-means clustering on the spectrally embedded data in Fig. 2c, analogous to making cuts in the weakest links in the graph.
PolyCRACKER correctly classified 99.3% of sequence estimated to belong to the ancestral S-subgenome and 99.7% of sequence belong to the ancestral T-subgenome, based on comparison to the subgenome assignments made using the diploid progenitor genomes. The intersection of polyCRACKER classification and classification based on the diploid progenitor genomes constituted 86.5% of the total N. tabacum assembly, with the remaining sequence almost entirely unclassified by either method (very short scaffolds). We graphically show the high similarity between polyCRACKER and referencegenome based subgenome classification in Fig. 3b, c. We explored k-mer length as a parameter for polyCRACKER performance and found k-mers as short as 15 nucleotides were sufficient to confidently distinguish the tobacco subgenomes (Table 1). Increasing k-mer length (specificity) can decrease sensitivity by decreasing the frequency of repeat-k-mers spread across the genome sequences, but this can in part be compensated for reducing threshold for defining a k-mer as repetitive (Fig. 4, Table 1). Sensible recommendations for selecting this kmer length can be ascertained by observing the number of repetitive k-mers per fragment of the fragmented N. tabacum assembly (Fig. 5). If the number of repetitive kmers per fragment is low, as in our initial test with two fragmented algae assemblies (Additional file 1: Figure  S1), then the k-mer length should be decreased, at the expense of introducing k-mers with less specificity to a given repeat signature.
The N. tabacum analysis described above was based on the subset of sequences anchored into pseudomolecules [7] in order to show the classification agreement distributed across the chromosomes. However, only 64% of the assembled N. tabacum sequence is assigned to pseudomolecules. In order to both provide a comprehensive analysis of N. tabacum subgenomes and to show that polyCRACKER works efficiently and accurately on fragmented draft assemblies, we repeated our analysis on the Illumina/454 draft genome assembly from the same study. This draft assembly was produced prior to longrange scaffolding with optical mapping and has a contig L50 of only 8.9 kb (scaffold L50 of 278 kb). Although this assembly was already composed of many small subsequences, we still split larger scaffolds of this assembly into non-overlapping 100 kb segments in order to partially normalize subsequence lengths. polyCRACKER works best with datasets that do not have wide dispersion in length. Subsequences are later easily tracked back to their origin. We then combined subsequences with all scaffolds greater than 2.5 kb. Scaffolds less than 2.5 kb were excluded because they do not contain enough repetitive k-mers for confident classification. After removing the smallest sequences, this dataset contained 94% of the N. tabacum genome assembly, and Poly-CRACKER was able to assign 99.5% of sequence estimated to belong to the S subgenome (2.1 Gb, 990 Mb S and T genomes are colored green and blue respectively, ambiguous segments are labeled red. a Sequences labeled by similarity to the known progenitor-like species. b Sequences labeled by polyCRACKER's k-mer analysis. Note that a and b are nearly identical. c Spectral embedding of N. tabacum dimensionality reduced information, in which edges (grey lines) represent shared repetitive k-mer profiles) that connect sequences. Sequences labeled by similarity to the known progenitor-like species. d-f Analogous unsupervised grouping as above, but using informative and differential repeats rather than k-mers. Color labels as described above. d Sequences labeled by similarity to the known progenitor-like species. e Sequences labeled by polyCRACKER's repeat analysis. Note that a, b, d and e are nearly identical. f Spectral embedding of N. tabacum dimensionality reduced information, in which edges (grey lines) represent shared repeat element profiles that connect sequences. Sequences labeled by similarity to the known progenitor-like species. Note the increased number of ambiguous subsequences in f as compared to c is due to the fact that there are fewer repeats than k-mers in any given scaffold. This is analogous to the effect of increasing k-mer size substantially more sequence than that anchored to psuedomolecules) and 99.1% of sequence estimated to belong to the T subgenome (1.7 Gb, 262 Mb more sequence than that anchored to psuedomolecules). Sequence assigned to the same subgenome by both polyCRACKER and comparison to the progenitor species constituted 99.2% of the genome ( Table 2).

Subgenome classification in the complex allohexaploid, Triticum aestivum
To show the scalability of our method, we applied poly-CRACKER to group sequences by subgenome of origin for wheat, Triticum aestivum [9]. The enormous allohexaploid wheat genome is approximately five times larger than the human genome and its three subgenomes are Aegilops tauschii for the D subgenome and Triticum urartu for the A subgenome. A chromosome-scale genome assembly of Aegilops tauschii, the D-genome progenitor of wheat, was recently published [19] and assembled scaffolds are available for T. urartu [20]. We used an alignment-based method to assign progenitor of origin to 250 kb non-overlapping segments of the 15 Gb T. aestivum assembly [9]. We used this draft genome of T. aestivum, rather than the more recent chromosomescale assembly, as the draft genome is still a relatively fragmented genome (N50 contig size of 232,659 bases), which is reflective of more typical of assemblies for large, complex, allopolyploid genomes. We used 250 kb subsequences for initial clustering and included all possible smaller sequences later via signal amplification. PolyCRACKER grouped 12.5 Gb of sequence into one of three sequence bins. Overall agreement between poly-CRACKER groups and classification based on alignment to Aegilops tauschii and Triticum urartu was 0.997 (Cohen's Kappa) ( Table 2). In Fig. 6c, the PCA plot of scaffolds versus repeat-kmers (Fig. 6a) were subset and colored by scaffolds belonging to the polyCRACKER-identified A and D genomes. There is a high correspondence between this plot and a PCA plot with the same set of scaffolds colored by their reference-based classification (Fig. 6d).

Repeat annotation and analysis in N. tabacum
In addition to using k-mers to separate subgenomes, PolyCRACKER can use repetitive elements, such as LTRs and other transposons, found via common repeat finding programs such as RepeatModeler and RepeatScout, to separate subgenomes by identifying differential repetitive elements.
Using the pseudomolecule-anchored scaffold set, chosen because of a lower dispersion in subsequence length as compared to the unanchored scaffolds, 300 differential repeats were identified, then used to separate the subgenomes of N. tabacum following the same approach used for k-mers (Fig. 2d-f). A total of 2.3 Gb out of 2.9 Gb, 79.5% of the total pseudomolecule-anchored genome sequence, was successfully assigned to the S and T subgenomes that were also in agreement to progenitor-mapped labels. The subgenome assignments from the repeat analysis demonstrated 0.87 agreement with the progenitor mapped labels (Cohen's Kappa) (Fig. 2 d-f).
In addition, polyCRACKER identified enriched subclasses of repeat elements that contributed significantly to the subgenome assignments by comparing the categorical distribution of the informative differential consensus repeats to the null distribution, and assigning a high chisquared value to repeat subclasses that are over or under represented in the group of informative differential repeats. The top subclass, unknown, with a highest chi-squared value of 780.74, was found. The same analysis was done for top subclass two, Simple Repeats, and top subclass three, LTR/Gypsy (Additional file 1: Table S3).

Discussion
The ability to accurately separate closely related genomes or subgenomes without prior knowledge of their composition, relationship or a database of known species is a significant advancement for the study of polyploid genomes. We analyzed test datasets comprised of multiple closely related species and real polyploid genomes to demonstrate the ability of polyCRACKER to separate genomes ranging from small fungal genomes with little repetitive DNA (Additional file 1: Table S1) to enormous repeat-rich polyploid plant genomes ( Table 2). By enabling subgenome-specific analysis polyCRACKER can accelerate rapid advancements in our understanding of polyploid genome evolution and regulation. In particular, it will enable functional studies of genes according to subgenome, promote a deeper understanding of genome dominance and enable comparative studies of diverse polyploids to discern any rules governing polyploid genome evolution.
PolyCRACKER will be particularly useful for the latter by allowing analysis of polyploid species that do not have advanced genetic resources like genetic maps. It should be noted that our overall primary goal for developing  polyCRACKER is beyond simply assigning sequences to their subgenome within the modern allopolyploid. Assignment of sequences to their location within the modern allopolyploid can be accomplished by ever improving technologies including optical maps, genetic maps, Hi-C data, and various long-read sequencing technologies. Rather, the goal of polyCRACKER is to assign sequences to their ancestral progenitor subgenome and clarify what sequences were donated by which ancestral progenitor (even a possibly extinct progenitor). Poly-CRACKER is not limited to the analysis of only genomes with long-range contiguity or scaffolding, as demonstrated by our use of fragmented draft genome assemblies. However, if the analyzed genome does have sufficient long-range information, additional insight can be gleaned in the form of identification of translocations between the original progenitor genomes. As shown in Fig. 3, modern chromosomes of tobacco are a mosaic of the original progenitor genomes. While this is known for tobacco, due to the availability and prior identification of its progenitor species, for many allopolyploid species the progenitors of origin are unknown or extinct. Therefore, polyCRACKER may be particularly useful when applied to genomes with long-range information.
In allopolyploid genomes with long-range scaffolding, polyCRACKER can uncover the frequency and nature of chromosomal translocations between ancestral progenitor genomes. Nonetheless, biological translocations within the evolutionary history of a species is not the only means by which chromosomal mosaics of ancestral progenitors is observed. Indeed, errors in genome assembly of allopolyploids in which a subsequence of one homeologous chromosome is inserted in place of the true sequence often produces the same observation. This occurs frequently in allopolyploid genome assemblies as many sequences within homeologous chromosomes are virtually identical (with the exception of transposons, viral DNA and other repetitive sequences). Thus poly-CRACKER may be useful check for evaluating assembly accuracy and identifying potential mis-assemblies in allopolyploid genomes. Repetitive DNA is a major component of eukaryotic genomes. While polyCRACKER exploits the rapid turnover and evolution of repetitive elements to separate closely related genomes, it can also be used as tool to study the origin, evolution, and impact of repetitive Fig. 6 Unsupervised grouping of Triticum aestivum subsequences by progenitor of origin using repeat-k-mers. a PCA labeled by database-free grouping of 250 kb genome segments by polyCRACKER (Subgenomes a, d and ss are colored green, blue, and red respectively). b Spectral embedding of Triticum aestivum dimensionality reduced information (reduction to 4-dimensions, visualized in 3-dimensions), in which edges (grey lines) represent shared repeat-mer profiles that connect genome sequences. c Database-free grouping of a and d subgenomes by polyCRACKER, and (d) sequences colored by their similarity to T. urartu (green) and A. tauschii (blue) DNA on the entire genome. We used the differential kmers identified by polyCRACKER to assign assembled repeats to subgenomes. This also placed the k-mers into the context of intact repetitive elements. We also demonstrated that subgenome-specific repetitive elements could be used to bin subsequences by subgenomes.
Existing unsupervised methods used to separate subgenomes were developed to bin genomes from metagenome samples and are ill-suited for classifying allopolyploids because the homeologous chromosomes often have more similar overall k-mer profiles than the profiles of chromosomes within their respective subgenomes. In contrast to previous algorithms and approaches for grouping sequences by species of origin, we focus on the fact that even closely related species often differ in the frequency of repeat sequences present throughout their genome. This is particularly useful for polyploid subgenomes because such subgenomes are usually so closely related that they do not differ in coarse genome features like tetranucleotide frequency, GC content, or codon usage. The k-mers used by polyCRACKER may correspond to sequences associated with transposons, viruses, and other parasitic sequences that have proliferated in one species relative to another. PolyCRACKER groups scaffolds into bins that correspond to distinct species by: 1) counting the number of instances of short and/or long repeat k-mer sequences, 2) creating a sparse matrix of number of instances of kmers by the subsequences on which they occur, 3) reducing the dimensionality of the sparse matrix by one of several known (Table 3) applying a clustering method (Table 3). PolyCRACKER can then further increase signal by taking the resulting binned sequences and 1) counting k-mer frequency, 2) identifying k-mers that are enriched in one bin versus other bins, and 3) use the presence of enriched k-mers on previously ambiguous sequences to group them into the appropriate species bin.
Biased loss of repetitive sequence is often observed in allopolyploids and in the case of N. tabacum, repetitive sequence appears to have been preferentially lost from the T-subgenome [17]. Indeed, both polyCRACKER and database-dependent classification of N. tabacum confirm that the S-subgenome is substantially larger than the Tsubgenome as suggested by others previously, and consistent with biased repetitive sequence loss from the T subgenome. We also identify several classes of repeats that are enriched in each subgenome. Our observation of significant enrichment of simple repeats in the Tsubgenome is consistent with the previous observation that some classes of satellite repeats are several-fold more prevalent in N. tomentosiformis than in N. sylvestris [17,21]. Our observation of significant enrichment of tranLTR/Gypsy and unknown elements in the Ssubgenome of N. tabacum is consistent with the prior suggestion that N. sylvestris, but not N. tomentosiformis, had recent bursts of repeat element proliferation, likely involving Gypsy transposable elements [17]. In conclusion, polyCRACKER is a robust method for classifying sequences according to species of origin that will be important in future studies of allopolyploids.

Methods
We describe a method to group sequences belonging to the same subgenome that consists of connecting sequences within a graph in which edges are shared profiles of repetitive k-mers. The network graph is established by considering a user-supplied number of regions with closest repeat distributions (number of nearest neighbors). The distance between sequence nodes in the graph is indicative of the level of shared k-mers. Thus, polyCRACKER groups sequences by species of origin by connecting nodes, representing sequences, by edges that represent the common occurrence of repetitive DNA sequences unique to that species or subgenome. This technique was applied to highly fragmented draft genome assemblies or simulated assemblies. Scaffold lengths were normalized by splitting larger sequences into shorter subsequences. PolyCRACKER identifies potentially informative k-mers that occur above a minimum frequency. This greatly reduces the number of k-mers that are subsequently mapped to the scaffolds to determine their per scaffold frequency.
PolyCRACKER creates a sparse matrix, in which rows correspond to fragments and columns are unique kmers, and each intersection contains the frequency of each unique k-mer on each scaffold subsequence. Then, it performs dimensionality reduction on the sparse matrix, projecting the high-dimensional data into a lower-dimensional space, where each subsequence is represented by a point in the lower dimensional space. At this point we can visualize the data in threedimensional graphical plots (Figs. 2 and 6). In order to assign sequences to species or subgenomes, poly-CRACKER performs unsupervised clustering on the low-dimensional data via clustering. Any ambiguous sequences are removed, but may be used later through a signal amplification method described below.
To assign additional subsequences to species bins, polyCRACKER identifies highly differential k-mers between the initially grouped scaffolds and uses these kmers to recruit the remaining, ambiguous subsequences. Similar to above, we identify unique k-mers for each preliminary group of scaffolds, and then identify k-mers that differentiate those already grouped sequences by comparing the number of occurrences of a particular kmer in one group versus the others. K-mers that occur frequently in one group, but not in at least one of the others, past a certain threshold, are output to a FASTA file as differential k-mers for that group. PolyCRACKER maps each set of the differential k-mers against the entire set of scaffolds in the assembly. We then find the total counts for the sum of all differential k-mers of a particular group for each subsequence. The results of the aforementioned step may be plotted across entire chromosomes (Fig. 3) as a cross-check for the test case, which was done using shinyCircos [22] (polyCRACKER formats the input data for shinyCircos). When the total counts of an inferred subgenome's differential k-mers in a region are substantially larger than that of the differential k-mers of another inferred subgenome in that region, polyCRACKER extracts the binned sequences and stores them in FASTA format. The now larger set of grouped subsequences can be used as the new input for another round of differential k-mer analysis to recruit more subsequences. The previous steps (recruiting differential k-mers and binning sequences) are repeated until the number of iterations reaches a user-defined cutoff, and the user can select the species bins from any iteration of the binning process. The final results and spectral graph can be visualized in several ways, including force-directed physics simulations of the Knearest-neighbors graph to visualize changes in the data's structure over time. An analogous approach can be taken for binning sequences by repeat elements, although the k-mer analysis results must be used as an input for that algorithm to identify the initial differential repeat elements.
Below is a high level overview the methodology: Break the assembly into fragments of a fixed size, keeping track of the scaffolds from which these fragments came from. Bin the fragments by subgenome. Optionally, merge the binned fragments back together based on their scaffold of origin (or bin the entire original scaffold based on the total counts of differential repetitive sequence). In addition, there is an option available to assign remaining unassigned fragments to a subgenome bin based on the consensus assignment of all its neighboring fragments in the original assembly.

Sensible recommendations for selection of PolyCRACKER parameters
While the search space of polyCRACKER's parameters remains fairly unexplored, the algorithm provides a few functions that can make recommendations on how to set the parameters to achieve reasonable performance. Of vital importance is making sure that there is not too little repetitive content and not too much. If there is not enough repeat content included in the genome fragments, then they will be hard to bin. The function num-ber_repeatmers_per_subsequence plots a histogram of the number of repeat-mers present in each chunked genome fragment. If this histogram is skewed to lower k-mer counts in each fragment, then reducing the k-mer size, increasing the length of the fragments, decreasing the minimum genome-wide k-mer frequency for k-mer inclusion, amongst other changes can help increase the repetitive content of each fragment, thereby improving the initial binning of these fragments into subgenomes. There exists help documentation in the polyCRACKER source code repository that can help inform about sensible binning practices. Too much repetitive content can be problematic because analyzing the excessive number of repetitive k-mers is computationally demanding, and reductions in the k-mer size could lead to the inclusion of k-mers that are not unique to a particular subgenome. This can be dealt with by subsampling as described below.
Propagating PolyCRACKER labels via the relative position of fragments Since polyCRACKER performs best when the length of the input sequences are similar, we split scaffolds into fragments of similar length as we did when we prepared the N. tabacum input dataset. However, it is still possible to use prior information about the location of the fragments in scaffolds. For example, if one or several fragments belonging to a scaffold are classified to a subgenome, it is likely that other subsequences from the same scaffold are also associated with that same subgenome. PolyCRACKER uses this information by passing subgenome labels of classified subsequences through a semi-supervised label propagation algorithm that takes into account labels as well as relative genomic positions from which subsequences are derived to infer the labels of the unclassified subsequences. Conversely, when interrogating unvalidated assemblies, discordance between scaffold labels and species assignment may indicate assembly errors that can be fixed by breaking scaffolds.

Random sampling of repeat k-mers for datasets too large to include all k-mers
PolyCRACKER can randomly subsample repeat k-mers when the data set is too large to include all k-mers with available computational resources. For example, there were 550,790,662 unique k-mers with frequencies between 3 and 100 within the Triticum aestivum genome. A matrix 550, 790,662 columns by many thousand rows is too computationally intensive to construct and subsequently analyze. We therefore randomly sampled 2,420,450 k-mers to generate initial partitions of the subgenomes, and further subsampled differential k-mers to obtain 568,637 k-mers for the final partitioning of Triticum aestivum subgenomes.

Subgenome validation for complex allopolyploid species
For validating polyCRACKER subgenome groups in the context of the N. tabacum and Triticum aestivum genomes in which subgenomes were not already labeled, we used the progenitor species to assign species of origin to respective sequences. The PolyCRACKER progenitor-Mapping function exploits seal.sh function within the bbtools toolkit (sourceforge.net/projects/bbmap/) to bin subsequences based on reference progenitors, and serves as a tool to compare polyCRACKER to reference-based binning solutions. Progenitor mapped labels and binned sequences are output from this analysis and can be visualized on the PCA or the spectral embedded data. If polyCRACKER is used to separate species from an input genome instead of subgenomes, the original species labels for the subsequences can be recovered.

Quantitative analyses on the classification and clustering polyCRACKER results
The polyCRACKER function final_stats was used to perform quantitative analyses on polyCRACKER's classification and clustering results, with comparisons to the results achieved by the reference-based progenitorMapping and to the ground truth species extraction with known species. For a classifier comparison of poly-CRACKER's results to the progenitor mapped subgenomes, the lengths of each subgenomes from both analyses were found and compared to each other and the original progenitor subgenome sizes, and amongst other measures, Cohen's Kappa and Jaccard. Similarity between the amount of unambiguous sequence shared between both analyses were calculated as measures of agreement between the reference based and unsupervised methods. For accuracy measures of polyCRACKER versus ground truth labeling, such as when the subgenomes or species of each region/chunk are already known, the amount of sequence correctly classified, precision, recall, and f1 scores were reported for each species/subgenome and averaged, amongst other reported metrics. Confusion matrices for each of the two classification test types are output from the analysis. Analyzing the clustering results, total sequence lengths for each cluster is found and Silhouette and Calinski-Harabaz scores can be calculated if supplying the original PCA data along with the final polyCRACKER output labels. If either ground truth labels or progenitor mapped labels are supplied, some measure of agreement, homogeneity, and completeness between the two results can be found.

Repeat annotation and analysis in N. tabacum
PolyCRACKER studies annotated repeats that are identified using RepeatModeler, which uses the de-novo repeat finding programs RECON [23] and RepeatScout [24] to employ complementary computational methods for identifying repeat element boundaries and family relationships. RepeatModeler builds, refines and classifies consensus models of putative interspersed repeats from RECON and RepeatScout output. We further filtered repeat annotations using Pfam, and PANTHER annotations of predicted repeats. The final non-redundant de novo repeat database was then used with RepeatMasker to annotate repeats across the genome sequence.
Subgenome classification using genome-wide annotated repeats in N. tabacum Genome-wide RepeatMasker annotation of repeats was used with TE_cluster_analysis to generate a sparse matrix of the subsequences by their distribution of repeats belonging to a particular family, family member, class, and subclass (had a specific consensus sequence label) via repeatGFF2Bed and sam2diffk-mer_clusteringmatrix. Each row of this matrix was labeled by poly-CRACKER's k-mer analysis final subgenome labeling, and the highly informative repeats were found by calculating the chi-squared statistic of each column of the matrix. PolyCRACKER transformed this matrix into a matrix depicting the total counts of a particular repeat for each subgenome (the differential repeat subgenome matrix), and labeled a repeat by a subgenome if it was differentially present in one subgenome above a userspecified threshold. The consensus repeats were assigned a higher chi-square statistic if their assembly frequency was indicative of greater dependence between repeat and species label and polyCRACKER denotes these repeats as highly informative. The consensus repeats with the highest chi-square statistic were chosen from the differential repeats to find the most informative differential repeats, and the same number of repeats were selected from each subgenome for subsequent analysis. To cluster and extract the subgenomes using a repeat count matrix, this matrix was fed into subgenome_extraction_ via_repeats, which finds the most highly informative differential repeats, feature selects the matrix by these repeats, performs dimensionality reduction using PCA, finds a neighborhood community graph and clusters the resulting dataset via Spectral Clustering. Signal amplification is employed on these clusters analogous to the k-mer analysis in order to iteratively recruit more subsequences to the final bins. PolyCRACKER also employs multiple algorithms to discover repeat element subclasses that are important and informative for differentiating the subgenomes, as discussed in the Additional file 1.
Distinguishing species of origin using de novo repeat annotations PolyCRACKER grouped N. tabacum's pseudomolecule-anchored assembly (broken into 250 kb fragments) by species of origin using de novo repeat annotations in a similar method as used for k-mers. A matrix of scaffolds vs repeats was constructed. Differential repeats were identified as defined by a fivefold greater presence in one subgenome versus the other. Subgenome labels found via polyCRACKER's kmer based binning were then used to identify highly informative repeats (features with high chi-squared value). Intersection of differential repeats with the 150 most informative repeats from each subgenome yielded 300 informative differential repeats. Poly-CRACKER then performed feature selection on the matrix, keeping columns/repeats corresponding to highly informative repeats. PolyCRACKER performed principal component analysis and clustered the feature selected matrix to find new labels of scaffolds and to see if repeats can distinguish subgenomes. In this clustering process, nodes, representing scaffolds, were linked by edges that connected two nodes with similar consensus repeat coverage/content, and Spectral Clustering was used to perform normalized cuts of this graph to form initial partitions. Using the same process as outlined for k-mers, differential repeats were identified between the initial sequence bins by being eight or greater times present in one genome versus the other, and then they were used to recruit more subsequences via the signal amplification process, finding the total hits of these differential repeats in each bin. PolyCRACKER reassigned subgenome labels if their ratio of hits of the total differential repeat counts is greater than a threshold, in this case 3 times greater. The process of identifying differential repeats and then re-binning by the ratio of total repeat counts for each scaffold is reiterated via a bootstrap process until convergence, in which after a number of trials, the amount of sequence assigned to each bin does not appear to make large changes, converges on a set amount of sequence, and the number of times this process can be run is left to the user's discretion.

Identification of informative repeat classes
Consensus repeats, identified during de novo repeat finding and labelled by unique identifiers were identified by polyCRACKER as differential if the number of identified instances of that consensus repeat were five or more times more frequent in one subgenome versus the other. Out of all differential repeats identified, polyCRACKER selected 400 (200 from each subgenome) of these repeats (from the unanchored N. tabacum scaffolds) as subgenome signatures. The repeats were grouped together by their subclass annotation. Specific subclasses of repeats were found by polyCRACKER to be highly informative if their number of informative differential consensus repeats was statistically significantly under or over represented versus a similar distribution of the consensus repeats across the entire genome (via a chisquared statistic).

Additional file
Additional file 1: Supplementary data. This files contains supplementary data about polyCRACKER performance on simulated datasets comprised of mixtures of subsequences from multiple closely-related species and highly differential repeat subclasses between N. Tabacum subgenomes.