Signal enrichment with strain-level resolution in metagenomes using topological data analysis

Background A metagenome is a collection of genomes, usually in a micro-environment, and sequencing a metagenomic sample en masse is a powerful means for investigating the community of the constituent microorganisms. One of the challenges is in distinguishing between similar organisms due to rampant multiple possible assignments of sequencing reads, resulting in false positive identifications. We map the problem to a topological data analysis (TDA) framework that extracts information from the geometric structure of data. Here the structure is defined by multi-way relationships between the sequencing reads using a reference database. Results Based primarily on the patterns of co-mapping of the reads to multiple organisms in the reference database, we use two models: one a subcomplex of a Barycentric subdivision complex and the other a Čech complex. The Barycentric subcomplex allows a natural mapping of the reads along with their coverage of organisms while the Čech complex takes simply the number of reads into account to map the problem to homology computation. Using simulated genome mixtures we show not just enrichment of signal but also microbe identification with strain-level resolution. Conclusions In particular, in the most refractory of cases where alternative algorithms that exploit unique reads (i.e., mapped to unique organisms) fail, we show that the TDA approach continues to show consistent performance. The Čech model that uses less information is equally effective, suggesting that even partial information when augmented with the appropriate structure is quite powerful.


Background
A metagenome is a collection of genomes in a microenvironment, such as the gut of an animal, bottom of an ocean, or soil. This captures the influence of the immediate environment on the phenotype of an organism. For instance, one of the factors in the safety of our food supply chain is knowing the microbiome in the food [1]. The state of disease and health of a host has been shown to be related to the microbiomes in its gut [2]. The sturdiness *Correspondence: parida@us.ibm.com 1 Computational Biology Center, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA Full list of author information is available at the end of the article or weakness of a plant is shown to be related to its soil microbiome [3]. It is turning out that these microorganisms are perhaps playing a much bigger role than earlier anticipated. The DNA technology to capture these organisms also has been disruptive in the area of microbiology, i.e., each organism does not need to be cultured individually before sequencing but the entire volume of samples can be put through the sequencing process en masse [4]. The obvious advantage is that the recalcitrant organisms that were resistant to being cultured no longer pose a problem, as long as careful sample processing is performed to avoid sequencing biases [5]. However, there are two major challenges with the sequencing approach.
Firstly, the completeness and correctness of reference databases limits the power of detection. The database of reference sequences must systematically be updated when new genomes become available. Secondly, no matter how complete and accurate the databases are, there is the problem of correctly assigning sequencing reads when distinct organisms with very similar genomes are present. In this paper, we address the second challenge of accurately detecting the organisms present in a sample, given a database.
Characteristics of the short sequencing reads frequently results in them being mapped to multiple reference genomes in the database, even under very strict matching criteria. Thus when the database organisms are very similar to each other there is usually a substantial dearth of reads assigned to unique organisms. As a consequence, most solution pipelines yield mapping results that are riddled with false positives. In fact, in (microbial) simulation studies we find that often a large percentage of the predicted potential organisms, using standard pipelines from literature, are false positives. A recent benchmark study found the number of species reported for the same sample varying by orders of magnitude, depending on the classifier used [6]. Popular methods for metagenomic read classification include Kraken [7], CLARK [8], MetaPhlAn [9] and others.
Topological data analysis (TDA) is emerging as a promising approach for analyzing large genomic datasets for a variety of questions [10][11][12], with already demonstrated applications in evolutionary biology, cancer genomics, and analysis of complex diseases. TDA extracts information from the geometric structure of data; in our application the structure is defined by the relationships between sequencing reads and organisms in a reference database.
Based primarily on the patterns of co-mapping of reads to the organisms in a reference database, we use a subcomplex of a Barycentric subdivision complex to model the multi-way maps of reads, along with the extent of read coverage on the respective organisms. This subcomplex allows a natural mapping of our problem to homology computation and interpretation. We test this approach on the special scenario (dearth of uniquely mapped reads) and find that using an appropriate voting function on the typical bar diagrams of TDA, we can sort the organisms to separate true positives from false positives. Next, we test if a reduced information set, that is, only the number of reads without utilizing their lengths or coverage when combined with TDA is powerful enough to enrich for true positives. We observe success in this scenario as well, suggesting that the use of topology captures the non-obvious structure defined by the reads promiscuously mapping to multiple organisms. All the data used in the paper are simulated reads by necessity since it is the most reliable way of knowing the ground truth, to assess the results.

Methods
In this section we map the problem to topological data analysis, more specifically to bar diagrams arising from persistent homology.

Motivation and problem statement
Given a reference database and a collection of metagenomic reads from the sequencing process, we denote by Z the set of all possible organisms, resulting from a reads-toorganism-mapping (ROM) pipeline: Z = {X 1 , X 2 , ..., X N }. We treat all the reads of the pipeline output equally, i.e., do not consider the extent of the match. The rationale is that these characteristics have already been used by the pipeline to produce the output. Hence we focus on the subsequent information gathered from the pipeline: the relationship between the reads an the organisms.
Consider the bipartite graph of reads and organisms, where an edge connects a read to the assigned organism, that captures the ROM relationships completely. But a much more natural fit is to a simplicial complex where a k-simplex, k > 0 (analogous to edge in a graph) is a read and the 0-simplex (analogous to a vertex in a graph) is an organism. Furthermore each of the the k-simplices are weighted, i.e., the weight is the number of reads that map to exactly these k organisms, which is not immediately apparent from the bipartite graph. The next question is whether the topological features of this naturally emerging simplicial complex can be exploited to solve the problem of false positives. In other words, is there some hidden characteristics of the ROM that separate the true from the false organisms.
In this paper, to make it more tractable, we simplify the problem as follows: Given the ROM pipeline output, obtain an order for the elements of Z as X i 1 , X i 2 , ..., X i N , such that the top of the list is enriched for true positives (TPs). In a practical scenario, a threshold can be used to snip away the bottom of the list to eliminate potential false positive candidates. Further, the retained true positives can be used to rescue and re-assign the reads that had been misassigned due to the false positives.

Model I: Barycentric subdivision
Each organism X j has a collection of reads mapped to it. For each subset Y = {X 1 , . . . , X k } ∈ 2 X , we denote by S Y the set of reads mapped to the organisms X 1 , X 2 , ..., X k , and none other.
For instance, this implies that S {X 1 ,X 2 } ∩ S {X 1 ,X 2 ,X 3 } = ∅. In general, the area a Y is defined to be the sum total of all the lengths of the reads in S Y .
For Y ∈ 2 X and X j ∈ Y we denote the length of the genome of the organism X j covered by the reads in S Y by l Y ,X j , and define This is motivated by the fact that if there is true overlap in the m organisms, then the depth d of coverage of each organism is magnified by some factor of m. Note that it is quite possible that for Fig. 1 and the values of d Y for each set Y ∈ 2 X are given in Fig. 2

(top).
The depth values were computed after simulating a small set of reads' placement on three reference genomes.
Note that the sets (X i ) 1≤i≤N give a cover of the set of reads, while the sets (S Y ) Y ∈2 X form a partition of the reads. This corresponds to the collection of organisms present at a time t, mapped to some stage of a filtration of a simplicial complex (see Definition 5).

Definition 4 Given any finite set S, a simplicial complex with vertices in S, is a collection K of subsets of S, with the property that if
In other words, K is closed under the operation of "taking subsets". This last property is crucial for the definition of the simplicial homology groups, H * (K), associated to a simplicial complex K.
This particular setup allows us to apply persistent homology to our problem.
To any finite simplicial complex K one can associate for each j ≥ 0, a finite dimensional vector space H j (K, Q) (called the j-th simplicial homology group of K). Moreover, if we have a filtration of a simplicial complex K The images of these linear maps are usually called the persistent homology groups of the filtration, and their dimensions (i.e. the ranks of the maps i s,t, * ) determine the so called "bar diagram" associated to the filtration. Intuitively, the dimensions of the vector spaces H j (K, Q) (also called the j-th Betti number of the simplicial complex K) measure the number of independent j-dimensional cycles which are not boundaries of any (j + 1)-dimensional subcomplex of K (so called j-dimensional holes), and the bar diagram of the filtration is a record of the "times" of the births and deaths of these "homology classes", where we think of the sequence (K t ) t∈[0,N] as a complex growing with time t. Each bar in the bar diagram represents the interval in time in which a homology class persisted. We refer the reader to the book by Edelsbrunner and Harer [13] for further details about persistent homology and its use in Topological Data Analysis.
However, notice that for each t ∈ R, t ≥ 0, the set system (Y t ) Y ∈2 X does not necessarily satisfy the condition of being closed under taking subsets -and hence, does not necessarily form a simplicial complex with X as the set of vertices.
Instead, we utilize the following construction (a simplicial subcomplex of a barycentric subdivision, detailed at the beginning of this section) that does produce a simplicial complex (in fact a filtration of complexes), and which is also naturally aligned with the various functions d Y (·) defined earlier.
Intuitively, the simplices of the barycentric complex correspond to chains of subsets of X . For example, if X 1 , X 2 , X 3 ∈ X , then the sequence {X 1 } ⊂ {X 1 , X 2 } ⊂ {X 1 , X 2 , X 3 } is a chain in the partially ordered set 2 X (ordered by inclusion). Given t ≥ 0, we include the chain

Barycentric subdivision and its subcomplex of interest
We now define more precisely the barycentric complex, and the subcomplex we will use.
The first barycentric subdivision of X of X (see Fig. 3) is then defined by Now let X be a finite set, and let d = (d Y : Y → R) Y ∈2 X be a tuple of maps. For each t ∈ R, we denote by X ,d (t) the subcomplex of X defined as follows.
and set

Since Y d (t) is not empty, Z d (t) is not empty as well, so Z ∈ X ,d (t).
Now assume that t i > t j for some i, j ∈ {0, . . . , N}, and let Y ∈ X ,d (t i ). We have that if Y d (t i ) = ∅ then there exists some X such that for all

Filtration
Armed with Fact 1 above, we have the following definition.

Definition 6 Define the filtered simplicial complex
This definition means that each of the simplicial complexes X ,d (t) can be thought as a particular point in time of the filtration X ,d .
While the constructions so far deal with theoretical considerations, for actual computations we use the following remarks, which were in fact used to compute the filtration values for Example 1.
Define t 0 by Then t 0 is the time of addition of σ to X ,d .

Remark 2
To make X ,d covariant with respect to t, we use the change of variable t = m − t where m denotes the maximum value attained by the functions d Y .
From now on, unless stated otherwise, we will use the aforementioned change of variable.
Voting Scheme. Next we develop a voting scheme for all organisms that aims to solve the problem of ordering the organisms from true to false positives, using tools from algebraic topology. We refer the reader to Subsection 2.6 and Section 3 of [14] for the definition of persistent homology of a filtered simplicial complex, and a characterization in terms of P-intervals respectively. In particular, using Corollary 3.1 of said article we have the following definition.

Definition 7
Let K be a filtered simplicial complex and i ≥ 0 an integer. We define the i-th degree bar diagram of K as the collection of P-intervals associated to the i-th degree persistent homology of K.
We now consider the persistent homology of the filtered complex X ,d . For 0 ≤ i ≤ |X | − 1 let B i be a set of generators for the i-th degree persistent homology of X ,d , with fixed representing cycles, and B = B i . We can put these generators in a bijection with bars of the bar diagram of X ,d .
To help motivate the definition of the votings for each organism, we first list some of the components of the final formula. At each degree i, we assign to each organism X a collection of generators in B i such that these generators witness X as a contributor to the features of our complex. For instance, for each X, consider all B such that X appears in its representing cycle (remember that such cycle is a linear combination of simplices, containing organisms as vertices). This gives rise to functions F i : For each generator B ∈ B, let start(B) and end(B) denote its beginning and end as a bar. Consider now a strictly decreasing function f : R ≥0 → R ≥0 with f (x) → 0 as x → ∞; this function will modulate the contributions of each bar to the voting value of a given organism along the filtration value, while a decreasing function g : R ≥0 → R ≥0 will modulate the contributions along homology degree.
With all this in place, the vote v(X) for organism X is computed as follows: Examples of f , g include f (t) = 1 t+1 and g(i) = 1 i+1 , which is what we used to obtain the results in this paper.
Continuing our example from Figs. 1and 2 (bottom) shows the resulting bar diagram with associated generators after application of the mentioned procedure on the depth values. In this case there is only non-trivial homology in 0-th degree. After applying the voting scheme described above on the bar diagram shown in Fig. 2, the votes are v(A) = 1.00, v(B) = 0.14, v(C) = 0.20.

Model II:Čech complex
Recall that Z = {X 1 , X 2 , ..., X N } is the set of all possible organisms, resulting from an ROM pipeline. Let Y = S X 1 X 2 ..X k ∈ 2 Z be the set of reads associated with organisms X 1 , X 2 , ..., and X k . Note that for all i, j, k, ..., The t (time) order filtration is t max down to t min in say steps of 1. A k-simplex on X 1 , X 2 , ..., X k , X k+1 at time t is introduced if |S X 1 X 2 ...X k X k+1 | ≥ t. Note that if the k-simplex on X 1 , X 2 , ..., X k , X k+1 belongs to the complex, then so does each (k − 1)-simplex on X 1 , X 2 , ..., X k , X k+1 since, based on Eq. 7, for 1 ≤ i ≤ k, Each node or organism X gets a true-positive score v(X) as follows: In the bar diagram, let b be the h-simplex = X 0 X 1 ...X h with bar length denoted as len(b). In this model, the vote v(X) for organism X is computed as follows: A simple example with four organisms is shown in Fig. 4.

Results
We applied the model on simulated shotgun sequencing reads from a collection of 36 recently published Salmonella genomes [15], in an effort to study the applicability of the approach to strain-level detection.
We simulated 150 bp paired end reads, 100,000 per each input genome, with dwgsim [16] (with parameters y0, e0.005, E0.005, d500, s0, r0.001, R0.15, X0.3) from the 36 genomes. The reads were mapped to a database consisting of the same set of genomes using bowtie2 [17] (very-sensitive-local mode, searching for up to 101 hits per read). Read simulation and mapping was performed with the Metagenomics Computation and Analytics Workbench (MCAW) [18]. The mapped reads were processed with custom scripts to prepare the Barycentric depth values andČech set sizes for each set of organisms. Concordance of paired reads was checked (both reads of a pair mapping to the same organism), a random representative selected if the organism had several hits from the same read, and only the best In this proof of concept we focused on the set of reads shared among 1 (unique reads), 2, or 3 organisms; the simulated data had 1-3 truly present organisms.
Indicative of the shared genome content between the closely related sequences, between only 2,087 to 92,176 of the 100,000 reads simulated per genome uniquely mapped to that genome after the process described above. Looking only at unique read counts per genome would certainly yield erroneous order of certain strains. For example, we observed a simulated mixture of 3 strains where two false positives (FPs) had more unique reads (2,596 and 2,427) than one of the true positives (TPs) (2,105). In this case, relying only on the number of unique reads would miss the truly present strain and falsely indicate the presence of missing ones. The observed unique reads for false positive organisms can arise from sequencing read errors, effects of the bioinformatics pipeline, and from subtle differences between the reference genomes and the observed genomes.
For analysis, the 36 Salmonella genomes were split into 18 non-overlapping sets of 2 strains each, and into 12 sets of 3 strains each. The TDA methodology, including the voting schemes described in Eqs. 6 and 8, was applied to the simulated reads from each set. The resulting voting values v for each organism were ordered from large (indicating truly present organisms) to small (indicating potential false positives). The voting values for simulated mixtures of 1, 2, or 3 Salmonella strains are visualized in Fig. 5 for the Barycentric approach and in Fig. 6 for theČech approach. Ordering the organisms by voting values perfectly delineates the TPs from FPs with both methods. As a comparison, ordering the organisms by read coverage (computed by bedtools [19]) of the same reads as in the TDA input, the TPs could also be separated from FPs. However, as discussed earlier, using only the uniquely mapping reads would lead to false positives. We demonstrated that the TDA approach solves the problem of enriching the top of the list of voting values for truly present organisms.

Discussion and conclusions
In this proof of concept study we apply topological data analysis to the problem of separating signal from noise in the analysis of frequently multi-mapping metagenomic sequencing reads. Our approach is based on the construction of a particular subcomplex of a Barycentric subdivision complex, to rank-order the potential organisms and tease out the truly present ones.
The results from applying the approach on simulated genome mixtures show not just separation of signal from noise but also the potential for identifying microbes from metagenome samples, at strain level. We demonstrate the power of the TDA approach even in cases where alternative algorithms that exploit uniquely mapping reads fail. In fact, for the simple test cases the TPs bubble to the top from amongst a reference collection of highly related organisms, indicating promise of success for complicated real-life scenarios. TheČech model that uses less information is equally effective, suggesting that even partial information when augmented with the appropriate structure is quite powerful.
Additionally, the voting value curves show patterns of sharp decrease after the last true positive, suggesting automated calculation of cut-off thresholds. The same methodology could also be used to study higher taxonomic levels, e.g., separating true from false positive genera, families, etc. by modeling the read mappings across taxa.
Our next steps are to scale the implementation, and to apply it in the food safety as well as in the human health contexts. In both applications, a precise strain level assignment is of paramount importance.