In this section we map the problem to topological data analysis, more specifically to bar diagrams arising from persistent homology.

### Motivation and problem statement

Given a reference database and a collection of metagenomic reads from the sequencing process, we denote by *Z* the set of all possible organisms, resulting from a reads-to-organism-mapping (ROM) pipeline: *Z*={*X*_{1},*X*_{2},...,*X*_{N}}. We treat all the reads of the pipeline output equally, i.e., do not consider the extent of the match. The rationale is that these characteristics have already been used by the pipeline to produce the output. Hence we focus on the subsequent information gathered from the pipeline: the relationship between the reads an the organisms.

Consider the bipartite graph of reads and organisms, where an edge connects a read to the assigned organism, that captures the ROM relationships completely. But a much more natural fit is to a simplicial complex where a *k*-simplex, *k*>0 (analogous to edge in a graph) is a read and the 0-simplex (analogous to a vertex in a graph) is an organism. Furthermore each of the the *k*-simplices are weighted, i.e., the weight is the number of reads that map to exactly these *k* organisms, which is not immediately apparent from the bipartite graph. The next question is whether the topological features of this naturally emerging simplicial complex can be exploited to solve the problem of false positives. In other words, is there some hidden characteristics of the ROM that separate the true from the false organisms.

In this paper, to make it more tractable, we simplify the problem as follows: Given the ROM pipeline output, obtain an order for the elements of *Z* as \(X_{i_{1}}, X_{i_{2}},..., X_{i_{N}}\), such that the top of the list is enriched for true positives (TPs). In a practical scenario, a threshold can be used to snip away the bottom of the list to eliminate potential false positive candidates. Further, the retained true positives can be used to rescue and re-assign the reads that had been misassigned due to the false positives.

### Model I: Barycentric subdivision

Each organism *X*_{j} has a collection of reads mapped to it. For each subset \(Y = \{X_{1},\ldots,X_{k}\} \in 2^{\mathcal {X}}\), we denote by *S*_{Y} the set of reads mapped to the organisms *X*_{1},*X*_{2},...,*X*_{k}, and none other.

For instance, this implies that \( S_{\{X_{1},X_{2}\}} \cap S_{\{X_{1},X_{2},X_{3}\}} = \emptyset \). In general,

$$S_{Y} \cap S_{Y^{\prime}} = \emptyset, \text{whenever} Y \neq Y^{\prime}.$$

###
**Definition 1**

For \(Y \in 2^{\mathcal {X}}\), the area *a*_{Y} is defined to be the sum total of all the lengths of the reads in *S*_{Y}.

###
**Definition 2**

For \(Y \in 2^{\mathcal {X}}\) and *X*_{j}∈*Y* we denote the length of the genome of the organism *X*_{j} covered by the reads in *S*_{Y} by \(l_{Y,X_{j}}\), and define

$$d_{Y}(X_{j}) = \frac{1}{m^{2}}\left(\frac{a_{Y} }{l_{Y,X_{j}}}\right),$$

where *m*=|*Y*|.

This is motivated by the fact that if there is true overlap in the *m* organisms, then the depth *d* of coverage of each organism is magnified by some factor of *m*. Note that it is quite possible that for *Y*={*X*_{1},*X*_{2},…,*X*_{m}}, the values *d*_{Y}(*X*_{1}),*d*_{Y}(*X*_{2}),*d*_{Y}(*X*_{3}), …, *d*_{Y}(*X*_{m}) are all distinct.

###
**Example 1**

Consider a set of 3 organisms \(\mathcal {X}=\{A,B,C\}\). A graphical representation of the underlying mapped reads is shown in Fig. 1 and the values of *d*_{Y} for each set \(Y\in 2^{\mathcal {X}}\) are given in Fig. 2 (top).

The depth values were computed after simulating a small set of reads’ placement on three reference genomes. For example, in this case no reads match both *B* and *C* (and not *A*), hence *d*_{BC}(*B*)=*d*_{BC}(*C*)=0. However, some reads match all three organisms, hence *d*_{ABC}>0. The values *d*_{A}(*A*),*d*_{B}(*B*),*d*_{C}(*C*) indicate the depths of *A, B, C* based on their uniquely mapping reads.

###
**Definition 3**

**(set***Y*^{t}** for a***t***≥0)**For \(Y \in 2^{\mathcal {X}}\) and \(t \in \mathbb {R}, t \geq 0\), we denote

$$Y^{t} = \{ X_{j} \in Y \mid d_{Y}(X_{j}) \geq t\}.$$

Note that the sets (*X*_{i})_{1≤i≤N} give a *cover* of the set of reads, while the sets \((S_{Y})_{Y \in 2^{\mathcal {X}}}\) form a *partition* of the reads. This corresponds to the collection of organisms present at a time *t*, mapped to some stage of a filtration of a simplicial complex (see Definition 5).

###
**Definition 4**

Given any finite set *S*, a *simplicial complex* with vertices in *S*, is a collection *K* of subsets of *S*, with the property that if *A*∈*K*, then for all *B*∈2^{S} with *B*⊂*A*, *B*∈*K*. A subcollection *L* of *K* is a subcomplex of *K* if *L* is a simplicial complex.

In other words, *K* is closed under the operation of “taking subsets". This last property is crucial for the definition of the simplicial homology groups, *H*_{∗}(*K*), associated to a simplicial complex *K*.

###
**Definition 5**

A finite filtered simplicial complex is a finite sequence *K*_{0}⊂*K*_{1}⊂⋯⊂*K*_{N} of simplicial complexes. In other words, *K*_{i} is a subcomplex of *K*_{j} for *i*≤*j*. A filtration of a simplicial complex *K* is a filtered simplicial complex with *K*_{N}=*K*.

This particular setup allows us to apply *persistent homology* to our problem.

To any finite simplicial complex *K* one can associate for each *j*≥0, a finite dimensional vector space *H*_{j}(*K,Q*) (called the *j-th simplicial homology group of K*). Moreover, if we have a filtration of a simplicial complex *K* as above, then each inclusion *i*_{s,t}:*K*_{s}→*K*_{t},*s*≤*t* induces a *linear map*, *i*_{s,t,∗}:*H*_{j}(*K*_{s},*Q*)→*H*_{j}(*K*_{t},*Q*). The images of these linear maps are usually called the *persistent homology groups* of the filtration, and their dimensions (i.e. the ranks of the maps *i*_{s,t,∗}) determine the so called “bar diagram" associated to the filtration. Intuitively, the dimensions of the vector spaces *H*_{j}(*K,Q*) (also called the *j*-th Betti number of the simplicial complex *K*) measure the number of independent *j*-dimensional cycles which are not boundaries of any (*j*+1)-dimensional subcomplex of *K* (so called *j*-dimensional holes), and the *bar diagram* of the filtration is a record of the “times" of the births and deaths of these “homology classes", where we think of the sequence (*K*_{t})_{t∈[0,N]} as a complex growing with time *t*. Each bar in the bar diagram represents the interval in time in which a homology class *persisted*. We refer the reader to the book by Edelsbrunner and Harer [13] for further details about persistent homology and its use in Topological Data Analysis.

However, notice that for each \(t \in \mathbb {R}, t \geq 0\), the set system \((Y^{t})_{Y \in 2^{\mathcal {X}}}\) does not necessarily satisfy the condition of being closed under taking subsets – and hence, does not necessarily form a simplicial complex with \(\mathcal {X}\) as the set of vertices.

Instead, we utilize the following construction (a simplicial subcomplex of a *barycentric subdivision*, detailed at the beginning of this section) that does produce a simplicial complex (in fact a filtration of complexes), and which is also naturally aligned with the various functions *d*_{Y}(·) defined earlier.

Intuitively, the simplices of the barycentric complex correspond to chains of subsets of \(\mathcal {X}\). For example, if \(X_{1},X_{2},X_{3} \in \mathcal {X}\), then the sequence {*X*_{1}}⊂{*X*_{1},*X*_{2}}⊂{*X*_{1},*X*_{2},*X*_{3}} is a chain in the partially ordered set \(2^{\mathcal {X}}\) (ordered by inclusion). Given *t*≥0, we include the chain {*X*_{1}}⊂{*X*_{1},*X*_{2}}⊂{*X*_{1},*X*_{2},*X*_{3}} in the barycentric complex iff {*X*_{1}}^{t}∩{*X*_{1},*X*_{2}}^{t}∩{*X*_{1},*X*_{2},*X*_{3}}^{t}≠*∅*.

#### Barycentric subdivision and its subcomplex of interest

We now define more precisely the barycentric complex, and the subcomplex we will use.

Let \(\Delta _{\mathcal {X}} \) denote the \((|\mathcal {X}|-1)\)-dimensional simplex with vertex set \(\mathcal {X}\).

A sequence \(\boldsymbol {\sigma } = (Y_{0},\ldots,Y_{p}), Y_{i} \in 2^{\mathcal {X}}\) is called a *chain* of the poset \(2^{\mathcal {X}}\) (ordered by inclusion) if \(Y_{0} \subsetneq Y_{1} \subsetneq \cdots \subsetneq Y_{p}\). The first barycentric subdivision of \(\Delta ^{\prime }_{\mathcal {X}}\) of \(\Delta _{\mathcal {X}}\) (see Fig. 3) is then defined by

$$ \Delta^{\prime}_{\mathcal{X}} = \left\{\mathbf{Y} = (Y_{0},\ldots,Y_{p}) \mid \mathbf{Y} \text{is a chain in}\ 2^{\mathcal{X}}\right\}. $$

(1)

Now let \(\mathcal {X}\) be a finite set, and let \(\mathbf {d} = (d_{Y}:Y \rightarrow \mathbb {R})_{Y \in 2^{\mathcal {X}}}\) be a tuple of maps. For each \(t \in \mathbb {R}\), we denote by \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\) the subcomplex of \(\Delta _{\mathcal {X}}^{\prime }\) defined as follows.

For \(t \in \mathbb {R}\), and \(\mathbf {Y} = (Y_{0},\ldots,Y_{p}) \in \Delta ^{\prime }_{\mathcal {X}}\), let

$$ \mathbf{Y}_{\mathbf{d}}(t) =\bigcap_{0 \leq i \leq p} Y_{i}^{t} =\bigcap_{0 \leq i \leq p} \{X \in Y_{i} \mid d_{Y_{i}}(X) \geq t \}, $$

(2)

and set

$$ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t) = \left\{\mathbf{Y} = (Y_{0},\ldots,Y_{p}) \in \Delta^{\prime}_{\mathcal{X}} \mid \mathbf{Y}_{\mathbf{d}}(t) \neq \emptyset\right\}. $$

(3)

We continue with Example 1 of three organisms *A,B,C*. In this case, the vertices of \(\Delta ^{\prime }_{\mathcal {X}}\) are {*A,B,C,AB,AC,BC,ABC*}, where we write *A* for {*A*}, *AB* for {*A,B*} and so on. The top part of Fig. 2 implies that we have only 7 different values for our functions *d*_{Y}. So we only need to specify the following collection of such \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\) (taking into account Remark 2 below):

$$ {\begin{aligned} \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(0) &= \{[A]\}\\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.36) &= \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(0)\cup\{[B],[C]\} \\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.68) &=\Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.36)\cup \{[AB],[A,AB],[B,AB]\}\\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.86) &= \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.68)\cup\{[AC],[A,AC],[C,AC]\} \\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.99) &= \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.86)\cup\{[ABC],[C,ABC],[AC,ABC],[C,AC,ABC]\}\\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(1) &=\Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.99)\cup \{[A,ABC],[B,ABC],[AB,ABC],[A,AB,ABC],\\ & \hspace{3cm} [A,AC,ABC],[B,AB,ABC]\} \\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(1.36) &=\Delta^{\prime}_{\mathcal{X}}. \end{aligned}} $$

See Fig. 2 (middle) for a graphic representation of this filtered simplicial complex.

###
**Fact 1**

\(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\)is a simplicial complex with vertices in \(2^{\mathcal {X}}\). Moreover, for any sequence *t*_{0}>*t*_{1}>…>*t*_{N},

$$\Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t_{0}) \subset \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t_{1}) \subset \cdots \subset \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t_{N}).$$

###
*Proof*

If \(\mathbf {Y} = (Y_{0},\ldots,Y_{p})\in \Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\) is a chain, and \(\mathbf {Z} = (Y_{i_{0}},\ldots,Y_{i_{q}}), 0 \leq i_{0} < \cdots < i_{q} \leq p\), is a subchain, then it is straightforward to check that

$$\mathbf{Y}_{\mathbf{d}}(t) \subset \mathbf{Z}_{\mathbf{d}}(t). $$

Since **Y**_{d}(*t*) is not empty, **Z**_{d}(*t*) is not empty as well, so \( \mathbf {Z}\in \Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\).

Now assume that *t*_{i}>*t*_{j} for some *i,j*∈{0,…,*N*}, and let \(\mathbf {Y}\in \Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t_{i})\). We have that if **Y**_{d}(*t*_{i})≠*∅* then there exists some *X* such that for all *Y*_{i} in **Y** we have \(d_{Y_{i}}(X)\geq t_{i}> t_{j}\), therefore \(Y\in \Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t_{j})\). □

#### Filtration

Armed with Fact 1 above, we have the following definition.

###
**Definition 6**

Define the filtered simplicial complex

$$ \Delta^{\prime}_{\mathcal{X},\mathbf{d}} = \bigcup_{t\in\mathbb{R}} \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t). $$

(4)

This definition means that each of the simplicial complexes \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\) can be thought as a particular point in time of the filtration \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\).

While the constructions so far deal with theoretical considerations, for actual computations we use the following remarks, which were in fact used to compute the filtration values for Example 1.

###
**Remark 1**

Let \(\boldsymbol {\sigma } = (Y_{0},\ldots,Y_{p}) \in \Delta ^{\prime }_{\mathcal {X}}\), and let

$$\tau = \bigcap_{Y\in\boldsymbol{\sigma}} Y. $$

Define *t*_{0} by

$$ t_{0}= \max_{X\in \tau}\min_{Y\in \sigma} \lbrace d_{Y}(X)\rbrace. $$

(5)

Then *t*_{0} is the time of addition of *σ* to \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\).

###
**Remark 2**

To make \( \Delta ^{\prime }_{\mathcal {X},\mathbf {d}} \) covariant with respect to *t*, we use the change of variable *t*^{′}=*m*−*t* where *m* denotes the maximum value attained by the functions *d*_{Y}.

From now on, unless stated otherwise, we will use the aforementioned change of variable.

**Voting Scheme.** Next we develop a voting scheme for all organisms that aims to solve the problem of ordering the organisms from true to false positives, using tools from algebraic topology. We refer the reader to Subsection 2.6 and Section 3 of [14] for the definition of persistent homology of a filtered simplicial complex, and a characterization in terms of \(\mathcal {P}\)-intervals respectively. In particular, using Corollary 3.1 of said article we have the following definition.

###
**Definition 7**

Let *K* be a filtered simplicial complex and *i*≥0 an integer. We define the *i*-th degree bar diagram of *K* as the collection of \(\mathcal {P}\)-intervals associated to the *i*-th degree persistent homology of *K*.

We now consider the persistent homology of the filtered complex \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\). For \(0\leq i \leq |\mathcal {X}| - 1\) let \(\mathcal {B}_{i}\) be a set of generators for the *i*-th degree persistent homology of \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\), with fixed representing cycles, and \(\mathcal {B} = \bigcup {\mathcal {B}_{i}}\). We can put these generators in a bijection with bars of the bar diagram of \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\).

To help motivate the definition of the votings for each organism, we first list some of the components of the final formula. At each degree *i*, we assign to each organism *X* a collection of generators in \( \mathcal {B}_{i}\) such that these generators witness *X* as a contributor to the features of our complex. For instance, for each *X*, consider all *B* such that *X* appears in its representing cycle (remember that such cycle is a linear combination of simplices, containing organisms as vertices). This gives rise to functions \(F_{i}:\mathcal {X}\to 2^{\mathcal {B}_{i}}\).

For each generator \(B\in \mathcal {B}\), let start(*B*) and end(B) denote its beginning and end as a bar. Consider now a strictly decreasing function \(f:\mathbb {R}_{\geq 0} \to \mathbb {R}_{\geq 0}\) with *f*(*x*)→0 as *x*→*∞*; this function will modulate the contributions of each bar to the voting value of a given organism along the filtration value, while a decreasing function \(g:\mathbb {R}_{\geq 0} \to \mathbb {R}_{\geq 0}\) will modulate the contributions along homology degree.

With all this in place, the vote *v*(*X*) for organism *X* is computed as follows:

$$\begin{array}{@{}rcl@{}} v(X) &= &\sum_{i} g(i) \left(\sum_{B\in F_{i}(X)} \left(f(\text{start}(B)) - f(\text{end}(B)) \right) \right). \end{array} $$

(6)

Examples of *f,g* include \(f(t) = \frac {1}{t+1}\) and \(g(i) = \frac {1}{i+1}\), which is what we used to obtain the results in this paper.

Continuing our example from Figs. 1and 2 (bottom) shows the resulting bar diagram with associated generators after application of the mentioned procedure on the depth values. In this case there is only non-trivial homology in 0-th degree. After applying the voting scheme described above on the bar diagram shown in Fig. 2, the votes are *v*(*A*)=1.00,*v*(*B*)=0.14,*v*(*C*)=0.20.

### Model II: Čech complex

Recall that *Z*={*X*_{1},*X*_{2},...,*X*_{N}} is the set of all possible organisms, resulting from an ROM pipeline. Let \( \left (Y = S_{X_{1}X_{2}..X_{k}}\right) \in 2^{Z}\) be the set of reads associated with organisms *X*_{1}, *X*_{2},..., and *X*_{k}. Note that for all *i, j, k*,...,

$$ |S_{X_{i}}| \geq |S_{X_{i}X_{j}}| \geq |S_{X_{i}X_{j}X_{k}}| \geq... $$

(7)

The *t* (time) order filtration is *t*_{max} down to *t*_{min} in say steps of 1. A *k*-simplex on *X*_{1},*X*_{2},...,*X*_{k},*X*_{k+1} at time *t* is introduced if \(|S_{X_{1}X_{2}...X_{k}X_{k+1}}| \geq t\). Note that if the *k*-simplex on *X*_{1},*X*_{2},...,*X*_{k},*X*_{k+1} belongs to the complex, then so does each (*k*−1)-simplex on *X*_{1},*X*_{2},...,*X*_{k},*X*_{k+1} since, based on Eq. 7, for 1≤*i*≤*k*,

$$|S_{Y_{1}Y_{2}Y_{3}...Y_{k}}| \!\geq\! |S_{X_{1}X_{2}X_{3}...X_{k}X_{k+1}}|, \text{where}\ \!Y_{i} \!\!\in\! \{X_{1}, X_{2}, X_{3}, \!...,\! X_{k}, X_{k+\!1}\}. $$

Each node or organism *X* gets a true-positive score *v*(*X*) as follows: In the bar diagram, let *b* be the *h*-simplex =*X*_{0}*X*_{1}...*X*_{h} with bar length denoted as len (*b*). In this model, the vote *v*(*X*) for organism *X* is computed as follows:

$$\begin{array}{@{}rcl@{}} v(X)&= &\sum_{h} \left(\sum_{b \in H_{h}, X \in b} h \times \text{len}(b)\right) \end{array} $$

(8)

A simple example with four organisms is shown in Fig. 4.