In this section we map the problem to topological data analysis, more specifically to bar diagrams arising from persistent homology.
Motivation and problem statement
Given a reference database and a collection of metagenomic reads from the sequencing process, we denote by Z the set of all possible organisms, resulting from a reads-to-organism-mapping (ROM) pipeline: Z={X1,X2,...,XN}. We treat all the reads of the pipeline output equally, i.e., do not consider the extent of the match. The rationale is that these characteristics have already been used by the pipeline to produce the output. Hence we focus on the subsequent information gathered from the pipeline: the relationship between the reads an the organisms.
Consider the bipartite graph of reads and organisms, where an edge connects a read to the assigned organism, that captures the ROM relationships completely. But a much more natural fit is to a simplicial complex where a k-simplex, k>0 (analogous to edge in a graph) is a read and the 0-simplex (analogous to a vertex in a graph) is an organism. Furthermore each of the the k-simplices are weighted, i.e., the weight is the number of reads that map to exactly these k organisms, which is not immediately apparent from the bipartite graph. The next question is whether the topological features of this naturally emerging simplicial complex can be exploited to solve the problem of false positives. In other words, is there some hidden characteristics of the ROM that separate the true from the false organisms.
In this paper, to make it more tractable, we simplify the problem as follows: Given the ROM pipeline output, obtain an order for the elements of Z as \(X_{i_{1}}, X_{i_{2}},..., X_{i_{N}}\), such that the top of the list is enriched for true positives (TPs). In a practical scenario, a threshold can be used to snip away the bottom of the list to eliminate potential false positive candidates. Further, the retained true positives can be used to rescue and re-assign the reads that had been misassigned due to the false positives.
Model I: Barycentric subdivision
Each organism Xj has a collection of reads mapped to it. For each subset \(Y = \{X_{1},\ldots,X_{k}\} \in 2^{\mathcal {X}}\), we denote by SY the set of reads mapped to the organisms X1,X2,...,Xk, and none other.
For instance, this implies that \( S_{\{X_{1},X_{2}\}} \cap S_{\{X_{1},X_{2},X_{3}\}} = \emptyset \). In general,
$$S_{Y} \cap S_{Y^{\prime}} = \emptyset, \text{whenever} Y \neq Y^{\prime}.$$
Definition 1
For \(Y \in 2^{\mathcal {X}}\), the area aY is defined to be the sum total of all the lengths of the reads in SY.
Definition 2
For \(Y \in 2^{\mathcal {X}}\) and Xj∈Y we denote the length of the genome of the organism Xj covered by the reads in SY by \(l_{Y,X_{j}}\), and define
$$d_{Y}(X_{j}) = \frac{1}{m^{2}}\left(\frac{a_{Y} }{l_{Y,X_{j}}}\right),$$
where m=|Y|.
This is motivated by the fact that if there is true overlap in the m organisms, then the depth d of coverage of each organism is magnified by some factor of m. Note that it is quite possible that for Y={X1,X2,…,Xm}, the values dY(X1),dY(X2),dY(X3), …, dY(Xm) are all distinct.
Example 1
Consider a set of 3 organisms \(\mathcal {X}=\{A,B,C\}\). A graphical representation of the underlying mapped reads is shown in Fig. 1 and the values of dY for each set \(Y\in 2^{\mathcal {X}}\) are given in Fig. 2 (top).
The depth values were computed after simulating a small set of reads’ placement on three reference genomes. For example, in this case no reads match both B and C (and not A), hence dBC(B)=dBC(C)=0. However, some reads match all three organisms, hence dABC>0. The values dA(A),dB(B),dC(C) indicate the depths of A, B, C based on their uniquely mapping reads.
Definition 3
(setYt for at≥0)For \(Y \in 2^{\mathcal {X}}\) and \(t \in \mathbb {R}, t \geq 0\), we denote
$$Y^{t} = \{ X_{j} \in Y \mid d_{Y}(X_{j}) \geq t\}.$$
Note that the sets (Xi)1≤i≤N give a cover of the set of reads, while the sets \((S_{Y})_{Y \in 2^{\mathcal {X}}}\) form a partition of the reads. This corresponds to the collection of organisms present at a time t, mapped to some stage of a filtration of a simplicial complex (see Definition 5).
Definition 4
Given any finite set S, a simplicial complex with vertices in S, is a collection K of subsets of S, with the property that if A∈K, then for all B∈2S with B⊂A, B∈K. A subcollection L of K is a subcomplex of K if L is a simplicial complex.
In other words, K is closed under the operation of “taking subsets". This last property is crucial for the definition of the simplicial homology groups, H∗(K), associated to a simplicial complex K.
Definition 5
A finite filtered simplicial complex is a finite sequence K0⊂K1⊂⋯⊂KN of simplicial complexes. In other words, Ki is a subcomplex of Kj for i≤j. A filtration of a simplicial complex K is a filtered simplicial complex with KN=K.
This particular setup allows us to apply persistent homology to our problem.
To any finite simplicial complex K one can associate for each j≥0, a finite dimensional vector space Hj(K,Q) (called the j-th simplicial homology group of K). Moreover, if we have a filtration of a simplicial complex K as above, then each inclusion is,t:Ks→Kt,s≤t induces a linear map, is,t,∗:Hj(Ks,Q)→Hj(Kt,Q). The images of these linear maps are usually called the persistent homology groups of the filtration, and their dimensions (i.e. the ranks of the maps is,t,∗) determine the so called “bar diagram" associated to the filtration. Intuitively, the dimensions of the vector spaces Hj(K,Q) (also called the j-th Betti number of the simplicial complex K) measure the number of independent j-dimensional cycles which are not boundaries of any (j+1)-dimensional subcomplex of K (so called j-dimensional holes), and the bar diagram of the filtration is a record of the “times" of the births and deaths of these “homology classes", where we think of the sequence (Kt)t∈[0,N] as a complex growing with time t. Each bar in the bar diagram represents the interval in time in which a homology class persisted. We refer the reader to the book by Edelsbrunner and Harer [13] for further details about persistent homology and its use in Topological Data Analysis.
However, notice that for each \(t \in \mathbb {R}, t \geq 0\), the set system \((Y^{t})_{Y \in 2^{\mathcal {X}}}\) does not necessarily satisfy the condition of being closed under taking subsets – and hence, does not necessarily form a simplicial complex with \(\mathcal {X}\) as the set of vertices.
Instead, we utilize the following construction (a simplicial subcomplex of a barycentric subdivision, detailed at the beginning of this section) that does produce a simplicial complex (in fact a filtration of complexes), and which is also naturally aligned with the various functions dY(·) defined earlier.
Intuitively, the simplices of the barycentric complex correspond to chains of subsets of \(\mathcal {X}\). For example, if \(X_{1},X_{2},X_{3} \in \mathcal {X}\), then the sequence {X1}⊂{X1,X2}⊂{X1,X2,X3} is a chain in the partially ordered set \(2^{\mathcal {X}}\) (ordered by inclusion). Given t≥0, we include the chain {X1}⊂{X1,X2}⊂{X1,X2,X3} in the barycentric complex iff {X1}t∩{X1,X2}t∩{X1,X2,X3}t≠∅.
Barycentric subdivision and its subcomplex of interest
We now define more precisely the barycentric complex, and the subcomplex we will use.
Let \(\Delta _{\mathcal {X}} \) denote the \((|\mathcal {X}|-1)\)-dimensional simplex with vertex set \(\mathcal {X}\).
A sequence \(\boldsymbol {\sigma } = (Y_{0},\ldots,Y_{p}), Y_{i} \in 2^{\mathcal {X}}\) is called a chain of the poset \(2^{\mathcal {X}}\) (ordered by inclusion) if \(Y_{0} \subsetneq Y_{1} \subsetneq \cdots \subsetneq Y_{p}\). The first barycentric subdivision of \(\Delta ^{\prime }_{\mathcal {X}}\) of \(\Delta _{\mathcal {X}}\) (see Fig. 3) is then defined by
$$ \Delta^{\prime}_{\mathcal{X}} = \left\{\mathbf{Y} = (Y_{0},\ldots,Y_{p}) \mid \mathbf{Y} \text{is a chain in}\ 2^{\mathcal{X}}\right\}. $$
(1)
Now let \(\mathcal {X}\) be a finite set, and let \(\mathbf {d} = (d_{Y}:Y \rightarrow \mathbb {R})_{Y \in 2^{\mathcal {X}}}\) be a tuple of maps. For each \(t \in \mathbb {R}\), we denote by \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\) the subcomplex of \(\Delta _{\mathcal {X}}^{\prime }\) defined as follows.
For \(t \in \mathbb {R}\), and \(\mathbf {Y} = (Y_{0},\ldots,Y_{p}) \in \Delta ^{\prime }_{\mathcal {X}}\), let
$$ \mathbf{Y}_{\mathbf{d}}(t) =\bigcap_{0 \leq i \leq p} Y_{i}^{t} =\bigcap_{0 \leq i \leq p} \{X \in Y_{i} \mid d_{Y_{i}}(X) \geq t \}, $$
(2)
and set
$$ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t) = \left\{\mathbf{Y} = (Y_{0},\ldots,Y_{p}) \in \Delta^{\prime}_{\mathcal{X}} \mid \mathbf{Y}_{\mathbf{d}}(t) \neq \emptyset\right\}. $$
(3)
We continue with Example 1 of three organisms A,B,C. In this case, the vertices of \(\Delta ^{\prime }_{\mathcal {X}}\) are {A,B,C,AB,AC,BC,ABC}, where we write A for {A}, AB for {A,B} and so on. The top part of Fig. 2 implies that we have only 7 different values for our functions dY. So we only need to specify the following collection of such \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\) (taking into account Remark 2 below):
$$ {\begin{aligned} \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(0) &= \{[A]\}\\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.36) &= \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(0)\cup\{[B],[C]\} \\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.68) &=\Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.36)\cup \{[AB],[A,AB],[B,AB]\}\\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.86) &= \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.68)\cup\{[AC],[A,AC],[C,AC]\} \\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.99) &= \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.86)\cup\{[ABC],[C,ABC],[AC,ABC],[C,AC,ABC]\}\\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(1) &=\Delta^{\prime}_{\mathcal{X},\mathbf{d}}(.99)\cup \{[A,ABC],[B,ABC],[AB,ABC],[A,AB,ABC],\\ & \hspace{3cm} [A,AC,ABC],[B,AB,ABC]\} \\ \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(1.36) &=\Delta^{\prime}_{\mathcal{X}}. \end{aligned}} $$
See Fig. 2 (middle) for a graphic representation of this filtered simplicial complex.
Fact 1
\(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\)is a simplicial complex with vertices in \(2^{\mathcal {X}}\). Moreover, for any sequence t0>t1>…>tN,
$$\Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t_{0}) \subset \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t_{1}) \subset \cdots \subset \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t_{N}).$$
Proof
If \(\mathbf {Y} = (Y_{0},\ldots,Y_{p})\in \Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\) is a chain, and \(\mathbf {Z} = (Y_{i_{0}},\ldots,Y_{i_{q}}), 0 \leq i_{0} < \cdots < i_{q} \leq p\), is a subchain, then it is straightforward to check that
$$\mathbf{Y}_{\mathbf{d}}(t) \subset \mathbf{Z}_{\mathbf{d}}(t). $$
Since Yd(t) is not empty, Zd(t) is not empty as well, so \( \mathbf {Z}\in \Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\).
Now assume that ti>tj for some i,j∈{0,…,N}, and let \(\mathbf {Y}\in \Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t_{i})\). We have that if Yd(ti)≠∅ then there exists some X such that for all Yi in Y we have \(d_{Y_{i}}(X)\geq t_{i}> t_{j}\), therefore \(Y\in \Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t_{j})\). □
Filtration
Armed with Fact 1 above, we have the following definition.
Definition 6
Define the filtered simplicial complex
$$ \Delta^{\prime}_{\mathcal{X},\mathbf{d}} = \bigcup_{t\in\mathbb{R}} \Delta^{\prime}_{\mathcal{X},\mathbf{d}}(t). $$
(4)
This definition means that each of the simplicial complexes \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}(t)\) can be thought as a particular point in time of the filtration \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\).
While the constructions so far deal with theoretical considerations, for actual computations we use the following remarks, which were in fact used to compute the filtration values for Example 1.
Remark 1
Let \(\boldsymbol {\sigma } = (Y_{0},\ldots,Y_{p}) \in \Delta ^{\prime }_{\mathcal {X}}\), and let
$$\tau = \bigcap_{Y\in\boldsymbol{\sigma}} Y. $$
Define t0 by
$$ t_{0}= \max_{X\in \tau}\min_{Y\in \sigma} \lbrace d_{Y}(X)\rbrace. $$
(5)
Then t0 is the time of addition of σ to \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\).
Remark 2
To make \( \Delta ^{\prime }_{\mathcal {X},\mathbf {d}} \) covariant with respect to t, we use the change of variable t′=m−t where m denotes the maximum value attained by the functions dY.
From now on, unless stated otherwise, we will use the aforementioned change of variable.
Voting Scheme. Next we develop a voting scheme for all organisms that aims to solve the problem of ordering the organisms from true to false positives, using tools from algebraic topology. We refer the reader to Subsection 2.6 and Section 3 of [14] for the definition of persistent homology of a filtered simplicial complex, and a characterization in terms of \(\mathcal {P}\)-intervals respectively. In particular, using Corollary 3.1 of said article we have the following definition.
Definition 7
Let K be a filtered simplicial complex and i≥0 an integer. We define the i-th degree bar diagram of K as the collection of \(\mathcal {P}\)-intervals associated to the i-th degree persistent homology of K.
We now consider the persistent homology of the filtered complex \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\). For \(0\leq i \leq |\mathcal {X}| - 1\) let \(\mathcal {B}_{i}\) be a set of generators for the i-th degree persistent homology of \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\), with fixed representing cycles, and \(\mathcal {B} = \bigcup {\mathcal {B}_{i}}\). We can put these generators in a bijection with bars of the bar diagram of \(\Delta ^{\prime }_{\mathcal {X},\mathbf {d}}\).
To help motivate the definition of the votings for each organism, we first list some of the components of the final formula. At each degree i, we assign to each organism X a collection of generators in \( \mathcal {B}_{i}\) such that these generators witness X as a contributor to the features of our complex. For instance, for each X, consider all B such that X appears in its representing cycle (remember that such cycle is a linear combination of simplices, containing organisms as vertices). This gives rise to functions \(F_{i}:\mathcal {X}\to 2^{\mathcal {B}_{i}}\).
For each generator \(B\in \mathcal {B}\), let start(B) and end(B) denote its beginning and end as a bar. Consider now a strictly decreasing function \(f:\mathbb {R}_{\geq 0} \to \mathbb {R}_{\geq 0}\) with f(x)→0 as x→∞; this function will modulate the contributions of each bar to the voting value of a given organism along the filtration value, while a decreasing function \(g:\mathbb {R}_{\geq 0} \to \mathbb {R}_{\geq 0}\) will modulate the contributions along homology degree.
With all this in place, the vote v(X) for organism X is computed as follows:
$$\begin{array}{@{}rcl@{}} v(X) &= &\sum_{i} g(i) \left(\sum_{B\in F_{i}(X)} \left(f(\text{start}(B)) - f(\text{end}(B)) \right) \right). \end{array} $$
(6)
Examples of f,g include \(f(t) = \frac {1}{t+1}\) and \(g(i) = \frac {1}{i+1}\), which is what we used to obtain the results in this paper.
Continuing our example from Figs. 1and 2 (bottom) shows the resulting bar diagram with associated generators after application of the mentioned procedure on the depth values. In this case there is only non-trivial homology in 0-th degree. After applying the voting scheme described above on the bar diagram shown in Fig. 2, the votes are v(A)=1.00,v(B)=0.14,v(C)=0.20.
Model II: Čech complex
Recall that Z={X1,X2,...,XN} is the set of all possible organisms, resulting from an ROM pipeline. Let \( \left (Y = S_{X_{1}X_{2}..X_{k}}\right) \in 2^{Z}\) be the set of reads associated with organisms X1, X2,..., and Xk. Note that for all i, j, k,...,
$$ |S_{X_{i}}| \geq |S_{X_{i}X_{j}}| \geq |S_{X_{i}X_{j}X_{k}}| \geq... $$
(7)
The t (time) order filtration is tmax down to tmin in say steps of 1. A k-simplex on X1,X2,...,Xk,Xk+1 at time t is introduced if \(|S_{X_{1}X_{2}...X_{k}X_{k+1}}| \geq t\). Note that if the k-simplex on X1,X2,...,Xk,Xk+1 belongs to the complex, then so does each (k−1)-simplex on X1,X2,...,Xk,Xk+1 since, based on Eq. 7, for 1≤i≤k,
$$|S_{Y_{1}Y_{2}Y_{3}...Y_{k}}| \!\geq\! |S_{X_{1}X_{2}X_{3}...X_{k}X_{k+1}}|, \text{where}\ \!Y_{i} \!\!\in\! \{X_{1}, X_{2}, X_{3}, \!...,\! X_{k}, X_{k+\!1}\}. $$
Each node or organism X gets a true-positive score v(X) as follows: In the bar diagram, let b be the h-simplex =X0X1...Xh with bar length denoted as len (b). In this model, the vote v(X) for organism X is computed as follows:
$$\begin{array}{@{}rcl@{}} v(X)&= &\sum_{h} \left(\sum_{b \in H_{h}, X \in b} h \times \text{len}(b)\right) \end{array} $$
(8)
A simple example with four organisms is shown in Fig. 4.