In the following we provide a detailed description of each component of our workflow (Fig. 1). Our scheme is designed to be modular, and to be used in combination with any variation calling workflow.
The first part of our workflow is the generation of the ad hoc reference. This is done by the preprocessor, using as an input the raw reads of the donor as an input and the pan-genome reference.
The second part is to actually call the variants. We don’t provide any details on how to do it because we resort to a variant calling workflow, using our ad hoc reference instead of the standard one. In our experiments, we resort to GATK [4].
Finally, we need to normalize our variants. After the previous step the variants are expressed using the ad hoc reference instead of the standard. The normalization step uses metadata generated from the preprocessor to project the variants back to the standard reference.
Pan-genome preprocessor
The main role of the pan-genome preprocessor is to extract an ad hoc reference sequence from the pan-genome using the reads from the donor as an input.
Pan-genome representation
Following the literature reviewed in the Background section, the existing pan-genome indexing approaches for read alignment could be classified as follows. Some approaches consider the input as a set of sequences, some build a graph or an automata that models the population, and others consider the specific case of a reference sequence plus a set of variations. However, the boundaries between these categories are loose, as a set of sequences could be interpreted as a multiple sequence alignment, which in turn could be turned into a graph. Our scheme can work with different pan-genome representations and indexes provided that it is possible to model recombinations. The multiple sequence alignment and graph representations are versatile enough, but just a collection of sequences is not.
We consider our input pan-genome as a multiple sequence alignment and we store all the positions with a gap. In this way we decouple the problem of book keeping the structure of the pan-genome (in our case, as a multiple sequence alignment) and the problem of indexing the set of underlying sequences.
To transform one representation into the other and to be able to map coordinates we store bitmaps to indicate the positions where the gaps occur. Consider our running example of a multiple alignment
We may encode the positions of the gaps by four bitvectors:
Let these bitvectors be B1,B2,B3, and B4. We extract the four sequences omitting the gaps, and preprocess the bitvectors for constant time rank and select queries [27–29]: rank1(B
k
,i)=j tells the number of 1s in B
k
[1..i] and select1(B
k
,j)=i tells the position of the j-th 1 in B
k
. Then, for B
k
[i]=1, rank1(B
k
,i)=j maps a character in column i of row k in the multiple sequence alignment to its position j in the k-th sequence, and select1(B
k
,j)=i does the reverse mapping, i.e. the one we need to map a occurrence position of a read to add the sum in the coverage matrix.
These bitvectors with rank and select support take n+o(n) bits of space for a multiple alignment of total size n [27–29]. Moreover, since the bitvectors have long runs of 1s (and possibly 0s), they can be compressed efficiently while still supporting fast rank and select queries [30, 31].
Pan-genome indexing and read alignment
Now, the problem of indexing the pan-genome is reduced to index a set of sequences.
To demonstrate our overall scheme, we first use a naive approach to index the pan-genome as a baseline: we index each of the underlying sequences individually using BWA [1]. This approach does not offer a scalable pan-genome indexing solution, but it provides a good baseline for the accuracy that one can expect from a true pan-genome indexing solution to provide. In our experiments, this approach is labeled MSA
base
.
For a scalable solution that can manage large and highly repetitive set of references we resort to CHIC aligner [23], which combines Lempel-Ziv compression to remove the redundancy with a Burrows-Wheeler index to align the reads. In our experiments, this approach is labeled MSA
chic
.
Heaviest path extraction
After aligning all the reads to the multiple sequence alignment, we extract a recombined (virtual) genome favoring the positions where most reads were aligned. To do so we propose a generic approach to extract such a heaviest path on a multiple sequence alignment. We define a score matrix S that has the same dimensions as the multiple sequence alignment representation of the pan-genome. All the values of the score matrix are initially set to 0.
We use CHIC aligner to find the best alignment for each donor’s read. Then we process the output as follows. For each alignment of length m that starts at position j in the genome i of the pan-genome, we increment the scores in S[i][j],S[i][j+1]…S[i][j+m−1] (adjusting the indexes using the bit-vector representations considered in the previous subsection). When all the reads have been processed we have recorded in S that the areas with highest scores are those where more reads were aligned. An example of this is shown in Fig. 1.
Then we construct the ad hoc reference as follows: we traverse the score matrix column wise, and for each column we look for the element with the highest score. Then, we take the nucleotide that is in the same position in the multiple sequence alignment and append it to the ad hoc reference. This procedure can be interpreted as a heaviest path in a graph: each cell (i,j) of the matrix represents a node, and for each node (i,j) there are N outgoing edges to nodes (i+1,k), k∈{1,…,N}. We add an extra node A with N outgoing edges to the nodes (1,k), and another node B with N ingoing edges from nodes (L,k). Then the ad hoc reference is the sequence spelled by the heaviest path from A to B. The underlying idea of this procedure is to model structural recombinations among the indexed sequences.
A valid concern is that the resulting path might contain too many alternations between sequences in order to maximize the weight.
To address this issue there is a simple dynamic programming solution to extract the heaviest path, constrained to have a limited number of jumps between sequences: Consider a table V[1…L][1…N][0…Z] initially set to 0. The values V[i,j,k] correspond to the weight of the heaviest path up to character i, choosing the last character from sequence j, that has made exactly k changes of sequences so far. The recursion for the general case (k>0, i>1) is as follows: \(\phantom {\dot {i}\!} V[i,j,k] = S[i,j] + max \{ V[i-1,j,k], max_{j' \neq j} V[i-1,j',k-1] \}\), and the base case for k=0, i>1 is: V[i,j,0]=S[i,j]+V[i−1,j], and for k=0, i=1: V[1,k,0]=S1,j.
Once the table is fully computed, the weight of the heaviest path with at most k∗ changes is given by max
j
{V[L,j,k∗]}. To reconstruct the path we need to traceback the solution.
However, in our experiments we noticed that the unconstrained version that just selects a maximum weight path without additional constraints performs better than the constrained version, and so we use the former by default in our pipeline.
It is worth noting that as opposed to a graph representation of the pan-genome where the possible recombinations are limited to be those pre-existing in the pan-genome, our multiple sequence alignment representation can also generate novel recombinations by switching sequences in the middle of a pre-existing variant. This happens in our example in Fig. 1, where the ad hoc reference could not be predicted using the graph representation of the same pan-genome shown in Fig. 2.
Variant calling
Variant calling can be in itself a complex workflow, and it might be tailored for specific type of variants (SNVs, Structural Variants), etc. We aim for a modular and flexible workflow, so any workflow can be plugged in it. The only difference is that we will feed it the ad hoc reference instead of the standard one.
In our experiments, we used GATK [4] version 3.3, following the Best Practices: first we aligned the reads to the reference using BWA, and next we used Picard to sort the reads and remove duplicates. Then we performed indel realignment using GATK RealignerTargetCreator and IndelRealigner, and finally we called variants using GATK HaplotypeCaller using parameters genotyping mode = DISCOVERY, standemit conf = 10 and standcall conf = 30.
Normalizer
Finally we need to normalize our set of variants. To do so we apply the variants to the ad hoc reference, so that we obtain an alignment between the ad hoc reference and the predicted sequence. The metadata generated in the preprocessor stage – while extracting the heaviest path – includes an alignment between the standard reference and the ad hoc reference. Using those, we can run a linear-time algorithm to obtain an alignment between the standard reference and the predicted sequence. From this alignment, we can generate a vcf file that expresses the predicted sequence as a set of variants from the standard reference.
Experimental set-up
Evaluation metric
We separate the single nucleotide variant (SNV) calls from indel calls as the results differ clearly for these two subclasses. A true positive (TP) SNV call is a SNV in the true donor and in the predicted donor. A false positive (FP) SNV call is not a SNV in the true donor but is a SNV in the predicted donor. A false negative (FN) SNV call is a SNV in the true donor but is not a SNV in the predicted donor. A true positive (TP) indel call is either an inserted base in the true donor with an identical inserted base in the predicted donor, or a deleted base in both the true and predicted donor. A false positive (FP) indel call is neither inserted nor deleted base in the true donor but is either inserted or deleted base in the predicted donor. A false negative (FN) indel call is an inserted or deleted base in the true donor but is neither inserted nor deleted base in the predicted donor. We report precision=TP/(TP+FP) and recall=TP/(TP+FN).
Modification to graph representation of pan-genome
In our approach we have used a multiple sequence alignment to represent the pan-genomic reference, but it is relatively easy to to use a graph representation [16] instead. A graph representation of a pan-genome usually use a vertex-labeled directed acyclic graph (labeled DAG), and reads are aligned to the paths of this labeled DAG. After all the reads have been aligned to the pan-genome, instead of our score matrix, we can store for each vertex the number of read alignments spanning it. Then the heaviest path can be easily computed using dynamic programming in a topological ordering of the graph: the weight of the heaviest path h(v) to a vertex v is \(\max _{v' \in N^{-}(v)} h(v')+w(v)\), where w(v) is the weight of a vertex and N−(v) is the set of vertices connected with a in-coming arc to v.
The difference to the multiple alignment heaviest path is that the number of recombinations cannot be limited when using the graph representation.
Another part that is different is the normalizer module to map the variants predicted from the ad hoc reference to the standard reference. For this, the original proposal in [16] already records the path spelling the standard reference, so while extracting the heaviest path one can detect the intersection to the standard reference path and store the corresponding projection as an alignment. Thus, one can use the same evaluation metrics as in the case of multiple sequence alignment -based variation calling.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request; most of the data and scripts to replicate the experiments, as well as a pre-built pan-genome index for the 1000 Human Genomes project data, are available online: https://www.cs.helsinki.fi/gsa/panVC
Code availability
Our tools are open source and available online: https://gitlab.com/dvalenzu/PanVC