We present details of GWAMAR in this section, including the problem setting, the preprocessing of input data and the computation of the association scores between the drug resistance phenotypes and point mutations.
Problem setting
We consider a set of closely related bacterial genomes. Typically, this is a set of strains within the same species of bacteria.
Then, we represent the available drug resistance information as a set of drug-resistance profiles , where each drug resistance profile is represented as a vector:
.
Here, S, I, R denote that a given strain is known to be drug susceptible, intermediate-resistant, or resistant, respectively. Using question mark we indicate that the drug resistance status of a strain is unknown. We call a drug resistance profile complete if it does not contain question marks.
Analogously, we represent the genotype data as a set of mutations , where each mutation is represented as a triple (g, p, v), where g,p,v denote the gene family identifier, the position of the mutation in its corresponding multiple alignment, and the mutation profile, respectively. The mutation profile is represented as a vector:
.
Here Σ denotes the set of amino acides (for protein-coding genes) or nucleotides (for promoters and rRNA coding genes). We also assume that Σ contains the gap '-' symbol. Using question mark we indicate that the corresponding amino acid or nucleotide is unknown. Analogously, we call a mutation profile complete if it does not contain question marks.
It should be noted that potentially multiple mutations at different positions in the genome may have identical mutation profiles. Moreover, it may happen that multiple mutations may correspond to the same set of mutated strains. In that situation the mutations would essentially carry the same information about the profiles. Thus, we also introduce an auxiliary concept called binary mutation profile. Let denote the reference strain and denote any strain. Then, for a given mutation profile v its corresponding binary mutation profile is defined as follows:
Analogous to mutation profiles, we call a binary mutation profile complete if it does not contain question marks.
Finally, we define the aim of the tool as: To produce an ordered list of associations between the phenotype and genotype data (represented as drug resistance and mutation profiles) such that the top-scored associations are the most likely to be real.
The pipeline of GWAMAR
Figure 1 illustrates the overall design of the tool. Data preprocessing for GWAMAR consists of several steps which may be performed by our previously developed tool, eCAMBer [14]. These preprocessing steps comprise:
-
download genome sequences of multiple bacterial strains,
-
unification of gene annotations,
-
identification of homologous gene families,
-
multiple alignments of the gene families (employing MUSCLE),
-
reconstruction of the phylogenetic tree (employing PHYLIP),
-
identification of point mutations based on the multiple alignments.
The input drug-resistance profiles, typically, are collected from the articles or databases which provide drug-resistance information for the strains of interest. The set of identified point mutations, the set of drug-resistance profiles and and the phylogenetic tree constitute the input for GWAMAR.
In the next step, GWAMAR computes binary mutation profiles for each mutation profile. This step significantly reduces the number of genetic profiles to be scored. Finally, GWAMAR computes several statistical scores to associate drug-resistance profiles to the mutation profiles, including mutual information, odds ratio, hypergeometric test, weighted support (which is our previously published approach [13]), and the tree-generalized hypergeometric score (our new approach here).
Tree-generalized hypergeometric score
As a part of this work we also introduce a new association score, called tree-generalized hypergeometric score (TGH ). This score is a modification of the CCTSWEEP score introduced in the paper [16]. In this section, we consider a subset of strains S for which a given drug-resistance profile r and a binary mutation profile b are complete, i.e. do not contain question marks. Moreover, we assume that r does not contain any intermediate-resistant strains. In all our computational experiments we transform the intermediate-resistant strains into resistant strains.
In order to present the formal definition of TGH, we first define an auxiliary concept called coloring. For a given tree T , we call a subset c of its nodes a coloring, if it satisfies the following two conditions:
-
each path from a leaf to the root contains at most one node from c,
-
each internal node in c has a sibling node which does not belong to c.
Here we also introduce a function L which, for each node ω, returns the set of descendants of the node, including the node itself. We say these nodes are visible from ω. Additionally, the function L applied to a coloring c returns the union of all nodes visible from nodes in c.
Let C
T
denote the set of all colorings of T . Then, for each complete drug-resistance profile r there exists exactly one coloring ĉ such that the set of leaves visible from ĉ equals the set of drug-resistant nodes in r. We say this coloring is induced by the drug-resistance profile. Analogously, for each complete binary mutation profile b there is exactly one induced coloring .
Intuitively, the coloring induced by a given complete drug-resistance profile will contain the set of nodes in which drug-resistance was acquired (assuming a model in which drug-resistance cannot be reversed). Analogously, the coloring induced by a given binary mutation profile will contain the set of nodes in which the mutation was acquired.
Figure 2 (A) presents an example of colorings induced by a given drug-resistance profile (large red nodes) and a given binary mutation profile (small orange nodes) for a flat tree. In that situation the colorings may be interpreted as independent drawing of balls as in the standard hypergeometric distribution model. Knowing this property of TGH we proposed its name as it generalizes the standard hypergeometric test in the case of a flat tree. Figure 2 (B) presents another example of colorings induced by the same pair of profiles, but for a tree which is not flat. In this model the dependencies between different strains are captured by the topology of the tree.
Let us now assume we want to compute the TGH score for a pair of complete drug-resistance profile r and complete binary mutation profile b. Let us additionally assume that the size of coloring induced by b equals n. Morover, let the number of nodes in coloring visible from the coloring ĉ equals k. This value can be interpreted as the number of times the considered mutation was acquired not earlier than the resistance was acquired.
Now, let V
T
(n) denote the number of colorings of size n:
V
T
(n) may be interpreted as the total number of binary mutation profiles for which the induced coloring is of the same size as for .
Then, let B
T, ĉ
(k, n) denote the number of colorings of size n, such that exactly k nodes of that coloring are visible from nodes of coloring ĉ.
Here, the value B
T, ĉ
(k, n) may be interpreted as the number of binary mutation profiles such that their induced coloring has n elements, out of which k is visible from the nodes in ĉ.
Finally, for the complete drug-resistance profile r and complete binary mutation profile b, which induce colorings ĉ and , respectively, we define the TGH score as follows:
Here, we take the negative logarithm to have consistent property for all considered scoring methods, such that the higher the score the more likely drug-resistance profile r is associated with binary mutation profile b.
Time complexity
Let D denote the number of drug-resistance profiles considered. Additionally, let N denote the number of considered strains and M denote the number of binary mutation profiles. Finally, let K denote the maximal number of children of an internal node in the tree. Then, the time complexity of the algorithms we implemented to compute the hypergeometic score, the mutual information, odds ratio, and weighted support is O(D N M ). In order to compute TGH, we implement a dynamic programing algorithm which computes the values B
ω,ĉ
(k, n) for each internal node ω, k and n. This strategy gives an algorithm with complexity O(D·N K−1·N2 +D·N·M ). For the brevity of the presentation we skip details of the algorithm.