Inconsistency between CARD models and BLAST homology
To discover ARGs from query DNA sequences, CARD predicts open reading frames from the query data using Prodigal [15] and then performs protein-protein alignment with BLASTP [16]. A critical feature of CARD is that it provides a trained BLASTP alignment bit-score threshold for each type of antibiotic resistance gene. In contrast, other existing databases mainly use an empirical or user-set parameter for the discovery of all genes. For example, another popular database, Resfinder [17], requires percent identity and coverage on reference genes as input parameters. The reason the approach of CARD is more appropriate is that ARGs in one category may be almost identical to each other, while some categories can contain ARGs with relatively low similarity. We take two types of ARGs that are represented in both CARD and Resfinder for illustration: tet(A) [18] and mcr-1 [9, 10]. When all sequences of tet(A) in Resfinder are aligned to sequences of the same type in CARD, the mean percent identity is 99.6%. However, for mcr-1 the mean percent identity is 47.2%. The degree of similarity within a type of ARG can therefore vary widely, so it is more reasonable to have a specific threshold for each type of ARG. However, such flexible models can give type classifications that are not coherent with BLAST alignment homology relationships. Since the model only considers whether the bit score passes the threshold of a single ARG type, it can happen that the type classification of a query sequence is not its best BLAST hit. For example, if ARG type A reports a higher bit score than type B for a given query sequence, but the pre-trained threshold of A is much higher than that of B, then type B could be chosen instead of A. Since BLAST alignment serves as a generally accepted method to evaluate the similarity between genome sequences, we consider such occurrences of incoherence with BLAST homology to be ambiguous cases that need special attention.
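The incoherence can be sketched in a few lines of Python. The type names, thresholds, and bit scores below are invented for illustration and are not taken from CARD:

```python
# Per-type bit-score thresholds, as in CARD, can classify a query to a type
# that is not its best BLAST hit. All values here are hypothetical.
thresholds = {"A": 2200, "B": 750}   # pre-trained cutoffs per ARG type
hits = {"A": 1900, "B": 800}         # BLASTP bit scores of one query vs. each type

best_hit = max(hits, key=hits.get)   # homology-based answer: type "A"
passing = [t for t, s in hits.items() if s >= thresholds[t]]
model_call = max(passing, key=hits.get) if passing else None  # threshold-based: "B"

# The query's best BLAST hit is A, but only B's threshold is passed,
# so the threshold-based model reports B: an ambiguous case.
```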
Ambiguity in RND efflux pumps
RND efflux pumps [19] are a superfamily of transporters that have garnered intensive research efforts. Studies have revealed that they play critical roles in the development of multidrug resistance in various kinds of bacteria. The CARD database presents a series of ARG types in this family. We notice that one gene in this family (adeF) is given a relatively low threshold, a bit score of 750, which allows sequences with lower than 50% identity to be reported as an instance of this type. However, other genes mainly require much higher identity. MexF, another ARG in the RND family, requires a bit score of 2200, which only allows almost completely identical sequences to be reported. Since genes in the RND family can display a certain level of homology even though they belong to different subtypes, the ambiguous cases described in the last section are likely to occur. This can be clearly demonstrated with the help of ARG sequences in SARG [7], another ARG database that contains ARG protein sequences in the RND family. There are over 300 protein sequences with the MexF annotation in SARG. We align these sequences to the CARD database. As a result, the MexF entry in CARD is clearly the best BLAST alignment hit for these sequences. However, they will be classified to adeF under the curated model of CARD, since their bit score does not reach the threshold of MexF. Instead, their bit score against the adeF entry exceeds the threshold of that ARG type, so the MexF sequences in SARG are all classified to the adeF entry by the CARD model.
Describing ambiguity in the CARD database
To describe and quantify the ambiguity inside the classification model of CARD in a systematic manner, we define FN-ambiguity and Coherence-ratio.
First of all, we have several basic variables:

1) N_{i} = the number of prevalence sequences that can align to ARO entry A_{i}

2) C_{i} = the number of sequences that are currently classified to entry A_{i}
Then we define two indicators with respect to the pre-trained bit-score cutoff: potential False Negatives (FN-ambiguity) for some ARO entries, which we would like to reduce, and the Coherence-ratio with respect to BLAST best hits, which we intend to maximize.

A) FN-ambiguity:
If a prevalence sequence S_{i} not annotated to the ARG A_{j} has (bit score, percent identity) both larger than another sequence S_{k} that is annotated to the ARG, then S_{i} is a potential FN for this ARO. Letting M_{j} = the number of such potential FN sequences, we have:
$$ FN\_ ratio\left({A}_j\right)={M}_j/{N}_j $$
(1)
Also, we say that each such (S_{i}, S_{k}) is an FN-ambiguous pair for ARO A_{j}.
For each potential FN sequence S_{i} with respect to ARO entry A_{j}, let K_{i} = the number of sequences with lower (bit score, identity) than S_{i} that are annotated to A_{j}. We can calculate the probability of the occurrence of an FN-ambiguous pair for an ARO A_{j} by:
$$ {P}_{FN\hbox{-} ambiguous\ pair}=\frac{\sum_i{K}_{i,j}}{\left({N}_j-{C}_j\right)\ast {C}_j} $$
(2)
In the worst case, each of the sequences not annotated to the entry (N_{j} − C_{j} of them) has (bit score, identity) larger than all sequences annotated to the entry (C_{j} of them), and then P = 1.
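A minimal sketch of Eqs. (1) and (2) on toy data may clarify the two indicators. The sequences, scores, and type names below are invented; each sequence carries a (bit score, percent identity) pair against one ARO entry "X" plus its current annotation:

```python
# Toy evaluation of FN-ratio (Eq. 1) and P_FN-ambiguous-pair (Eq. 2) for ARO "X".
# Each tuple: (bit_score, pct_identity, annotated_type). All values hypothetical.
seqs = [
    (900, 95.0, "X"), (850, 90.0, "X"),  # annotated to X  -> C_j = 2
    (950, 97.0, "Y"),                    # dominates both X sequences -> potential FN
    (800, 80.0, "Y"),                    # dominates neither X sequence
]
pos = [(b, p) for b, p, t in seqs if t == "X"]
neg = [(b, p) for b, p, t in seqs if t != "X"]
N_j, C_j = len(seqs), len(pos)

# K_i: for each non-X sequence, how many X-annotated sequences it dominates
# in BOTH bit score and identity.
K = [sum(1 for b, p in pos if nb > b and np_ > p) for nb, np_ in neg]
M_j = sum(1 for k in K if k > 0)      # sequences that are potential FNs

FN_ratio = M_j / N_j                  # Eq. (1): 1/4
P_pair = sum(K) / ((N_j - C_j) * C_j) # Eq. (2): 2 / (2*2)
```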
In the above example of MexF, the FN-ratio is 0.79%, with P_{FN-ambiguous pair} at 0.07%. Over the whole database, the mean (sequence-ambiguity-ratio, pair-ambiguity-ratio) is (3.1%, 1.6%). We can see in Fig. 1 that both ratios gather below 5%, with a small number of exceptions. Ratio coordinates of MexF, adeF, and entries with exceptionally high ratios are shown in Fig. 2.

B) Coherence ratio:
For a prevalence sequence S_{i}, suppose its best-hit ARO entry is A_{j} (the entry with the highest BLASTP alignment bit score). If the current ARG type annotation of S_{i} is also A_{j}, we say S_{i} is a coherent instance for A_{j}. Letting the number of coherent instances for A_{j} be TP_{j} and the total number of sequences with A_{j} as the best-hit ARO be B_{j}, we define:
$$ Coherence\left({A}_j\right)=T{P}_j/{B}_j $$
(3)
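Eq. (3) can be sketched directly from (best hit, annotation) pairs. The pairs below are invented to mirror the MexF/adeF situation, not real database counts:

```python
# Coherence (Eq. 3): of all sequences whose BLAST best hit is a given ARO,
# what fraction is also annotated to that ARO? Pairs are hypothetical.
records = [
    ("MexF", "MexF"),   # coherent instance for MexF
    ("MexF", "adeF"),   # best hit MexF, but annotated to adeF
    ("MexF", "adeF"),
    ("adeF", "adeF"),   # coherent instance for adeF
]

def coherence(aro):
    best = [(b, a) for b, a in records if b == aro]   # B_j sequences
    tp = sum(1 for b, a in best if a == aro)          # TP_j coherent instances
    return tp / len(best)
```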
Since BLAST is the most well-established software to measure the homology between sequences, it is reasonable to evaluate the coherence between the homology relationships given by the CARD ARG models and BLASTP alignment. We see on many occasions that the ARG type annotation of a prevalence sequence is not its best-hit ARO entry. Take MexF (ARO: 3000804), mentioned in the previous experiments, as an example (Fig. 3). We see a large portion of prevalence sequences with MexF as their best hit but annotated to adeF (red points in the figure).
Since BLAST is the most recognized tool for evaluating homology between sequences, it is preferable for ARG identification models to be more coherent with the homology relationships according to BLAST. Therefore, we will seek to annotate more sequences to their best-hit ARO entry. For the above example, this means "recoloring" all or a portion of the red points (currently annotated to adeF) to MexF. However, the type change may increase the number of potential False Negatives in the space of adeF, as shown in Fig. 4. To reflect the trade-off between the two, we set our objective function to be:
$$ {L}_{S,A}=\sum_i\left(\frac{\mid {B}_i\mid }{\mid S\mid } Coherence\left({A}_i\right)-\frac{\mid {A}_i\mid }{\mid S\mid } FN\_ ratio\left({A}_i\right)\right) $$
(4)
where ∣A_{i}∣ is the number of sequences currently annotated to A_{i} and ∣S∣ is the total number of sequences.
In the next section, we will show how we can largely elevate the coherence ratio while keeping the FN-ratio on a significantly smaller scale.
Resolving ambiguity by recoloring with a support vector machine
Given a set of query sequences S, we align them to the representative sequences of all ARO entries in the CARD database. For each sequence S_{i}, only the best hit, with both the highest bit score and the highest percent identity, is kept. If the best-hit ARO A_{i} of a query sequence is different from the ARO A_{j} assigned by the CARD model, we include this sequence in the "Problem Set" of A_{i} (denoted by PS_{i}). The key point of this step is: when sequences in the problem set are aligned to two similar ARO entries, we view Align_ARO_A_{i}(S_{i}) as a transform from a sequence to the 2D coordinate space (Percent Identity, Bit Score). For the same set of sequences, if in the transformed space Align_ARO_A_{i}(S_{i}) they are clearly distinguished from the other, negative hits, but in another space they are mixed together, then it is reasonable to think that this set of sequences are true positives of A_{i} instead of A_{j}. We can illustrate the argument with the problem set of the ARO entry cmeB, whose sequences are annotated to adeF (Fig. 5). In the space of cmeB, the best hits of the red points are cmeB, but they are annotated to adeF since the bit-score cutoff of cmeB set by CARD is much higher than that of adeF. However, when we observe the problem set together with the negative hits in either space, we can see that in the cmeB space the problem set is clearly above the negatives, but in the adeF space there are negatives both above and below the problem set. Therefore, it is reasonable to say these sequences are potential false positives of adeF and true positives of cmeB.
To quantify how well the problem set is separated from the negatives, we compute a support vector machine (SVM) in each space. The idea of using an SVM is inspired by the clear linear separability between a part of the problem set and the whole set of negatives of MexF (Fig. 6). In this situation, it is reasonable to believe that the upper-right part of the problem set are not negatives of MexF (currently they are classified to adeF) and thus should be recolored to MexF.
Here the SVM serves as a measurement of the separability of points of different classes in a space. There is an established computational method for evaluating such separability of an SVM in the Python scikit-learn package, namely Platt scaling. The mean probability of the prediction on all these points can be calculated by Platt scaling. The probability computed in this way increases as a point moves away from the division line of the SVM, so it can be used to determine which space is the better transform.
$$ {P}_{space\_ ARO}\left(y=1\mid X\right)=1/\left(1+\exp \left(A\ast f(x)+B\right)\right) $$
(5)
where f(x) = wx + b is the division line of the SVM, and A, B are parameters trained from the prediction data by Platt scaling.
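As a rough illustration, scikit-learn's `SVC` with `probability=True` fits exactly this Platt-scaled sigmoid on top of the SVM decision function. The 2D points below are synthetic, standing in for (percent identity, bit score) coordinates of a well-separated problem set and its negatives:

```python
# Separability measurement via Platt scaling (Eq. 5) with scikit-learn.
# The two clusters are synthetic, not real alignment data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
problem = rng.normal([90.0, 2000.0], [2.0, 50.0], size=(50, 2))    # problem set
negatives = rng.normal([60.0, 900.0], [5.0, 100.0], size=(50, 2))  # negative hits
X = np.vstack([problem, negatives])
y = np.array([1] * 50 + [0] * 50)

# probability=True makes SVC fit the Platt sigmoid P(y=1|x) = 1/(1+exp(A*f(x)+B)).
svm = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

# Mean Platt probability over the problem-set points: higher values indicate
# the two classes are more cleanly divided in this space.
mean_p = svm.predict_proba(X)[:, 1][y == 1].mean()
```

Comparing this mean probability between the two candidate spaces is the decision rule the text describes: the space with the higher value is taken as the better transform.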
If the space of the current ARO (adeF in the above case) of the problem set reports a lower Platt probability, we recolor the portion of the problem set above the division line of the SVM (Fig. 7) to the ARO of the best hits, and update the cutoffs of both AROs to their updated lowest bit score and lowest percent identity. For cmeB, where an SVM with high separability is computed, we use the decision function of the SVM as an extra cutoff method.
Besides cmeB, eleven other ARO entries have problem sets containing more than 50 sequences annotated to adeF. The full set of entries is {'ceoB', 'mexY', 'cmeB', 'mexQ', 'mdsB', 'oqxB', 'MexB', 'MexF', 'acrD', 'adeB', 'acrB', 'AcrF'}. We compute their problem sets and then evaluate in which space these sequences are better separated from the negatives compared with adeF. We plot the situation of acrB vs. adeF in Fig. 8. In this case, the predicted log-probability of the SVM for acrB is lower than that of the SVM for adeF, and we can also see from the 2D coordinates that the red and gray points in the acrB space tend to mix with each other. Thus, we will not consider recoloring the problem set of acrB.
In contrast, we can see a clear division between a large portion of the problem set of MexF and its negatives (Fig. 7). After computing the SVM, the right-hand portion of the problem set is recolored to MexF.
Formulation of categorical optimization problem
The last section demonstrated that we can increase the coherence with BLAST homology relationships while maintaining a low FN-ambiguity rate by "recoloring" a part of the sequences. However, the above transform is more an empirical trial than a systematic optimization. Therefore, in this section we formulate a categorical optimization problem [20] for the recoloring process between two spaces: a fixed "origin" space (adeF in our problem instance) and an "alternative" space (MexQ, MexF, etc.). For a set of protein sequences G[1..N], we define a categorical variable X_{i} ∈ {O, A, null (neither of the two types)} for G_{i}, representing its ARG category classification. Every assignment of X[1..N] is called a "configuration". The initial configuration is the SVM result from the last section. We have discussed that recoloring a sequence from the origin (adeF) to the alternative (MexF) may increase the coherence ratio of the alternative (MexF) space but add ambiguity to the origin (adeF) space. Suppose a sequence P of type O has higher (Percent Identity, Bit Score) than some other sequences of type O. If P is recolored to another type, then there should be a penalty to the confidence of those sequences.
For computational efficiency, we divide the Percent Identity – Bit Score map into a grid of M × N equal-sized cells. A point in cell (x, y) in the origin space with type A imposes one unit of penalty on all points of type O in its lower-left region, excluding the cell itself. Let n_{o}(x−1, y−1) be the number of type-O points in the rectangle (0, 0, x−1, y−1), and let N_{O} (N_{A}) be the total number of type-O (type-A) points. For each type-A point (x, y) we have:
$$ Penalt{y}_O\left(x,y\right)={n}_o\left(x-1,y-1\right) $$
(6)
Penalties of all type A points are added and normalized to get the total penalty on the origin space:
$$ Penalty(O)=\sum \limits_{point\left(x,y\right)\ of\ type\ A}{n}_o\left(x-1,y-1\right)/\left({N}_A\ast {N}_O\right) $$
(7)
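The rectangle counts n_{o}(x−1, y−1) are the natural use case for a 2D prefix sum, which is presumably why the grid is introduced for efficiency. A sketch on an invented 4 × 4 grid with hypothetical point placements:

```python
# Eqs. (6)-(7) via a 2D prefix sum over the (identity, bit-score) grid.
# Grid size and point counts are invented for illustration.
import numpy as np

M = N = 4
o_grid = np.zeros((M, N), int)       # type-O points per cell
a_grid = np.zeros((M, N), int)       # type-A points per cell
o_grid[0, 0] = 2; o_grid[1, 1] = 1   # three type-O points  (N_O = 3)
a_grid[2, 2] = 1; a_grid[1, 0] = 1   # two type-A points    (N_A = 2)

# prefix[x, y] = number of type-O points in the rectangle (0, 0, x, y)
prefix = o_grid.cumsum(axis=0).cumsum(axis=1)

def n_o(x, y):
    # n_o(x-1, y-1); an empty lower-left region contributes 0
    return prefix[x, y] if x >= 0 and y >= 0 else 0

N_O, N_A = o_grid.sum(), a_grid.sum()
total = sum(a_grid[x, y] * n_o(x - 1, y - 1)       # Eq. (6) per type-A point
            for x in range(M) for y in range(N))
penalty = total / (N_A * N_O)                      # Eq. (7), normalized
```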
To make the optimization problem more biologically reasonable, we add an extra restriction that the alternative space remains linearly separable, as drawn in Fig. 7. Formally speaking, we require that there exists a line Y = aX + b in the alternative space such that the points of type A are all above the line. We intend to compute the slope and intercept of the optimal division line. Therefore, our final objective function to maximize is:
$$ f\left(a,b\right)= Coherence(A)-k\ast Penalty(O)\kern0.37em \left(k\ge 1\right) $$
(8)
Higher coherence indicates higher potential sensitivity for A, while a higher penalty means potential wrong classifications. Therefore, we tend to give a larger weight to the latter term, since we usually prioritize preventing false results. However, the specific value of k depends on the needs of the application and also on the specific ARG type that we are concerned with. Here we use MexQ to set a valid range of k, and then explore the results on MexF for k in that range. The reason we choose MexQ to set the range is that it gives the highest probability calculated by Platt scaling in the last section, which means the problem set and the negatives of MexQ are already well separated in the space, so we can trust the initial configuration of MexQ as the answer. Therefore, k is set to be in (1, 100) so that the initial configuration for MexQ is optimal.
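The role of k in Eq. (8) can be illustrated with a toy selection among candidate division lines. The (slope, intercept) candidates and their (coherence, penalty) values below are invented, not computed from the data:

```python
# Selecting the division line that maximizes f(a, b) = Coherence(A) - k * Penalty(O).
# Candidate lines and their scores are hypothetical.
candidates = {
    (0.5, 100.0): (0.90, 0.02),   # (slope a, intercept b) -> (coherence, penalty)
    (0.8,  50.0): (0.95, 0.04),
    (0.2, 300.0): (0.70, 0.00),
}

def best_line(k):
    return max(candidates,
               key=lambda ab: candidates[ab][0] - k * candidates[ab][1])

# A small k favors the high-coherence line; a large k favors zero penalty,
# reflecting the priority on preventing false results.
```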