GeneMates: an R package for detecting horizontal gene co-transfer between bacteria using gene-gene associations controlled for population structure

Background Horizontal gene transfer contributes to bacterial evolution through mobilising genes across various taxonomical boundaries. It is frequently mediated by mobile genetic elements (MGEs), which may capture, maintain, and rearrange mobile genes and co-mobilise them between bacteria, causing horizontal gene co-transfer (HGcoT). This physical linkage between mobile genes poses a great threat to public health as it facilitates dissemination and co-selection of clinically important genes amongst bacteria. Although rapid accumulation of bacterial whole-genome sequencing data since the 2000s enables study of HGcoT at the population level, results based on genetic co-occurrence counts and simple association tests are usually confounded by bacterial population structure when sampled bacteria belong to the same species, leading to spurious conclusions. Results We have developed a network approach to explore WGS data for evidence of intraspecies HGcoT and have implemented it in R package GeneMates (github.com/wanyuac/GeneMates). The package takes as input an allelic presence-absence matrix of interested genes and a matrix of core-genome single-nucleotide polymorphisms, performs association tests with linear mixed models controlled for population structure, produces a network of significantly associated alleles, and identifies clusters within the network as plausible co-transferred alleles. GeneMates users may choose to score consistency of allelic physical distances measured in genome assemblies using a novel approach we have developed and overlay scores to the network for further evidence of HGcoT. Validation studies of GeneMates on known acquired antimicrobial resistance genes in Escherichia coli and Salmonella Typhimurium show advantages of our network approach over simple association analysis: (1) distinguishing between allelic co-occurrence driven by HGcoT and that driven by clonal reproduction, (2) evaluating effects of population structure on allelic co-occurrence, and (3) direct links between allele clusters in the network and MGEs when physical distances are incorporated. Conclusion GeneMates offers an effective approach to detection of intraspecies HGcoT using WGS data.


Background
Horizontal gene transfer (HGT) or lateral gene transfer accelerates bacterial genome innovation and evolution [1]. It contributes to gene flows across taxonomic boundaries and thus variation in accessory gene content both within and between species [2,3]. Mobile genetic elements (MGEs), such as plasmids, bacteriophages, and transposons, are common vectors of acquired genes in HGT [4]. Particularly, when acquired genes are physically linked (namely, co-localised in an MGE or otherwise physically close in DNA molecules), they can be horizontally co-transferred between bacteria, causing positive gene-gene associations known as genetic linkage [5,6].
The rapid accumulation of bacterial whole-genome sequencing (WGS) data in the most recent two decades [7] enables us to study horizontal gene co-transfer (HGcoT) at the population level. Since bacteria reproduce asexually and HGT can occur across different levels of taxonomic boundaries [2], gene-gene associations that cannot be completely explained by bacterial population structure (which determines the distribution of co-inherited genes) suggests HGcoT [8]. Consequently, it is usually trivial to identify candidates of interspecies HGcoT using simple association tests (such as chisquared tests and simple logistic regression), whereas for detecting intraspecies HGcoT, we must overcome two related challenges arising from the presence of population structure within a species: (1) how to control for population structure in association tests; and (2) how to accurately estimate or represent the population structure to be controlled for.
Univariate linear mixed models (LMMs), which have been widely used in human genome-wide association studies (GWAS) [9] and recently applied to bacterial GWAS [10], provide a solution to address both challenges. Each model explains a response variable using a fixed effect of an independent variable and mixed random effects of population structure and environmental factors. For each LMM, the population structure is represented by a relatedness matrix, whose principal components (PCs) can be used for an orthonormal transformation of genetic variation underlying the population structure [11]. McVean demonstrates that not only do these PCs simplify computations, but also correlate with bacterial genealogies [12]. Compared to phylogeny-based association tests, such as phylogenetically independent contrasts and phylogenetic generalised least squares [13], LMMs do not rely on specific phylogenetic models nor assume that mutation rather than recombination dominates genetic variation.
Here, we introduce R package GeneMates, which implements a novel network approach to the identification of HGcoT within a bacterial species. This approach takes as input specific information extracted from bacterial WGS data and produces a network showing allele-level evidence of HGcoT. We validated GeneMates using published WGS data from two bacterial species, Escherichia coli and Salmonella enterica, with which we identify horizontally co-transferred antimicrobial resistance (AMR) genes. We also provide helper scripts to assist users in preparing necessary input files from standard formats. Since GeneMates is theoretically applicable to any kind of acquired genes in bacteria, it has the potential to also be used for investigating structures and dissemination of other mobile gene clusters of interest, such as physically linked virulence genes.

Implementation
GeneMates consists of R functions performing network construction, topological analysis, and data visualisation. It works at the allele level to track recent HGcoT. Particularly, we assume that bacterial isolates under study will typically be collected in a period that is too short to accumulate any mutation in recently acquired genes of interest, such as AMR genes. In addition, for convenience we assume that every one of these acquired genes has at most one allele in each genome because it is not possible to reliably resolve co-occurring alleles of the same gene using short reads, which by far account for the majority of WGS data.
We developed a network approach to integrate and visualise evidence of physical linkage between alleles of acquired genes in bacteria. In our network, nodes represent alleles and weighted directed edges reflect the strength of evidence. GeneMates produces such a network in each run. Let vector e denote an edge, then it represents a linear model Y ∼ X, where scalar variables Y and X denote presence-absence of alleles Y and X in an isolate, respectively. This model explains the distribution of allele Y (response allele) using that of allele X (explanatory allele) and covariates. For GeneMates, in particular, the covariates are isolate projections from an orthonormal transformation of the population structure based on a core-genome relatedness matrix. Correspondingly, the edge is directed from node X to node Y. Users may filter edges of a resulting network based on an association score s a (e) and a distance score s d (e) in order to identify edges showing strong evidence of physical linkage, and carry out topological analysis on the filtered network afterwards. In following subsections, we describe key elements of our approach. Additional details of implementation are provided in Section 3 of Additional file 1. Figure 1 illustrates our work flow for network construction -the core functionality of GeneMates. In order to integrate functions implementing this work flow, we created a wrapper function findPhysLink (find physical Fig. 1 An overall workflow of function findPhysLink. In this flowchart, cylinders denote both the input WGS data and a non-redundant (namely, with no sequence duplication) reference database of acquired genes; rounded rectangles denote two key outputs -an association network and a linkage network when reliable APDs are provided; ordinary rectangles denote intermediate results, which are matrices or tables; each arrow represents a process of specific data analysis, which starts from the input and ends at its outcome, with the process name labelled besides the line.

Network construction
Steps integrated into function findPhysLink are encircled by the dashed rectangular border. Abbreviations: cgSNPs, core-genome single-nucleotide polymorphisms (SNPs), which are restricted to biallelic SNPs present in all isolates for GeneMates; PAM: a binary presence-absence matrix linkage), which takes as input a binary allelic presenceabsence matrix (PAM) of target genes across genomes, a matrix of biallelic core-genome single-nucleotide polymorphisms (cgSNPs), and optionally, a table of allelic physical distances (APDs) for target genes. These inputs can be extracted from read alignments and genome assemblies using our helper scripts (released with Gen-eMates). Function findPhysLink determines nodes and edges of a resulting network and produces tables that can be exported to Cytoscape [14] as node and edge attributes for network visualisation.

Node generation
Assuming m alleles of target genes are detected in n bacterial isolates, GeneMates function importAllelicPAM imports an n × m binary PAM A = (a ij ), where entry a ij = 1 if the j-th allele is found in the i-th isolate, and a ij = 0 otherwise. This function offers two optional filters to discard alleles of insufficient frequencies and/or co-occurrence frequencies. In order to reduce the number of tests for allele-allele associations, we followed the implementation of R package BugWAS [10] and coded importAllelicPAM to de-duplicate each group of identical columns of PAM into a binary vector called an allelic distribution pattern (Section 3.1.2 of Additional file 1). The function uses column means to apply a column-wise zero-centring to the pattern matrix, which is a common technique used for simplifying matrix algebra without affecting the distribution of data points [15].

Edge weights
GeneMates evaluates two kinds of evidence for inference of horizontal co-transfer of alleles X and Y. The first evidence is a significant positive fixed effect of X on presence-absence of Y when controlling for bacterial population structure. Testing for this effect is an analogue of GWAS, which test for genotype-phenotype associations. Specifically, for edge e in an association network, Gen-eMates function lmm estimates parameters of a univariate LMM using a residual maximum likelihood (REML) approach to test for the fixed effect of X (Section 3.1 in Additional file 1), and another function evalPL transforms the estimated effect sizeβ and its Bonferroni-corrected p-value into an association score s a (e) with possible values 1 (significant positive association), -1 (significant negative association), and 0 (insignificant association). See Figure s1 and Section 3.3.1 in Additional file 1 for details.
The second evidence comes from consistent physical distances between alleles X and Y (namely, APDs) in different bacterial genomes as structural variation is likely to only occur at a limited level within a mobile gene cluster in a short period. For instance, the same AMR gene cluster sul2-strA-strB has been circulating amongst Gramnegative bacteria for decades due to its association with plasmids and transposons [16]. Since APDs are measured in genome assemblies, whose completeness determines the amount of measurable APDs when X and Y are colocalised in the same genomic region, for edge e, we also consider its distance measurability m in (e) -the percentage of genomes in which the APDs between X and Y are actually measurable. This measurability value is calculated by GeneMates function summariseDist, which also evaluates the consistency of APDs included for the distance assessment and assigns a consistency score c(e) with values -1 (evidence against physical linkage), 0 (insufficient evidence) and 1 (evidence supporting physical linkage). Notably, this function estimates the probability of distance identity-by-descent (IBD) and compares it to a user predefined threshold (default: 0.9) for the assignment of each consistency score. (Section 3.3.2 and Figure s2 in Additional file 1) A summary distance score is thereby defined for edge e as the consistency score weighted by measurability: s d (e) = m in (e)c(e). Finally, the association and distance scores are summed to get a linkage score s(e), reflecting the evidence of physical linkage, where − 2 ≤ s(e) ≤ 2 because 0 ≤ m in (e) ≤ 1, and s(e) has five levels (Table s1). Particularly, we define a linkage network as an association network in which weights of each edge consist of a fixed-effect size and a linkage score (Section 3.3 in Additional file 1).

Network visualisation and topological analysis
GeneMates comprises several functions used for displaying resulting networks and exploring network topology for evidence of HGcoT under given conditions. For instance, function mkNetwork (make network) extracts user-specified node attributes (such as the frequency and associated AMR phenotype of each allele) and edge attributes (such as the estimated fixed effect sizeβ and linkage score s of each edge) from result tables of findPhysLink and prints them to Cytoscape-compatible text files for network visualisation. Function extractSubgraphs follows a predefined node list to pull out subnetworks from a parental network produced by find-PhysLink. Particularly, the subnetworks can be maximal cliques identified using function max_cliques of R package igraph [17]. In addition, function countNeighbours lists the number of neighbours per node in a network and function getClusterMemberCooccurrence identifies isolates in which member alleles of a selected subnetwork are co-occurring.

Results
We assessed the performance of GeneMates and validated our methodology using published WGS data sets of two bacterial pathogens of great clinical concern: multidrugresistant E. coli and S. enterica serovar Typhimurium. Genomes in these well studied data sets have distinct population structures, harbour diverse AMR genes and MGEs, and show known gene-gene associations that we expected GeneMates to identify. See Section 4 of Additional file 1 for details of materials and methods.

Characteristics of example data sets
Both the E. coli and Salmonella data sets consisted of paired-end Illumina WGS reads, generated from 169 E. coli isolates collected during the Global Enteric Multicentre Study [18,19] and 359 isolates of typical S. Typhimurium Definitive Type 104 [20], respectively. See Additional file 2 for detailed isolate information.

AMR gene content
In genomes of the 169 E. coli isolates, we identified 178 alleles of 33 AMR genes conferring resistance to eight antimicrobial classes ( Figure s3). The four known intrinsic AMR genes of E. coli (ampH, ampC1, ampC2, and mrdA) displayed higher frequencies (> 87%) than the 29 acquired AMR genes (< 63%). Altogether, we detected 67 alleles of acquired AMR genes, including 45 alleles showing frequencies less than 3%. In genomes of the 359 Salmonella isolates, we identified 57 alleles of 24 AMR genes (conferring resistance to six antimicrobial classes), including a single allele of the known intrinsic AMR gene (aac6-Iaa) of S. enterica and 56 alleles of 23 acquired AMR genes ( Figure s5). Notably, five acquired AMR genes (sul1, aadA, bla CARB , tet(G), and floR) that are known to be frequently present in Salmonella genomic island 1 (SGI1) [21] were only detected in the dominant lineage of collected Salmonella genomes (Figure s6), whereas alleles of other acquired AMR genes were sparsely distributed across a core-genome phylogenetic tree of these Salmonella genomes ( Figure s7). Further, we identified four and one clusters of identically distributed alleles in E. coli and Salmonella genomes, respectively ( Figure s8).

Core-genome SNP sites
We analysed chromosomal single-nucleotide polymorphisms (SNPs) of each species using the method described in Section 4.2 of Additional file 1. Particularly, we define cgSNP sites as SNP sites detected in all chromosomes and outside of repetitive or prophage regions. Altogether, numbers of cgSNP sites used for correcting for population structure (namely, biallelic cgSNP sites) of the 169 E. coli genomes and 359 Salmonella genomes were 209,097 and 2,316, respectively. The percentage of total genetic variation captured by PCs of each cgSNP matrix is illustrated in Figure s9a. Furthermore, the REML estimate of parameter λ (which reflects the effect of population structure on the distribution of the response allele) in the null LMM for presenceabsence of each allele of AMR genes perfectly predicted whether ≥ 5 PCs had significant effects (Bonferronicorrected p-values ≤ 0.05) on the distribution of this allele ( Figure s9b).

Effects of controlling for population structure
In order to evaluate the effect of controlling for population structure using LMMs when measuring associations between alleles of acquired AMR genes, we compared unadjusted p-values of fixed effects estimated using the LMMs to those estimated using simple penalised logistic models (PLMs) [22] (Fig. 2). Particularly, we considered a fixed effect estimated with either kind of models significant if its Bonferroni-corrected p-value was ≤ 0.05. Figure 3 illustrates a directed comparative network for detected alleles of AMR genes in each example data set. In this network, each edge starts from an explanatory allele and terminates at a response allele, representing a significant fixed effect of the explanatory allele in an LMM or PLM or both. For the 67 alleles in E. coli genomes, 3,364 LMMs were applied to 60 allele distribution patterns, and 118 significant pairwise associations were detected. Simple PLMs for the same allele patterns of AMR genes in E. coli genomes identified 70 significant associations, 50 of which overlapped with those from LMMs. The resulting comparative network consisted of 45 nodes, 138 edges, and two connected graph components (Fig. 3, E. coli). Of note, one of these components was comprised of rare alleles and was only identified by LMMs (Figure s10). Regarding the 56 alleles in Salmonella genomes, 2,040 LMMs were applied to 48 allele distribution patterns and 112 significant pairwise associations were detected. Simple PLMs for the same allele patterns of AMR genes in Salmonella genomes  a and b) and Salmonella data set (panels c and d), respectively, a scatter plot and box plot of paired p-values are drawn on square-root transformed axes to compare the p-values. Any p-value less than 2.2 × 10 −16 is rounded to 2.2 × 10 −16 owing to the smallest precise floating number in our computer. In panels a and c, black diagonal lines indicate equality between p-values from these two kinds of linear models, and grey dashed lines indicate the p-value corresponding to the Bonferroni corrected p-value of 0.05, which is used in this study as a cut-off for significant associations. Associations that were only significant in PLMs, only significant in LMMs, significant in both PLMs and LMMs, and significant in neither kind of models, are represented by blue, red, purple, and grey circles, respectively Comparative networks for detected alleles of acquired AMR genes in 169 E. coli and 359 Salmonella genomes. Each node represents an allele or a cluster of identically distributed alleles, with a diameter proportional to the allele frequency and a fill colour indicating the AMR phenotype encoded. The yellow node border denotes an allele or allele cluster detected in one genome. The edge width is proportional to the strength of a significant association (|β|) determined using an LMM. Edge colours indicate significant associations from LMMs or PLMs or both. Solid and dashed edges represent significant positive (β > 0) and significant negative associations (β < 0), respectively, identified using models indicated by edge colours. The Venn diagram of each panel counts edges under each colour. The shaded area (light green) encircles alleles of known co-transferred AMR genes. AMR classes defined by antimicrobials that bacteria were resistant to: AGly (aminoglycosides), Bla (beta-lactams), Flq (fluoroquinolones), MLS (macrolides, lincosamides, and streptogramins), Phe (phenicols), Sul (sulfonamides), Tet (tetracyclines), Tmt (trimethoprim), and Mix (multiple classes of antimicrobials) identified 48 associations, 36 of which overlapped with those from LMMs. The resulting comparative network consisted of 32 nodes, 124 edges, and one connected graph component (Fig. 3, Salmonella). In addition, for both E. coli and Salmonella data sets, estimates of fixed effects in LMMs displayed a complete sign identity to those in PLMs given a maximum type-I error rate of 0.05.
For most allele pairs, correcting for population structure using LMMs yielded greater p-values than did PLMs: the majority of fixed effects (74% for E. coli and 85% for Salmonella) in LMMs became less significant than those in PLMs (Fig. 2b, d). By contrast, associations in some allele pairs became more significant (that is, of smaller p-values) after adjusting for population structure, and in most of these pairs (93% and 99% for E. coli and Salmonella, respectively), LMMs identified a moderate to strong structural random effect (namely, parameter estimate 1 <λ 0 ≤ 10 5 in an LMM under the null hypothesis of no fixed effect from an explanatory allele) underlying presence-absence status of one or both paired alleles across isolates. Further, using one-sided two-proportions Z-tests, we found that LMMs showing moderate to strong structural random effects were less likely to show increased p-values than were other LMMs when compared to PLMs, with contrasts of proportions 70% versus 90% (p-value ≈ 0) in LMMs for the E. coli data set and 85% versus 92% (p-value = 0.06) in LMMs for the Salmonella data set.

Effects of adding APDs to association networks
Every APD was measured by a shortest-path distance (SPD), defined as the smallest distance between two given loci in an assembly graph (constructed using Unicycler [23]). This approach allows us to recover query sequences that are split into adjacent contigs and to measure APDs between loci that are located in different contigs, potentially increasing distance measurability of the graph, defined as the percentage of measured APDs in all possible APDs of a complete circular genome. Nonetheless, since SPDs are affected by the topology of each assembly graph, which is usually tangled and partially resolved when only short reads are used for genome assembly, it is necessary to determine appropriate criteria for filtering out inaccurate SPDs. Therefore, we downloaded from GenBank [24] 10 complete reference genomes of multidrug-resistant E. coli and S. Typhimurium, respectively (Tables s2, s3 and Additional file 2), and compared SPDs measured in de novo assembly graphs (constructed from simulated Illumina reads) to true physical distances extracted from original circularised complete genomes (Section 4.6 of Additional file 1). Specifically, we used Bandage [25] to identify BLAST hits of random coding sequences (CDSs) in each assembly graph and to extract SPDs for each pair of hits. As expected, Bandage always recovered more query CDSs from the assembly graph than it did from contigs of the same bacterial genome (Tables s4 and s5).

Filters determined for removing inaccurate SPDs
We considered an SPD accurate if its error fell within a given tolerance range (for instance, ± 2 kbp), and hence defined the accuracy rate as the percentage of accurate SPDs in all SPDs, either filtered or not. In practice, any parameter for BLAST hits and SPD measurement can be taken into account for excluding SPDs. In this study, we assessed the accuracy rate of SPDs measured within various maximum distances and node numbers in each assembly graph when confining BLAST hits to a minimum query coverage and nucleotide identity of 95%. Across E. coli and S. Typhimurium genomes, we constantly saw that the accuracy rate reached ≥ 90% under an error tolerance ± 1 kbp when SPDs were measured within 250 kbp and no more than two nodes (Figures s11 and s12).
Moreover, since the accuracy rate of SPDs measured within contigs stayed above 90% when tolerating errors within ±1 kbp (Figures s13 and s14), we implemented prioritisation of SPDs based on their sources (namely, contigs or assembly graphs) in order to exclude inaccurate SPDs from assembly graphs where repeats had not been resolved by the genome assembler. Specifically, when the SPD between two CDSs is measurable in both a contig and an assembly graph of the same genome, this method overrides the graph-based SPD with its corresponding contig-based SPD, thereby taking advantage of both the high accuracy rate of contig-based SPDs and high measurability of graph-based SPDs. As shown in Tables s6 and s7), the prioritisation method led to an accuracy rate of 100% for the majority of SPDs measured between acquired AMR genes, which are often embedded within tangled sub-graphs owing to surrounding repeats.
We found that inaccurate SPDs measured across two nodes of Salmonella assembly graphs (Table s7) were caused by chimeric alleles (that is, highly similar alleles of the same gene were mistakenly assembled into one allele) in the assembly, owing to the limited capacity of short reads in resolving repeats. We addressed this issue through increasing the threshold for both the nucleotide identity and query coverage of BLAST hits from 95% to 99% (BLAST hits to chimeric alleles were hence discarded) and thereby improved the accuracy rate from two genomes at the cost of reducing distance measurability (Table s8). Accordingly, we applied this adjusted threshold to the measurement of SPDs between acquired AMR genes in the 358 Salmonella draft genomes. Besides, we directly calculated SPDs for strain DT104, whose complete genome is publicly accessible on GenBank.
SPDs between alleles of acquired AMR genes From de novo assemblies of the 169 E. coli genomes, we obtained 1,550 SPDs from 301 allele pairs that were tested for associations (Hence alleles of each pair did not have identical presence-absence status across genomes) and 20 SPDs from nine pairs of identically distributed alleles. The largest SPD was 59,433 bp (measured across 17 nodes) and the greatest node traversal to measure an SPD was across 39 nodes (yielding an SPD of 4,628 bp). The exclusion of SPDs measured across more than two nodes resulted in 673 (43%) SPDs reliably measured for 163 tested allele pairs and 18 (90%) SPDs for eight pairs of identically distributed alleles. From de novo assemblies of the 359 Salmonella genomes, we obtained 2,880 SPDs from 224 allele pairs tested for associations, including 2,322 SPDs between alleles of five SGI1-borne AMR genes (Table s9), and obtained another 10 SPDs from five pairs of identically distributed alleles. The largest SPD was 710,475 bp (measured across 44 nodes) and the greatest node traversal to measure an SPD was across 57 nodes (yielding an SPD of 687,051 bp). We saw large SPDs (> 56 kbp, which were extraordinarily larger than the others) when the node number exceeded 14. The exclusion of SPDs that were greater than 250 kbp and measured across more than two nodes resulted in 994 SPDs from 59 tested allele pairs and eight SPDs from three pairs of identically distributed alleles.
Overall, positively associated alleles of acquired AMR genes in E. coli and Salmonella genomes showed higher measurability of SPDs than those measured between negatively associated alleles. Specifically, as for pairs of positively associated alleles, 64 (72%) out of 89 pairs in E. coli and 25 (26%) out of 97 pairs in Salmonella had at least two SPDs measured, respectively (Tables s10 and s11). Moreover, after removing SPDs that were greater than 250 kbp and measured across more than two nodes, 31 (35%) out of the 89 allele pairs in E. coli and 10 (10%) out of the 97 allele pairs in Salmonella had distance measurability above 75%. On the contrary, no SPD was measurable between negatively associated alleles as these alleles did not co-occur in any genome.

Linkage networks showing support of SPDs to HGcoT
For each species, we created a linkage network, in which nodes represent alleles or clusters of identically distributed alleles of acquired AMR genes and directed edges indicate significant associations obtained from LMMs (Fig. 4). The edge width is proportional to an estimated effect sizeβ, solid lines and dashed lines indicate positive associations and negative ones, respectively, and the edge colour indicates the distance score s d . No filter was applied to distance scores. The linkage network for E. coli consisted of 122 edges linking 46 nodes corresponding to 52 alleles of 26 acquired AMR genes. The distance score followed a bimodal distribution, with 57 out of 83 allele pairs (69%) having s d = 0 (84 edges) and 23 (28%) allele Linkage networks for detected alleles of acquired AMR genes in 169 E. coli and 359 Salmonella genomes. Each node represents an allele or a cluster of identically distributed alleles, with a diameter proportional to the allele frequency and a colour indicating the AMR phenotype encoded. Every edge is directed, starting from an explanatory allele and terminating at a response allele, representing a significant association determined using an LMM. The edge width is proportional to |β| of the explanatory allele in an LMM. Solid edges represent significant positive associations (β > 0) while dashed edges represent significant negative associations (β < 0). The edge colour follows a gradient of the distance score s d to indicate the strength of evidence for physical linkage. The shaded area (light green) encircles alleles of known co-transferred AMR genes. AMR classes defined by antimicrobials that bacteria were resistant to: AGly (aminoglycosides), Bla (beta-lactams), Flq (fluoroquinolones), MLS (macrolides, lincosamides, and streptogramins), Phe (phenicols), Sul (sulfonamides), Tet (tetracyclines), Tmt (trimethoprim), and Mix (multiple classes of antimicrobials) pairs having s d > 0.5 (38 edges). Considering only the edges that yielded s d > 0.5 and connected alleles encoding distinct kinds of resistance phenotypes, aminoglycoside resistance alleles linked to 14 alleles -the largest number of connections, followed by sulfonamide resistance alleles (linked to 11 alleles). By contrast, bla OXA-1 , the only beta-lactam resistance allele having edges with distance scores above 0.5, was linked to two alleles (aadA1-pm.182 and catA1.215). Notably, as shown in Fig. 4, five alleles (dfrA14.227, strB, sul2.168, strA.173, and sul2) formed a cluster that was interconnected by bidirectional edges with high distance scores (s d > 0.6).
The linkage network for Salmonella consisted of 37 alleles (21 AMR genes) connected by 162 edges (Fig. 4). No identically distributed alleles could be collectively represented by a single node in this network due to absence of perfect measurability or consistency of SPDs. The distance score again followed a bimodal distribution, with 95 out of 104 (91%) pairs having s d < 0.05, and seven (7%) pairs (altogether, nine alleles) having s d > 0.5 (13 edges), all of which corresponded to significant positive associations. Considering only the 13 edges having distance scores above 0.5, strB (aminoglycoside resistance) linked to the largest number of alleles (four, altogether), followed by sul2 and dfrA14.79, each connected to two alleles.

Reasons for inconsistency in measured physical distances
In both the E. coli and Salmonella data sets, a lack of consistency in SPDs (namely, c = 0) measured between several positively associated alleles of acquired AMR genes was observed (Tables s10 and s11). We investigated this issue based on GeneMates outputs and identified two common explanations.
First, in many cases diverse genetic structures were found carrying the same combination of alleles of AMR genes. For example, based on assembly graphs of E. coli genomes, we recovered six distinct genetic structures linking alleles bla TEM-214 .147 and tet(A), which showed significant positive associations in both LMMs and PLMs but had six distinct SPDs (Figure s15). We found that these SPDs followed a lineage-specific distribution. As illustrated in Figure s16, allele bla TEM-214 .147 was carried by transposon Tn2, which was common to all the six structures, either in its complete or truncated form. The variety of insertion sites and orientations of this transposon relative to allele tet(A) in E. coli genomes, as well as plausible gene gain/loss events (inferred from structural comparisons), resulted in differences in SPDs measured between these two alleles.
Second, the GeneMates algorithm depreciates physical distances showing IBD for scoring the distance consistency (Section 3.3.2 of Additional file 1). For instance, two alleles of SGI1-borne AMR genes sul1 and aadA2 in Salmonella were frequent amongst the 359 Salmonella isolates, with an occurrence count of 328 (91%) and 323 (90%), respectively (Table s9). Co-occurrence of these two alleles were also frequent, with a count of 318 (89%) in total. SPDs between sul1 and aadA2 were obtained from genome assemblies of 295 (93%) out of the 318 isolates where the alleles were co-occurring. After removing the only SPD measured across more than two nodes (504 bp, across three nodes), we obtained 294 SPDs, consisting of 293 SPDs measured in either contigs or assembly graphs and one SPD measured in the complete chromosome genome of reference strain DT104. All the filtered SPDs were 504 bp, except the one from the complete genome (9,964 bp). Despite this consistency in SPDs, the consistency score c = 0 as all the 294 SPDs were obtained from the same lineage highlighted in Figures s17a (IBD probability of reliable SPDs: 95%) and were possibly resulted from the same genetic and assembly structures related to these two alleles (Figures s17b-d).

Validation of GeneMates
We validated our approach through identification of known and novel physical clusters of mobile AMR genes in the example data sets. Networks were constructed at the allele level for these genes using GeneMates function findPhysLink.
Identifying known clusters of mobile AMR genes The first set of positive controls in our validation study consisted of 28 pairs of AMR genes that are known to be co-mobilised by MGEs between E. coli genomes [19]. As shown in Table 1, LMMs and PLMs identified significant positive associations at the allele level in 19 (68%) and 16 (57%) pairs, respectively. The second set of positive controls for validation consisted of five AMR genes (aadA2, floR, tet(G), bla CARB-2 , and sul1) that are co-localised in the acquired multidrug-resistant element SGI1 in Salmonella genomes [20]. For these genes, LMMs and PLMs identified significant positive associations between eight and ten allele pairs, respectively ( Table 2). The exclusion of allele pairs having s d < 0.6 led to a substantial reduction of co-mobilisation candidates, with 12 out of 33 (36%) allele pairs in E. coli genomes and two out of 20 (10%) allele pairs in Salmonella genomes passed this filter.
Identifying novel clusters of AMR genes Some edges in linkage networks suggested novel physical linkage between several alleles of AMR genes in E. coli and Salmonella genomes. For instance, 20 novel edges in the linkage network for E. coli (Fig. 4) had the maximum association score s a = 1 (0.12 <β ≤ 1 in LMMs) and high distance scores (0.75 ≤ s d ≤ 1), and these edges formed five maximal cliques each consisting of three alleles. Similarly, a three-allele clique showing high distance Table 1 A comparison of significant positive associations in the linkage network for E. coli genomes to known co-localisation of mobile AMR genes Co-localisation of AMR genes in MGEs were previously determined by Ingle, et al. [19]. Each pair of significantly associated alleles (denoted by alleles 1 and 2, regardless of their roles in a linear model) is also identifiable in the comparative network shown in Fig. 3. Directionality of associations is omitted in this table, hence each pair of alleles only appears once in the table, although the alleles may mutually associated in linear models. Note that an AMR gene may have multiple alleles (whose names are listed in column Allele_1 or Allele_2) or no allele (denoted by a dash sign in the table) present in a network. Abbreviations: LMM, linear mixed model; PLM, penalised logistic model; IBD: identity by descent. Symbols indicating whether a significant association is identified by either an LMM or a PLM: ( ), yes; ( ), no. s d : the distance score, which takes into account the distance consistency, measurability, and the probability of IBD. p IBD : an estimate of the probability that APDs used for calculating the s d are in IBD. This probability does not exist when no APD is available scores (0.7 ≤ s d ≤ 0.9) was identified in the linkage network for Salmonella (Fig. 4). As a further validation of GeneMates, we investigated two maximal cliques through inferring their plausible genetic structures and vectors from genome assemblies.
The first clique consisted of alleles dfrA1, aadA1pm.181, and sat-2A detected in E. coli genomes. Between these alleles, LMMs identified significant positive associations (Figure s18a), and SPDs (Table s12) showed complete identity as well as perfect measurability (namely, s d = 1 for every edge of this clique). Co-occurrence of all three alleles was identified in two distantly related genomes (Figure s18b). Assembly graphs revealed co-localisation of these alleles in the same array of gene cassettes in a class-2 integron (Fig. 5a, b). We confirmed that this Table 2 LMM-based significant associations between five alleles of AMR genes in SGI1 Allele_Y and Allele_X denote the response allele and explanatory allele in an LMM Y ∼ X, respectively. An association is denoted by a filled circle in the column LMM when it is significant, otherwise, an unfilled circle is drawn. Directionality is shown in this table for comparing the value ofλ, which denotes an REML estimate of parameter λ for evaluating structural random effects in an LMM. s d : score of APDs. p IBD : an estimate of the probability that APDs used for calculating the s d are in IBD integron was carried by variants (100% coverage and 99% nucleotide identity) of transposon Tn7 (GenBank accession: KX159451). Moreover, each Tn7 variant was interrupted by a distinct insertion sequence (IS), as illustrated in Fig. 5c. In summary, we saw strong evidence for transposon-mediated co-transfer of these alleles between E. coli lineages.
The second clique consisted of alleles strA.55, strB, dfrA14.79, and sul2 detected in Salmonella genomes (Fig. 4). Both LMMs and PLMs identified that associations between all of these alleles were significantly positive (0.69 ≤β ≤ 0.98 in LMMs). Edges between these alleles had distance scores between 0.7 and 0.9 except edges linking strA.55 (0 ≤ s d ≤ 0.25). As shown in Fig. 6a, these four alleles co-occurred in 13 Salmonella genomes from distantly related clades. Using Bandage and nucleotide BLAST, we confirmed co-localisation of these alleles in a 3,084 bp region that was present in 10 out of the 13 genomes with a 100% nucleotide identity and coverage, and these 10 genomes were sparsely distributed across the phylogenetic tree in Fig. 6a. Furthermore, we saw an insertion of allele dfrA14.79 into strA.55, splitting the latter allele into two segments that covered 65.8% and 34.5% of its original length, respectively. In the assembly graph of one of the 10 genomes, ERR026101, we found the 3,084 bp multidrug-resistance (MDR) region in a 6,790 bp node, which appeared as a self-circularised sequence independent to other graph components (Fig. 6b). Using megaBLAST under its default parameters, a sequence search of this MDR region against the NCBI nucleotide database of the Enterobacteriaceae group (taxid: 91347, Fig. 5 Putative physical linkage between three alleles of AMR genes in two E. coli genomes. a A path constituting an inferred MGE in the assembly graph of genome ERR178189. This plot is drawn in Bandage with a double-strand style, where the orientation of each DNA strand is indicated by an arrow-like node end. The width of each node is proportional to its read depth determined by Unicycler. Some nodes not contributing to any MGE-related path were deleted from the original assembly graph for visual conciseness. b A path constituting the other inferred MGE (following the red dashed line) in the assembly graph of genome ERR178173. This plot was drawn in the same way as panel a. c Alignment of the two Tn7 variants we identified in E. coli genomes (ERR178189 and ERR178173) to a reference Tn7 sequence (GenBank accession: KX159451, denoted by green shaded areas). Two direct repeats flanking the ISs, including inverted repeats, are denoted by green and pink boxes, respectively. Reference DNA sequences of ISKpn26 (1,196 bp) and ISEc23 (2,532 bp) were retrieved from database ISFinder [29] in January 2018. Each IS in the resolved region showed a 100% coverage and 99% nucleotide identity to its reference accessed in April, 2018) showed exact matches (100% nucleotide identity and coverage) to a known and widely distributed MDR plasmid pCERC1 (GenBank accession: JN012467) as well as a number of plasmids widely distributed in bacteria of Enterobacterales (Table s13). Therefore, this MDR region was shared by a great variety of plasmids.

Discussion
GeneMates implements a novel network approach to detection of intraspecies HGcoT between bacteria. Compared to existing methods relying on co-occurrence counts, association coefficients, or simple regression models of bacterial genes, GeneMates enables us to test for gene-gene associations when controlling for Distribution and genetic structure of a four-allele clique in Salmonella genomes. a A ring plot displaying co-occurrence events (red track) of the four alleles (darker boxes in grey tracks) against a midpoint-rooted phylogenetic tree of the Salmonella genomes. Clades are coloured by their top-10 most correlated PCs that significantly explained the presence-absence status of an arbitrarily designated response allele sul2. Tips highlighted with red circles denote genomes where the exact 3,084 bp MDR region harbouring all the four alleles were found using nucleotide BLAST. b A putative 6,790 bp plasmid sequence restored from the assembly graph of genome ERR026101. Alleles of four AMR genes were identified in this plasmid, showing 100% nucleotide identity and coverage to reference sequences of these genes population structure, the main confounding factor in bacterial GWAS [26], through incorporating PCs into LMMs. Results in "Effects of controlling for population structure" section show that LMMs of GeneMates retain statistical power of association tests, which is in line with equivalent LMMs from literature (cf., Section 3.4.1 in Additional file 1) [10,27]. In our examples, association networks constructed using GeneMates reveal extensive associations between alleles of horizontally acquired AMR genes (Fig. 3). Moreover, as expected, the majority of p-values from association tests became greater after correcting for population structure using LMMs, while the other p-values saw a reduction, indicating increased significance of associated alleles (Fig. 2). LMMs provide us with another advantage: we only need to estimate three parameters (β, γ , and τ ) besides the intercept term α to fit a model, thereby circumventing the problem of overfitting as well as relaxing the requirement for sample sizes. Notably, long-tailed curves of cumulative percentages of total genetic variations captured by PCs (Figure s9) indicate the necessity of including all PCs obtained from the cgSNP matrix for accurate modelling in association analysis.
On the grounds of examples where stable structures of acquired AMR genes are shared between bacteria via MGEs in a short period of time [16,19,20,28], we have implemented another innovation in GeneMates -the evaluation of APD consistency for evidence of HGcoT. Since the physical distances between two loci in bacterial genomes are correlated with both locus co-occurrence and PCs representing population structure, APDs cannot be directly incorporated into LMMs or PLMs as additional covariates. Instead, GeneMates scores the variation of APDs while taking the population structure into account. As for APDs, it is self-evident that they can be precisely calculated from genetic coordinates in finished-grade genome assemblies, which remain the minority of available bacterial genome sequences. In order to overcome some of the limitations of measuring APDs in draft genomes, we measured APDs in the form of SPDs in assembly graphs, and developed a simulationbased approach to determining reliability filters of SPDs. Using this approach, the accuracy of SPDs from de novo assembly graphs of reference multidrug-resistant E. coli or S. Typhimurium genomes consistently exceeded 90% when the distances were only measured across one or two nodes (Figures s11 and s12), implying a universal filter for other bacterial genomes. Nevertheless, as summarised in "Effects of adding APDs to association networks" section and displayed in Fig. 4, short-read genome assemblies intrinsically confine both the measurability and accuracy of SPDs, causing a loss of evidence for real physical linkage in HGT.
In "Validation of GeneMates" section, the identification of known and novel physical clusters of mobile AMR genes suggests that maximal cliques in linkage networks are useful starting points for recovering structures of horizontally co-transferred genes. A plausible reason is that loci in strong physical linkage tend to predict the presence of each other (namely, mutual positive associations) in HGT, and strong physical linkage is often related to close physical proximity, which in turn increases the measurability of their physical distances, leading to a greater distance score given the same consistency score. In practice, users may apply other filters to the network to identify edges addressing specific questions.
GeneMates also offers a framework (Fig. 1) for further development, such as adding new modules and introducing other statistical models. Our methodology and analysis demonstrated in example studies are applicable to other kinds of acquired genes in haploid genomes of the same species, as long as we can accurately determine alleles of interested genes in each genome. Since Gene-Mates is designed specifically to overcome challenges of identifying intraspecies HGcoT, we did not test its performance in detecting interspecies HGcoT. GeneMates could theoretically be applied to analyse data sets including multiple species. However, users should consider the input SNP matrix carefully, as it determines what scale of population structure is captured in the covariance matrix used in LMMs. In cross-species analyses, the SNP matrix would necessarily be restricted to core-genome sites shared across species, which may represent only a small fraction of individual genomes (depending on the taxonomic breadth of included species). In such cases the PCs would likely capture differences between species but may not effectively capture intraspecies population structure. Furthermore, the inability of short reads to resolve repeats, either through read mapping or de novo assembly, may cause false negatives and errors in allele calls when homologues of a target gene coexist in a genome. Therefore, we expect a better performance of GeneMates in analysing high-quality complete genomes or long-read sequencing data that are able to resolve at least most of the repeats. Finally, biological experiments are necessary as a gold standard for validating candidates of HGcoT.

Conclusions
We have developed R package GeneMates and helper scripts for analysing associations between alleles of acquired genes and for inferring physical linkage between these alleles in HGT. We have also demonstrated utilities of this package using publicly available WGS data of 169 E. coli isolates and 359 Salmonella isolates. In our study, functions of GeneMates identified clusters of co-transferred AMR alleles in known and novel MGEs, enabling further investigations of HGcoT amongst a wider range of bacterial species. GeneMates differs from contemporary methods for bacterial GWAS in three aspects. First, it focuses on gene-gene associations rather than genotype-to-phenotype associations. Second, it only performs association tests for acquired genes rather than genome-wide SNPs. Third, it evaluates evidence of physical distances between associated loci for inference of physical linkage, although users may opt to turn this utility off to only analyse gene-gene associations. GeneMates offers a scalable and versatile approach that is readily applicable to various kinds of horizontally acquired genes. It is, however, confined by limitations of short-read genome assembly, and its power will increase in the future as we are accumulating complete genomes and enhancing our ability in resolving repeats using sequencing data.