### Generation of the datasets

We used publicly available AMP sequences to train and test AMP predictors. To build a non-redundant AMP dataset, we first downloaded all available sequences from two manually curated databases: the Antimicrobial Peptide Database [44] (APD3, http://aps.unmc.edu/AP) and the Database of Anuran Defense Peptides [39] (DADP, http://split4.pmfst.hr/dadp). Since APD3 is frequently updated, we used a static version scraped from the website on March 20, 2019, comprising 3061 sequences. Version 1.6 of DADP contains 1923 distinct mature AMPs. We concatenated these two sets and removed duplicate sequences, producing a non-redundant (positive) set of 4173 distinct, mature AMP sequences, all 200 amino acid residues in length or shorter. AMPs that are highly similar to each other at the sequence level were kept as separate entries, since small changes in amino acid composition may lead to large changes in AMP activity [45]. Moreover, it is important to maintain as large a dataset as possible for better training of a deep learning model [17].

Training and testing binary classification models require a negative set: a collection of peptides known to lack antimicrobial activity. Since there are no sequence catalogs of peptides devoid of antimicrobial activity, studies in the field typically select their non-AMP sequences from UniProt [46] (https://www.uniprot.org). This may involve excluding several simple keywords (e.g. antimicrobial, antibiotic) to filter out potential AMPs [14, 15], or additionally removing all secretory proteins [26], as AMPs are characteristically secreted peptides [47]. The former approach alone is not sufficiently rigorous, because AMP annotation is inconsistent and varies between sources. While keyword filtering may leave some differently annotated AMPs in the set, removing all secretory proteins creates a learning gap: the model never encounters secreted proteins that lack antimicrobial activity. Thus, it is important to balance these two strategies when selecting non-AMP sequences.

We designed a rigorous selection strategy for our non-AMP sequences (Supplementary Fig. S3), using sequences from the UniProtKB/Swiss-Prot database [46] (2019_02 release), which only contains manually annotated and reviewed records from the UniProt database. First, we downloaded sequences that are 200 amino acid residues or shorter in length (matching the maximum peptide length in the AMP set), excluding those with annotations containing any of the following 16 keywords related to antimicrobial activities: {antimicrobial, antibiotic, antibacterial, antiviral, antifungal, antimalarial, antiparasitic, anti-protist, anticancer, defense, defensin, cathelicidin, histatin, bacteriocin, microbicidal, fungicide}. Second, duplicates and sequences with residues other than the 20 standard amino acids were removed. Third, a set of potential AMP sequences annotated with any of the 16 selected keywords was downloaded and compared with our candidate negative set. We noted instances where a sequence with multiple functions was annotated separately in multiple records within the database, and removed sequences in common between candidate non-AMPs and potential AMPs. The candidate non-AMP sequences were also checked against the positive set to remove AMP sequences that lack the annotation in UniProtKB/Swiss-Prot. Finally, 4173 sequences were sampled from the remaining set of 128,445 non-AMPs, matching the number and length distribution of sequences in the positive set. An exception to the length distribution matching occurred when the length of a particular AMP sequence had no perfect match in the set of non-AMP sequences; in these instances, we chose the non-AMP sequence with the closest length. The length distributions were matched so that the model would not learn to distinguish classes based on sequence lengths.
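The length-matched sampling step can be sketched as follows. This is a minimal illustration under our own naming, not the exact code used for AMPlify; it draws one non-AMP per AMP without replacement, falling back to the closest available length when no exact match remains:

```python
import random

def sample_length_matched(amp_lengths, non_amp_pool, seed=0):
    """Sample one non-AMP per AMP, matching sequence lengths.

    amp_lengths: lengths of the positive (AMP) sequences.
    non_amp_pool: candidate non-AMP sequences (strings).
    Falls back to the closest available length when no exact match exists.
    """
    rng = random.Random(seed)
    # Group the candidate pool by length for fast exact-length lookup.
    by_len = {}
    for seq in non_amp_pool:
        by_len.setdefault(len(seq), []).append(seq)
    sampled = []
    for L in amp_lengths:
        # Prefer an exact length match; otherwise take the closest length.
        target = L if L in by_len else min(by_len, key=lambda k: abs(k - L))
        bucket = by_len[target]
        pick = bucket.pop(rng.randrange(len(bucket)))
        if not bucket:
            del by_len[target]  # length exhausted: sampling is without replacement
        sampled.append(pick)
    return sampled
```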

The positive and negative sets were both split 80%/20% (3338/835) into training and test sets, respectively. We note that AMP sequences in our test partition have no overlap with the training sets of iAMPpred and iAMP-2L, but do share 196 sequences with the training set of AMP Scanner Vr.2.

### Model implementation

AMPlify is implemented in Python 3.6.7, using the Keras library 2.2.4 [48] with Tensorflow 1.12.0 [49] as the backend. The raw output of the model can be interpreted as a probability score; sequences with scores > 0.5 are considered AMPs and those ≤ 0.5 non-AMPs. We used binary cross-entropy as the loss function, and the Adam algorithm [50] for optimizing weights. The dropout technique [51] was applied during training to prevent the model from over-fitting. The original positive and negative training sets were both split into sets of {667, 667, 668, 668, 668} sequences, and five training and validation set pairs were constructed by leaving one set out for validation to build five single sub-models. To optimize computational time and avoid overfitting, we applied early stopping during the training of each single sub-model. The validation accuracy was monitored at each training epoch, and the training process was stopped if validation accuracy did not improve over 50 consecutive epochs. The model weights from the epoch with the best validation accuracy were selected as the optimal weights. The output probabilities from the five single sub-models were averaged to yield an ensemble model.
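The early-stopping rule and the ensemble averaging can be illustrated with a short sketch (function names are ours; in practice the early-stopping behavior is provided by a Keras callback):

```python
def best_epoch_with_early_stopping(val_accuracies, patience=50):
    """Return the index of the epoch with the best validation accuracy,
    scanning until accuracy has not improved for `patience` epochs."""
    best_epoch, best_acc, wait = 0, float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc, wait = epoch, acc, 0
        else:
            wait += 1
            if wait >= patience:
                break  # stop training: no improvement for `patience` epochs
    return best_epoch

def ensemble_score(sub_model_probs):
    """Average the probability outputs of the sub-models and
    apply the 0.5 decision threshold."""
    p = sum(sub_model_probs) / len(sub_model_probs)
    return p, "AMP" if p > 0.5 else "non-AMP"
```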

Reflecting the composition of the sequences in the positive and negative sets, AMPlify only considers sequences that are 200 amino acid residues or shorter and contain only the 20 standard amino acid residues.

### Hyperparameter tuning and model architecture

In deep neural networks, the optimal hyperparameters, unlike model weights, cannot be learned directly from the training process, yet they play an important role in model performance. Thus, various combinations of hyperparameters must be compared in order to select the best set. Here we applied stratified 5-fold cross-validation on the entire training set to tune the model and find the best set of hyperparameters for the model architecture, as well as for training settings, including dropout rates and optimizer settings. For a fair comparison, we kept the same splits of sequences within cross-validation across all hyperparameter combinations. During hyperparameter tuning, we monitored the average performance on the validation partitions of cross-validation. Note that these validation partitions within cross-validation are different from the validation sets used for early stopping; the latter are additionally derived from the training partitions during the cross-validation process. The set of hyperparameters with the highest average cross-validation accuracy was chosen to train the final prediction model.
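A stratified split preserves the AMP/non-AMP ratio in every fold. As a minimal round-robin illustration under our own naming (in practice a library routine such as scikit-learn's StratifiedKFold performs this step):

```python
def stratified_kfold_indices(labels, k=5):
    """Indices for stratified k-fold CV: each fold preserves the
    positive/negative ratio of `labels` (1 = AMP, 0 = non-AMP)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    folds = [[] for _ in range(k)]
    for group in (pos, neg):
        for j, idx in enumerate(group):
            folds[j % k].append(idx)  # deal each class round-robin across folds
    # Each fold serves once as the validation partition.
    return [(sorted(set(range(len(labels))) - set(f)), sorted(f))
            for f in folds]
```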

The AMPlify architecture includes three main components: 1) a bidirectional long short-term memory (Bi-LSTM) layer, 2) a multi-head scaled dot-product attention (MHSDPA) layer, and 3) a context attention (CA) layer (Fig. 1). To convert the original peptides into a mathematically processable format, each sequence is represented by a series of one-hot encoded vectors over an alphabet of 20 amino acids, yielding **x**_{1}, **x**_{2}, …, **x**_{L}, where *L* is the length of the sequence and each **x**_{t} is a 20-dimensional vector of zeros and ones with ‖**x**_{t}‖_{1} = 1 (*t* = 1, 2, …, *L*). The Bi-LSTM layer takes these one-hot encoded vectors as input and encodes positional information for each residue from both the forward and backward directions; the output vector for each residue is the concatenation of the vectors from the two directions. The best tuned dimensionality for each direction of the Bi-LSTM layer was 512, making the output of the entire Bi-LSTM layer *d*_{h} = 512 × 2 = 1024 dimensional. Outputs from all residue positions of the Bi-LSTM layer are returned as the input for the next layer. The best tuned dropout rate of 0.5 was applied to the input of the Bi-LSTM layer. The Bi-LSTM encoding of the residues within a given sequence can be represented as a series of vectors \({\mathbf{h}}_{\mathbf{t}}\in {\mathbb{R}}^{d_h}\) (*t* = 1, 2, …, *L*), and the entire sequence can be represented as a matrix with all **h**_{t} vectors packed as

$$H={\left({\mathbf{h}}_{\mathbf{1}},{\mathbf{h}}_{\mathbf{2}},\dots, {\mathbf{h}}_{\mathbf{L}}\right)}^{\mathrm{T}}\in {\mathbb{R}}^{L\times {d}_h}.$$
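As a concrete illustration of the input representation, a minimal one-hot encoder might look like the following (the alphabetical residue ordering is our assumption, not necessarily the one used in AMPlify):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq):
    """Encode a peptide as an L x 20 matrix whose rows are the
    one-hot vectors x_1, ..., x_L."""
    X = np.zeros((len(seq), len(AMINO_ACIDS)))
    for t, aa in enumerate(seq):
        X[t, AA_INDEX[aa]] = 1.0
    return X
```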

Next, the MHSDPA layer searches for relations between different residues in *n* different representation subspaces [30] (i.e. different attention heads) to further encode the sequence, where *n* is a hyperparameter to be tuned. Each residue first obtains an intermediate representation within each head, computed as a weighted average over transformed vectors of all residues derived from their Bi-LSTM representations. The results from all heads are then concatenated and mapped back to the original dimensionality. We adapted Vaswani and co-workers’ approach [30] to calculate the attention weights and outputs for the MHSDPA layer. Our implementation was adapted from the GitHub repository at https://github.com/CyberZHG/keras-multi-head, with rectified linear unit (ReLU) activation functions and bias terms added to all linear transformations.

In further detail, to obtain attention weights for different residues of a sequence within a head *i*, we calculate a set of queries \({Q}^i\in {\mathbb{R}}^{L\times {d}_k}\), keys \({K}^i\in {\mathbb{R}}^{L\times {d}_k}\), and values \({V}^i\in {\mathbb{R}}^{L\times {d}_v}\) by transforming *H* as follows:

$${Q}^i=\mathrm{ReLU}\left(H{W}^{Q^i}+{B}^{Q^i}\right)$$

$${K}^i=\mathrm{ReLU}\left(H{W}^{K^i}+{B}^{K^i}\right)$$

$${V}^i=\mathrm{ReLU}\left(H{W}^{V^i}+{B}^{V^i}\right)$$

where \({W}^{Q^i},{W}^{K^i}\in {\mathbb{R}}^{d_h\times {d}_k}\) and \({W}^{V^i}\in {\mathbb{R}}^{d_h\times {d}_v}\) are weight matrices, and \({B}^{Q^i}={\left({\mathbf{b}}^{{\mathbf{Q}}^{\mathbf{i}}},{\mathbf{b}}^{{\mathbf{Q}}^{\mathbf{i}}},\dots, {\mathbf{b}}^{{\mathbf{Q}}^{\mathbf{i}}}\right)}^{\mathrm{T}}\in {\mathbb{R}}^{L\times {d}_k}\), \({B}^{K^i}={\left({\mathbf{b}}^{{\mathbf{K}}^{\mathbf{i}}},{\mathbf{b}}^{{\mathbf{K}}^{\mathbf{i}}},\dots, {\mathbf{b}}^{{\mathbf{K}}^{\mathbf{i}}}\right)}^{\mathrm{T}}\in {\mathbb{R}}^{L\times {d}_k}\) and \({B}^{V^i}={\left({\mathbf{b}}^{{\mathbf{V}}^{\mathbf{i}}},{\mathbf{b}}^{{\mathbf{V}}^{\mathbf{i}}},\dots, {\mathbf{b}}^{{\mathbf{V}}^{\mathbf{i}}}\right)}^{\mathrm{T}}\in {\mathbb{R}}^{L\times {d}_v}\) are bias matrices. We set transformation dimensions as *nd*_{k} = *nd*_{v} = *d*_{h} following what has been previously proposed [30]. A square matrix *A*^{i} ∈ *ℝ*^{L × L}, which contains weight vectors to calculate intermediate representations of all residues within head *i*, is computed as:

$${A}^i={\mathrm{softmax}}_{\mathrm{row}}\left(\frac{Q^i{K^i}^{\mathrm{T}}}{\sqrt{d_k}}\right)$$

where the dot product of queries and keys is scaled by a factor of \(\frac{1}{\sqrt{d_k}}\), and the softmax function is applied to each row of the matrix for normalization. The intermediate representation of the sequence within head *i* is then computed as:

$${Z}^i={\left({\mathbf{z}}_{\mathbf{1}}^{\mathbf{i}},{\mathbf{z}}_{\mathbf{2}}^{\mathbf{i}},\dots, {\mathbf{z}}_{\mathbf{L}}^{\mathbf{i}}\right)}^{\mathrm{T}}={A}^i{V}^i\in {\mathbb{R}}^{L\times {d}_v}$$

where each single vector \({\mathbf{z}}_{\mathbf{t}}^{\mathbf{i}}\in {\mathbb{R}}^{{\mathrm{d}}_v}\) (*t* = 1, 2, …, *L*) denotes the intermediate representation of each residue of the sequence with dimensionality *d*_{v}. The concatenated matrix \(Z=\left({Z}_{L\times {d}_v}^1,{Z}_{L\times {d}_v}^2,\dots, {Z}_{L\times {d}_v}^n\right)\in {\mathbb{R}}^{L\times {nd}_v}\) is further transformed to get the final output of the current layer as follows:

$$M={\left({\mathbf{m}}_{\mathbf{1}},{\mathbf{m}}_{\mathbf{2}},\dots, {\mathbf{m}}_{\mathbf{L}}\right)}^{\mathrm{T}}=\mathrm{ReLU}\left[Z{W}^O+{B}^O\right]\in {\mathbb{R}}^{L\times {d}_h}$$

where \({W}^O\in {\mathbb{R}}^{nd_v\times {d}_h}\) is a weight matrix and \({B}^O={\left({\mathbf{b}}^{\mathbf{O}},{\mathbf{b}}^{\mathbf{O}},\dots, {\mathbf{b}}^{\mathbf{O}}\right)}^{\mathrm{T}}\in {\mathbb{R}}^{L\times {d}_h}\) is a bias matrix. Each vector \({\mathbf{m}}_{\mathbf{t}}\in {\mathbb{R}}^{{\mathrm{d}}_h}\) (*t* = 1, 2, …, *L*) denotes the new representation of the corresponding residue of the sequence with dimensionality *d*_{h}. The best head number tuned for this layer was *n* = 32, with *d*_{k} = *d*_{v} = 32.
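The MHSDPA computations above can be condensed into a NumPy sketch (weights are random placeholders for illustration; the repeated-row bias matrices *B* are realized via broadcasting of the bias vectors **b**):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mhsdpa(H, params):
    """Multi-head scaled dot-product attention over H (L x d_h).

    params["heads"]: n dicts, each with weight matrices WQ, WK (d_h x d_k),
    WV (d_h x d_v) and bias vectors bQ, bK, bV; params["WO"], params["bO"]:
    the shared output transform (n*d_v x d_h) and its bias.
    """
    heads = []
    for p in params["heads"]:
        # ReLU-activated linear transforms with bias, as in AMPlify's
        # adaptation of scaled dot-product attention.
        Q = relu(H @ p["WQ"] + p["bQ"])
        K = relu(H @ p["WK"] + p["bK"])
        V = relu(H @ p["WV"] + p["bV"])
        d_k = Q.shape[-1]
        A = softmax_rows(Q @ K.T / np.sqrt(d_k))  # L x L attention weights
        heads.append(A @ V)                       # Z^i = A^i V^i, L x d_v
    Z = np.concatenate(heads, axis=-1)            # L x (n*d_v)
    return relu(Z @ params["WO"] + params["bO"])  # M, back to L x d_h
```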

Finally, the CA layer gathers information from the MHSDPA layer by reducing *L* encoded vectors into a single weighted average summary vector **s**. We followed Yang and co-workers’ approach [31] to perform this operation, and adapted the implementation from the GitHub repository at https://github.com/lzfelix/keras_attention. The weight vector **α** ∈ *ℝ*^{L} is calculated using

$$\boldsymbol{\upalpha} =\mathrm{softmax}\left(\left(\tanh \left( MW+B\right)\right)\mathbf{u}\right)$$

where \(W\in {\mathbb{R}}^{d_h\times {d}_h}\) is a weight matrix, \(B={\left(\mathbf{b},\mathbf{b},\dots, \mathbf{b}\right)}^{\mathrm{T}}\in {\mathbb{R}}^{L\times {d}_h}\) is a bias matrix, \(\mathbf{u}\in {\mathbb{R}}^{d_h}\) is a context vector, and the softmax function provides weight normalization. The summary vector \(\mathbf{s}\in {\mathbb{R}}^{d_h}\) is then computed as:

$$\mathbf{s}={M}^{\mathrm{T}}\boldsymbol{\upalpha} =\sum_{t=1}^L{\alpha}_t{\mathbf{m}}_{\mathbf{t}}$$

where *α*_{t} denotes each component in the weight vector. Vector **s** summarizes information of the entire sequence into a single vector, and it is passed through the output layer of a single neuron with a sigmoid activation function for classification. The best tuned dropout rate of 0.2 was applied to the input of the CA layer during training.
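The two CA-layer equations reduce to a few lines of NumPy (again a sketch with placeholder parameters; the bias matrix *B* is realized by broadcasting the vector **b**):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_attention(M, W, b, u):
    """Reduce the L x d_h matrix M to a single summary vector s,
    following the context-attention formulation of Yang et al."""
    alpha = softmax(np.tanh(M @ W + b) @ u)  # per-residue weights, sum to 1
    s = M.T @ alpha                          # s = sum_t alpha_t * m_t
    return s, alpha
```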

In addition to the hyperparameters of the model architecture, the hyperparameters of the optimizer were optimized through cross-validation. A batch size of 32 and a default learning rate of 0.001 were found to be the best for the AMP prediction task.

### Model evaluation

The performance of AMPlify was evaluated using five metrics: accuracy, sensitivity, specificity, F1 score and area under the receiver operating characteristic curve (AUROC).
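Four of these five metrics follow directly from the confusion-matrix counts; a minimal reference implementation (treating AMP as the positive class) is:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 from binary labels
    (1 = AMP, 0 = non-AMP). AUROC additionally requires the raw
    probability scores and is omitted here."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sens = tp / (tp + fn)  # recall on the positive (AMP) class
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sens,
        "specificity": spec,
        "f1": 2 * prec * sens / (prec + sens),
    }
```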

The architecture of AMPlify was compared with simpler variants having fewer hidden layers, using stratified 5-fold cross-validation on the training set to measure the value added by each layer as the architecture grew more complex. The final version of AMPlify trained on the entire training set, as well as its five single sub-models, was compared with three other tools on the test set we built: iAMP-2L [15], iAMPpred [16] and AMP Scanner Vr.2 [26]. All comparators were evaluated using their original models available online.

In addition, as the only comparator providing methods for re-training, AMP Scanner Vr.2 was cross-validated and re-trained on our training set for a fairer comparison. We note that, since our dataset differs slightly from those used by the other methods, the number of epochs required to train a deep learning model well may differ between datasets. Keeping all other hyperparameters the same as in the original model, we cross-validated and re-trained AMP Scanner Vr.2 with two different stopping settings: using the optimal fixed number of epochs as reported [26], and using early stopping.

### AMP discovery pipeline

A primarily homology-based approach was used to supply novel candidate AMP sequences to AMPlify for further evaluation. The pipeline and its results are summarized in Supplementary Fig. S4 and are detailed below.

Sequences matching the search phrase “((antimicrobial) AND precursor) AND amphibian” were downloaded from the NCBI Nucleotide database on January 4th, 2016 and aligned to the draft bullfrog genome [33] (version 3) using GMAP [52] (version 20170424) with the following parameters: `-A --max-intronlength-ends=200000 -O -n20 --nofails`.

To refine the putative AMP loci, the gene prediction pipeline MAKER2 [53] (version 2.31.8 running under PERL version 5.24.0 with augustus [54] version 3.2.1, exonerate [55] version 2.2.0, genemark [56] version 2.3c, and snap [57] version 2006-07-28) was applied to the 231 genomic scaffolds with alignment hits from GMAP using default settings. The MAKER2 pipeline can use orthogonal evidence from related protein or transcript sequences when available to generate a list of high confidence genes. Protein evidence consisted of three sets of sequences: sequences matching the search phrase “((antimicrobial) AND precursor) AND amphibian” from the NCBI protein database that were downloaded on December 31st, 2015; experimentally validated non-synthetic amphibian antibacterial peptide sequences downloaded from CAMP [13] on March 4th, 2016; and sequences from APD3 [44] downloaded on September 29th, 2017. For transcript evidence, the set of cDNA sequences supplied to GMAP above was supplemented with selected bullfrog transcript sequences from the Bullfrog Annotation Reference for the Transcriptome [33] (BART). Blastn [58] (version 2.31.1) was used to align the initial cDNA sequences from NCBI to BART, and BART sequences with an alignment of greater than 90% identity and 100% coverage were selected. A custom repeat element library was constructed from predicted repeats previously identified in the bullfrog genome [33] and supplied to MAKER2 for use by RepeatMasker [59]. The annotation pipeline was run with the snap hidden Markov model that produced the version 2 bullfrog gene predictions [33].

The MAKER2 gene predictions were filtered in two stages. First, sequences containing the highly conserved lysine-arginine (KR) enzymatic cleavage motif were selected, and the sequence of the putative mature peptide, produced via in silico cleavage at the C-terminal side of the cleavage motif, was extracted. Second, only putative mature sequences of 200 amino acid residues or fewer were retained, and sequences with non-standard amino acid residues were excluded. The peptide sequences passing these filters were fed into AMPlify for prediction. From the predicted putative AMPs, only short cationic sequences between five and 35 amino acid residues in length were chosen for synthesis and validation in vitro. We prioritized short cationic sequences because shorter sequences are more structurally stable in various environments (e.g. in vitro and in vivo) [60], lending themselves more readily to therapeutic application.
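The two-stage filter can be sketched as follows. Note that cleaving at the last occurrence of the KR motif and approximating net charge by counting basic minus acidic residues are our assumptions for illustration, not necessarily the exact rules used in the pipeline:

```python
def extract_mature_candidates(precursors, min_len=5, max_len=35):
    """From precursor sequences, cleave at the C-terminal side of the
    KR (lysine-arginine) motif and keep short cationic products.

    ASSUMPTIONS: the last KR occurrence is used as the cleavage site,
    and net charge is approximated as #(K, R) - #(D, E).
    """
    standard = set("ACDEFGHIKLMNPQRSTVWY")
    candidates = []
    for seq in precursors:
        pos = seq.rfind("KR")
        if pos == -1:
            continue  # no cleavage motif found
        mature = seq[pos + 2:]  # residues C-terminal to the KR motif
        if not (min_len <= len(mature) <= max_len):
            continue  # outside the 5-35 residue synthesis window
        if set(mature) - standard:
            continue  # non-standard residues excluded
        charge = sum(mature.count(a) for a in "KR") - \
                 sum(mature.count(a) for a in "DE")
        if charge > 0:  # keep only cationic candidates
            candidates.append(mature)
    return candidates
```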

### Antimicrobial susceptibility testing (AST)

From the novel candidate AMP sequences predicted by AMPlify, 11 were selected for validation in vitro. Minimum inhibitory concentrations (MIC) and minimum bactericidal concentrations (MBC) were obtained using the AST procedures outlined by the Clinical and Laboratory Standards Institute (CLSI) [40], with the recommended adaptations for testing cationic AMPs described by Hancock [41].

#### Bacterial isolates

A panel of two Gram-positive and four Gram-negative bacterial isolates was generated to test predicted AMPs. *Staphylococcus aureus* ATCC 6538P, *Streptococcus pyogenes* (hospital isolate, unknown strain), *Pseudomonas aeruginosa* ATCC 10148, and *Escherichia coli* ATCC 9723H were obtained and tested at the University of Victoria. Additionally, a multi-drug resistant (MDR), carbapenemase-producing clinical isolate of *Escherichia coli* harbouring New Delhi metallo-beta-lactamase (CPO-NDM) was obtained from the BC Centre for Disease Control. *E. coli* ATCC 29522 was purchased from Cedarlane Laboratories (Burlington, Ontario, Canada) for comparison of AMP activity between a wild type, drug-susceptible control and the MDR strain. The latter two strains were tested at the BC Centre for Disease Control using identical AST procedures.

#### Determination of MIC

Bacteria were streaked onto nonselective nutrient agar from frozen stocks and incubated for 18–24 h at 37 °C. To prepare a standardized bacterial inoculum, isolated colonies were suspended in Mueller-Hinton Broth (MHB) and adjusted to an optical density of 0.08–0.1 at 600 nm, equivalent to a 0.5 McFarland standard and representing approximately 1–2 × 10^{8} CFU/mL (CFU: colony forming units). The inoculum was diluted 1/250 in MHB to the target concentration of (5 ± 3) × 10^{5} CFU/mL. Total viability counts from the final inoculum were examined to confirm the target bacterial density was obtained.

Selected candidate AMPs were purchased from GenScript (Piscataway, NJ), where they were synthesized using the vendor’s FlexPeptide platform. Lyophilized peptides were suspended in sterile ultrapure water or filter-sterilized 0.2% acetic acid, as recommended by the solubility testing reports provided with the GenScript synthesis. AMPs were diluted from 2560 to 5 μg/mL by a two-fold serial dilution in a 96-well polypropylene microtitre plate before 100 μl of the standardized bacterial inoculum of (5 ± 3) × 10^{5} CFU/mL was added to each well. This generated a final test range of 256 to 0.5 μg/mL. MIC values were reported as the peptide concentration that produced no visible bacterial growth after a 16–24 h incubation at 37 °C.
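The dilution arithmetic is consistent with the reported ranges: a two-fold series from 2560 down to 5 μg/mL spans ten wells, and the final 256–0.5 μg/mL test range implies a ten-fold dilution of each well upon adding the inoculum. A short check:

```python
def twofold_series(top, bottom):
    """Two-fold serial dilution series from `top` down to `bottom` (ug/mL)."""
    series = [top]
    while series[-1] > bottom:
        series.append(series[-1] / 2)
    return series

# Peptides plated at 2560-5 ug/mL; adding the bacterial inoculum
# dilutes each well ten-fold to the final 256-0.5 ug/mL test range.
plate = twofold_series(2560, 5)
final = [c / 10 for c in plate]
```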

#### Determination of MBC

For each AMP dilution series, the contents of the well at the MIC and of the two adjacent wells containing two- and four-fold the MIC were plated onto nonselective nutrient agar and incubated for 24 h at 37 °C. The lowest concentration that killed ≥ 99.9% of the initial inoculum was recorded as the MBC.