### BCR repertoire vaccine study data sets

We use two publicly available data sets, one from a study involving a hepatitis-B vaccine [20] and one from a study on an influenza vaccine [10]. We describe these two data sets below. Both data sets capture the somatically rearranged VDJ region in B cells, in particular the highly variable CDR3 region on which we will focus.

#### Hepatitis B

In the study by Galson and colleagues [20], 5 subjects were given a booster vaccine against hepatitis B (HepB) following an earlier primary course of HepB vaccination. Samples were taken on days 0, 7, 14, 21 and 28 relative to the day of vaccination. Total B cells were sorted and sequenced in all samples. We refer to this data set as the *hepatitis B data set*.

In addition, cells were sorted for HepB surface antigen specificity at the same time points post-vaccination. The mRNA that was reverse transcribed to cDNA in these cells was then amplified using Vh and isotype specific primers and these IgH transcripts were then sequenced. These cells are enriched with those we are seeking to identify using our modelling approach, and provides the nearest available approximation to a truth-set of sequences which are vaccine-specific. We refer to these data as the *HBsAG+ data set*. Both data sets are publicly available on the Short Read Archive (accession PRJNA308641).

Sequences were generated on the Illumina platform using an RNA sequencing protocol, and the nucleotide sequences analysed. Targeting RNA means that highly abundant sequences may derive either from multiple B cells from a clonal subpopulation, or from one or a small number of B cells with high IgH gene expression, such as plasma cells that are actively secreting antibodies. Although we cannot distinguish between these two possibilities, both classes of cells are likely signifiers of immune response, and are therefore of interest.

#### Influenza

We also analyze data from subjects that were vaccinated against influenza in a study by Jackson and colleagues [10]. Samples were taken on days 0, 7 and 21 relative to vaccination. We analyzed a subset of 7 subjects that were deemed to be “seroconverters” who have an increased level of antibodies in response to the vaccine, based on vaccine-specific ELISA assays. This will be referred to as the *influenza data set*.

In addition, the authors also collected plasmablasts on day 7 in 5 of the subjects. These are also likely to be enriched for B cells responding to the vaccine and therefore act as an approximate truth-set providing an additional source of evaluation for our method. The sequences derived from these cells are referred to as the *plasmablast data set*. All data is publicly available on dbGaP (accession phs000760.v1.p1).

The Roche 454 platform was used to perform DNA sequencing of the somatically recombined IgH locus, using primers for the relatively conserved FR2 IgH V gene segment, and a conserved IgH J gene segment [10], and we analyse the amino acid sequences. Targeting DNA ensures that sequences with high abundance are representative of clonally expanded B cells, rather than of cells exhibiting high mRNA expression. However, active plasma cells with high secretion rate would still be counted individually.

### Clonal identification

We combined sequences into *clones* primarily to group together sequences arising from the same clonal expansion, and this also serves to correct for read errors and group together some highly similar sequences that likely target the same epitope. This removes some noise associated with read error and strengthens signals by treating multiple sequences all of which target the same epitope as a single clone, whilst also reducing the computational burden. Each clone consists of a single identifying CDR3 sequence, the *clonal centre*, and its set of neighbouring CDR3 sequences; for two sequences to be considered neighbours, they must be of the same length and be highly similar, which we define as greater than 85% similarity for nucleotide sequences as in the hepatitis B data set, or 90% similarity for amino acid sequences as in the influenza data set. The clonal identification was performed in a greedy manner, by iteratively identifying a clonal centre as the sequence with the greatest number of neighbours from among all unassigned sequences, and assigning it and its unassigned neighbours to a new clone. This is a computationally efficient approach to clonal identification which allows us to process very large data sets. However, the model presented here is not dependent on the clonal identification method used, and any alternative method could also be used as input.

Within each data set, we identified clones using all samples and time points together, but kept track of sample- and time-specific counts to enable the analysis of time dynamics and between-individual sharing. This results in some clones which are present in multiple individuals and therefore considered ‘public’ clones. We now consider each clone to be representative of the BCR sequence *i* at its centre, and make no distinction between clones and the individual sequences which form the clonal centres. In addition we shall use *i* to refer to the B cell(s) that the clone represents. We define the *clonal abundance*, denoted by *x*_{ist}, as the number of sequences assigned to clone *i* for a participant *s* at time point *t*, and the total clonal abundance as the total number of sequences assigned to the clone across all samples, \(\sum _{st}x_{ist}\).

### Model

We introduce a hierarchical Bayesian model to describe the clonal abundance (or alternatively, CDR3 sequences) across individuals inoculated with the same vaccine, and across multiple time points. The data are abundances, *x*_{ist}, as introduced above. The goal of modeling these data is to identify CDR3 sequences of vaccine-specific BCR clones from among a large number of non-vaccine-specific BCRs, while accounting for sparse sampling and for the highly stochastic nature of the biological process that generates them.

One identifying feature of vaccine-specific BCR clones that we want to model is their abundance profile. We expect to observe no vaccine-specific BCRs pre-vaccination (or very few, in the case of a primer-boost design such as for the HepB data set), while post-vaccination we expect to observe high abundances due to clonal expansion of stimulated B cells, the presence of plasma cells with high transcription activity, or both. A second feature that helps to characterise vaccine-specific BCRs is their tendency to be shared across individuals, due to convergent evolution.

To describe the model we introduce some notation. As above let *i* denote a BCR clone, and denote by *Ω* the space of all clones. We partition this set as *Ω*=*Ω*_{bg}∪*Ω*_{vs}∪*Ω*_{ns}, where the disjunct subsets represent background BCR clones not responding to any stimulus; vaccine-specific BCR clones responding to the vaccine stimulus; and BCR clones responding to a non-specific stimulus other than the vaccine respectively. These subsets (and their sizes) are unknown, and the classification of a particular clone *i* is given by a discrete random variable *γ*_{i}∈{*bg*,*vs*,*ns*}, so that \(i \in \Omega _{\gamma _{i}}\).

Next, the presence of a particular B cell clone *i* in a participant *s* is encoded by a second discrete random variable *z*_{is}, which takes on the value 0 when *i* is absent from the BCR repertoire of individual *s* at any time point, and 1 when *i* is present in the individual (though not necessarily present in any sample taken from this individual). The variable *z* aims to account for the sparsity resulting from the diversity of BCR repertoires from different individuals. The distribution of *z*_{is} is dependent on *γ*_{i}, to allow modeling the increased probability that vaccine-specific BCRs are shared between individuals.

The actual abundances *x*_{ist} of clone *i* in individual *s* at a time point *t* are assumed to be independent conditional on *γ*_{i} and *z*_{is}, and are modeled by a mixture of three distributions representing three outcomes, modeled by a third discrete random variable *e*_{ist} whose distribution depends on *γ*_{i}, *z*_{is} and *t*. First, the relevant B cell or cells may be absent from individual *s* (if *z*_{is}=0) or may have escaped sampling. In this case *x*_{ist} is distributed as a point mass at 0. Second, if B cells have been sampled, they may be neither clonal nor plasma B cells, and would therefore contribute a small number of sequences to the data set. In this case *x*_{ist} is modeled as a negative Binomial distribution. The remaining case is that the sampled B cell or cells are either plasma cells, or cells sampled from a large clonal population (or both), in which case they are expected to contribute a large number of sequences. In this case *x*_{ist} is modeled as a discretised generalized Pareto distribution [21]. This distribution of abundances is illustrated in Fig. 5a. The mixture distribution of clonal abundance *x*_{ist} is given by *p*(*x*_{ist}|*e*_{ist},*θ*), where *θ* is the vector of parameters of the negative Binomial and generalized Pareto distributions.

The resulting joint probability for a data set *x*, latent variables *e*,*z* and parameters *γ*,*θ* under this model is given by

$$ {\begin{aligned} p(\boldsymbol{\theta},\boldsymbol{\gamma},\boldsymbol{z},\boldsymbol{e},\boldsymbol{x}) & = p(\boldsymbol{\theta}) \prod_{i} p(\gamma_{i}) \prod_{s}p(z_{is}|\gamma_{i}) \prod_{t} p(e_{ist}|\gamma_{i},z_{is},t)p(x_{ist}|e_{ist},\boldsymbol{\theta}) \end{aligned}} $$

(1)

The relationship between the variables in the model is shown in Fig. 5b. Non-informative priors *p*(*θ*) and *p*(*γ*) are placed on the parameters; this allows these parameters to be learnt from the data, and therefore allows the model to be applied to a range of data sets, for instance RNA sequencing and DNA sequencing. Full details of the model and priors are provided in the “Background” section; Additional file 1.

We restrict *i* to range over only those BCRs which are observed at least once in the data set, rather than the 10^{13} that are theoretically possible. Therefore, for *K* BCR clones, we have that 1≤*i*≤*K*. This simplifies model fitting, but will result in parameter estimates which are specific to each individual data set, and therefore affected by features such as the number of individuals. This should be kept in mind when interpreting the results.

### Inference

The model is fitted to each data set using an Expectation-Maximisation (E-M) algorithm which iteratively maximises the model allocation parameters conditional on the parameters which determine the distribution of each classification, and vice versa; see Additional file 1 for details. Initial parameters were chosen to reflect our prior beliefs that clones responding to the vaccine would be more likely to be present in low abundance pre-vaccination but high abundance post vaccination, and that they are more likely to be seen in multiple individuals, and results were robust to initial values which preserve these beliefs. This approach ensures that the parameters associated with each class are consistent with its biological interpretation and avoids the problem of label switching. Since these data sets are particularly large, and the number of model parameters relatively small, there is little uncertainty in our parameter estimates. Therefore, this approach is a computationally efficient alternative to Markov Chain Monte Carlo (MCMC) approaches, which is able to optimise the posterior.

Restrictions on parameter values allow us to encode additional structure and to link parameters hierarchically. Firstly we assume that there is no time-dependence for the abundances of B cells classified as background or as non-specific responders. We further assume that for the vaccine-specific cells, the pre-vaccination abundances (at *t*=0) follow the same distribution as B cells classified as background, while post-vaccination these cells follow the same abundance distribution as B cells classified as non-specific responders. Third, we assume that the probability of a clone being observed in a subject is the same for B cells classified as background and those classified as a non-specific response. In effect this assumes that non-specific responders are or have been responding to private stimuli, rather than for instance earlier common infections.

The uncertainty in the inferred model parameters is negligible in comparison to the biological noise because of the large amount of data. Rather than reporting this spurious precision, we report the parameter estimates without error bars, but we note that errors due to model misspecification are likely to be substantial. We report the inferred probability of a BCR clone belonging to each category, *Γ*_{class} for \(\phantom {\dot {i}\!}class \in \{\mathsf {bg},\mathsf {vs},\mathsf {ns}\}\). We also report, for each class, the probability that a clone is observed given that a corresponding B cell of that class is present in an individual, *p*_{class}. Finally, we report for each class the inferred probability that a clone is being observed with high abundance, *ω*_{class}.

### Sequence similarity

To compare the within-set similarity of sequences between subsets of sequences of any length, we use the Levenshtein (or “edit”) distance as implemented in [22]. Specifically, given a subset of sequences, we compute a measure of within-set similarity the mean of the Levenshtein distances between all pairs of sequences in the subset. To assess significance we use bootstrapping: we calculate the mean Levenshtein distance between a randomly selected subset of the same size, and compare the resulting null distribution of means to calculate the empirical p-value.

### Thresholding method

Existing methods for identifying vaccine specific BCR clones rely on identifying sequences which are either highly abundant, shared between multiple individuals, or both. Empirical methods are typically used to determine thresholding criteria for abundance and sharing [8, 18, 20], sequences which are above these thresholds are then considred to be likely vaccine specific. Alternatively, statistical significance of sequences in cases relative to controls can be used to determine threshold levels [16], or training and test sets used for validation [19].

We define an abundance threshold above which clones are considered to be highly abundant as a quantile of all abundances in an individual sample [20]. This allows the actual abundance value to change according to sample variability such as sequencing depth. Clones may then be considered vaccine-specific if they are below this threshold pre-vaccination and above this threshold for at least one time point post-vaccination. We also define a sharing threshold as the minimum number of individuals in which a clone must be present in order to be considered vaccine specific [19]. Sequences from individual clones are considered vaccine-specific if both the abundance and sharing criteria are met, and we evaluate a range of different thresholds by comparing them to our truth sets.