Inferring B cell specificity for vaccines using a Bayesian mixture model

Background Vaccines have greatly reduced the burden of infectious disease, ranking in their impact on global health second only after clean water. Most vaccines confer protection by the production of antibodies with binding affinity for the antigen, which is the main effector function of B cells. This results in short term changes in the B cell receptor (BCR) repertoire when an immune response is launched, and long term changes when immunity is conferred. Analysis of antibodies in serum is usually used to evaluate vaccine response, however this is limited and therefore the investigation of the BCR repertoire provides far more detail for the analysis of vaccine response. Results Here, we introduce a novel Bayesian model to describe the observed distribution of BCR sequences and the pattern of sharing across time and between individuals, with the goal to identify vaccine-specific BCRs. We use data from two studies to assess the model and estimate that we can identify vaccine-specific BCRs with 69% sensitivity. Conclusion Our results demonstrate that statistical modelling can capture patterns associated with vaccine response and identify vaccine specific B cells in a range of different data sets. Additionally, the B cells we identify as vaccine specific show greater levels of sequence similarity than expected, suggesting that there are additional signals of vaccine response, not currently considered, which could improve the identification of vaccine specific B cells.

For a dataset x consisting of clonal abundances in subjects, s and at time points, t, the joint probability of the model is built up conditionally, and given by: where γ is the latent allocation vector denoting the allocation of BCR clones to classes (background, vaccine specific or non-vaccine specific); z is a binary variable indicating the presence or absence of a clone within an individual; and e is the latent allocation vector denoting the underlying distribution from which the clonal abundances are generated.
The parameter e is not of primary interest, so we marginalise over it, and obtain a posterior which is equivalent to a mixture model: The vector θ = θ 1,1 , . . . , θ ST contains the sample specific parameters associated with the underlying clonal abundance distributions, where where N B is the density of the negative-binomial distribution and dGP D is the density of the discretised Generalised Pareto Distribution [1]. These parameters are subject and time point dependent allowing for differences between the samples, in particular sequencing depths. The dGPD has a threshold parameter, and only assigns probability to values above this threshold. This ensures that it is only capturing the tail of the distribution (those clones which are seen in high abundance) and provides an intuitive interpretation that only clones seen at abundances above this threshold could be considered clonal.
We adopt a flexible approach allowing the model to be applied to a range of data sets, and therefore we use non-informative priors and seek to learn parameters as much as possible. We choose Dirichlet priors for the distribution Figure 1: Full graphical representation of model using plate notation.
of γ i and e ist , and a Beta prior for z is ; more precisely, p(γ i = class) = Γ class for 1 ≤ i ≤ K; class ∈ {bg, vs, ns} where K is the number of clones and Dir is the symmetric Dirichlet distribution. We set G = W = 1 to give the flat Dirichlet distribution, α = β = 1 to give a uniform distribution, and Θ defines the space of all possible parameter values. The full model is illustrated in plate notation in Figure 1.

Inference
The parameters are fitted to the data sets using an E-M algorithm. Initial parameter values are based on prior belief that vaccine specific clones will be rare, seen at high frequency and shared between multiple samples, and the results are robust to different initial parameter values which maintain these properties. This choice of initial parameters was seen to prevent problems of label switching and to identify clones with properties typically associated with vaccine response, whilst allowing the data to inform the final parameter values.
Restrictions on parameter values allow us to encode additional structure and to link parameters hierarchically. First, we assume no structure in the time profile for the B cell abundances which are not responding to the vaccine, so that ω bg,t = ω bg and ω ns,t = ω ns for all t. The time profile that we assume for the vaccine-specific cells assumes that pre-vaccination the abundances of vaccinespecific cells have the same distribution as the background cells (ω vs,0 = ω bg ), and that post-vaccination they have the same abundance distribution as B cells responding to a stimulus other than the vaccine (ω vs,t = ω ns , for t > 0). We also assume that the probability of a clone being observed in a subject is the same for B cells classified as background and those classified as a non-specific response, that is, p bg = p ns Finally, z is = 0 indicates an absence of B cells in subject, so in this case we restrict the B cell abundance to being generated by the point mass at zero by defining p(e ist = 1|γ i , z is = 0, t) = 1.
In order to prevent convergence to degenerative local maxima we restrict Γ class ≥ .001, so that is there is always some small probability of a clone belonging to any class.