An open protocol for modeling T Cell Clonotype repertoires using TCRβ CDR3 sequences

T cell receptor repertoires can be profiled using next generation sequencing (NGS) to measure and monitor adaptive dynamical changes in response to disease and other perturbations. Genomic DNA-based bulk sequencing is cost-effective but necessitates multiplex target amplification using multiple primer pairs with highly variable amplification efficiencies. Here, we utilize an equimolar primer mixture and propose a single statistical normalization step that efficiently corrects for amplification bias post sequencing. Using samples analyzed by both our open protocol and a commercial solution, we show high concordance between bulk clonality metrics. This approach is an inexpensive and open-source alternative to commercial solutions. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-023-09424-z.


Background
The receptors on the surface of T cells bind to an enormous array of antigens that play a pivotal role in shaping immune response during health and disease. The T cell receptor (TCR) is a heterodimer composed of one alpha and one beta chain which are encoded by the TCRα and TCRβ genes, respectively. To recognize an extremely large antigen space, the TCR genomic loci undergo somatic recombination of variable (V), diversity (D), and joining (J) gene segments, and generate a diverse repertoire of TCRs. The complementarity determining region 3 (CDR3) region present at the D segment of the recombined TCRβ gene is highly diverse in TCR beta chains. Therefore, surveying the recombined TCRβ gene or transcript as a proxy for overall TCR repertoire diversity has emerged as a rational approach to study TCR repertoire dynamics.
Over the last 10 years it became possible to obtain comprehensive profiles of TCR through an array of next generation sequencing (NGS) based approaches [1][2][3][4][5][6][7][8]. These vary based on the experiment type (bulk or singlecell), sample type (RNA or DNA), library preparation (multiplex PCR, bead-based enrichment, 5'RACE) and sequencing platform, each choice presenting a different, and fascinatingly interlocked trade-off. Single cell approaches compared to bulk can be very accurate and unbiased for high frequency clones, but have lower resolution for low frequency clones [9,10]. They are also substantially more expensive and require intact cells. RNA-based approaches are affected by the variability in TCR RNA expression levels, but may better reflect diversity when the sample size is limited [8]. Genomic DNA (gDNA)-based approaches require either multiplex PCR or target enrichment during library preparation which introduces biases [11]. A recent comparison of these two approaches confirmed these issues for both RNA-and DNA-based methods but also found the methodological variability to be smaller than the biological variability [12]. New innovations are being introduced rapidly for all of these approaches but currently there is no established gold standard.
Multiplexed PCR-based bulk sequencing approaches using gDNA, however, have become the standard approach for translational and even clinical applications due to reasonable sample requirements and moderate costs [13]. This is reflected in the fact that all currently available commercial TCR sequencing products offer this as their major DNA option (Adaptive Biotechnologies [14], BGI [15], iRepertoire). Multiplex PCR refers to the usage of multiple forward primers specific for the V segments and multiple reverse primers specific for the J segments in combination during the initial amplification for target enrichment. Since each primer pair will have a different efficiency multiplexing will distort the relative abundances of the VDJ segment combinations [16]. Correcting for this amplification bias is a key challenge for accurate quantification. One approach [14] is using spiked-in oligomers (synthetic templates) for each primer pair to measure primer efficiencies and to carefully control the design of the primers and their concentrations accordingly. Assuming that there is no interaction between the efficiency of primer pairs, amplification bias is reduced by iteratively calibrating the primer concentrations to find the optimal primer mix, and then removing any remaining amplification bias computationally using spiked-in oligomer counts.
The proprietary setup described in [14] is currently available for human and murine samples exclusively through commercial kits. These are difficult and laborintensive to adapt if requiring different settings, e.g. a different mammal or application. Furthermore, synthetic templates when added to all of the samples with the kits can substantially increase preparation and sequencing costs, as well as decrease coverage for clonotypes.
Here we demonstrate, that using a negative binomial mean normalization strategy, it is possible to normalize amplification bias within a reasonable error margin without balancing primer concentrations using synthetic templates. Furthermore, we report that amplification bias parameters are highly preserved between datasets. This observation indicates that one can use synthetic template normalization controls for a small subset of samples, which can then be used to calculate mean scaling factors per batch/experiment, thus substantially reducing the cost and coverage of TCR repertoire analysis.

Amplification bias due to multiplex PCR is reproducible
To amplify all possible TCR somatic recombination products, we performed a multiplex PCR with 20 different V-specific forward primers and 13 different J-specific reverse primers. Since the differences in efficiency of primer pairs can produce significant amplification bias in TCR clonotypes, we spiked-in 260 synthetic TCR templates (ST) in equimolar concentration as internal controls to the multiplex PCR reaction in order to measure and control bias (Fig. 1A). We hypothesized that the estimated mean counts of the 260 ST would scale proportionally across samples and experiments. To test this, we measured the distribution and variation of ST counts across 20 ST-only samples in the absence of genomic DNA (Fig. 1B). Within each sample, ST counts were converted to relative frequencies to reduce the effect of random sample-to-sample variation on the comparisons. For a given ST, deviation of the median relative frequency across all samples from 1/260 is a measure of the PCR amplification bias for the corresponding primer pair, while the spread of relative frequencies is an indication of the random sample-to-sample variation. We observed that the ST-to-ST variation was much larger than the sample-to-sample variation within an individual ST (~ 20-fold difference in respective median IQRs). These differences indicate that the observed differences in ST counts are primarily caused by amplification bias of different primer pairs. The ln (1/260) line indicates the (log) expected ST relative frequency in the absence of amplification bias. To demonstrate that the above observations apply in the presence of TCR clonotypes, we obtained ST counts from samples with genomic DNA extracted from P14 and OT-1 TCR transgenic mice where CD8 T cells primarily recognize OVA 257-264 when presented by the MHC I molecule. A similar relationship was found between ST relative frequencies in the presence of TCR clonotypes across samples, though the sample-to-sample variation is noticeably larger ( Supplementary Fig. 1).

A negative binomial model fits the data
The fit of count data on 260 ST from each of 20 ST-only samples to the negative binomial (NB) with ST-specific means and a common dispersion parameter d can be informally assessed by examining an empirical variance (v) vs mean (m) plot with the line v = m + dm 2 displayed, all with log-log scales.
At the high end, we expect an approximate linear relationship between log mean and log variance for the ST counts, with a slope of about 2, since log v ≈ log d + 2log m for large m. In Fig. 2, we took d = 0.125, as this was the median value of the dispersion estimates found by fitting separate NB distributions to the 260 sets of 20 ST counts.
The fact that the ST counts fit NB distributions with approximately similar overdispersion parameters reassures us the variation we are seeing is in some sense natural and that the system is in control [17].

Scaling factors are relatively stable across experiments
We fit the NB model to three different sets of ST-only observations at different equimolar concentrations, coming from two different batches: (1) a set of 20 samples from one batch, (2) two sets of 10 samples from another batch. To demonstrate that scaling factors were relatively stable across ST-only samples, we compared the sets of 260 (m i /m • ) values based on 20 ST-only samples, to those based on the sets of 10 ST-only samples. We observed the (m i /m • ) values to be highly correlated on a log-log scale (Pearson r = 0.83 and 0.97, p-value < 2 × 10 -16 for both comparisons) (Fig. 3A-B), indicating that scaling factors are relatively stable across experiments.
We then computed estimates of the (m i ) by pooling data across these sets, adjusting for the concentration differences, calling the results the combined estimates. Pooling also deals with errors introduced to the ST-only counts by processing samples in different batches along with different samples. The combined estimates of the (m i ) are therefore used for normalization below (Supplementary Appendix 1).

Normalization considerably reduces amplification bias
We normalized the 20 ST-only measurements with the combined estimates of the (m i ). The spread of the ST counts for 20 samples was considerably reduced after normalization (Fig. 4A). A similar comparison of ST counts for 20 samples, in the presence of genomic DNA derived from mesothelioma tumors, revealed that the reduction in spread of ST normalized counts was present, although less pronounced ( Supplementary Fig. 2). To validate the normalization procedure further, we assessed the observed ratio of monoclonal counts before and after normalization utilizing the 50:50 mixture of P14 and OT-1 TCR transgenic monoclonal DNA. After normalization, the differences between the transgenic TCR counts were substantially reduced for all samples, as represented by the smaller deviations of the proportion of the dominant clonotype from ½ following normalization (Fig. 4B).

Amplification bias reduction benefits from the dependence of primer pairs
A question from data presented in Fig. 4A and B that arose had to do with determining how great a reduction in the spread of the 260 ST counts would be were well approximated by NB distributions with the same dispersion parameters (albeit with quite different means), our conclusion was that the different ST counts were likely not independent. Further, this result supports the normalization scheme, as the dependence aids reduction of the amplification bias below the level that would be expected under independence. Since each primer pair shares one primer with 32 other primer pairs, it is not surprising that the different ST counts are not independent; indeed, patterns of dependence in counts using chi-squared statistics are observed (Fig. 5). The interactions revealed as patches of red and blue colors demonstrate that groups of V primers exhibit positive or negative dependence together with groups of J primers, that is, they interact to become over or under-represented in groups.
In their experimental setup, Carlson et al., using an ANOVA based approach, concluded that primer pairs can be treated independently, and employed primer iteration experiments to find the optimal primer mix [14]. Based on our normalization approach, however, the non-random, non-zero interaction terms revealed by chi-squared statistics substantially complicate preparation of an optimal primer mix through primer iteration experiments, and indicate a limit on the extent to which amplification bias can be addressed experimentally. We note that Carlson et al. also employed a second, computational normalization step, which we believe is primarily due to this limit.

A negative binomial model supports downstream analyses of T cell repertoire dynamics
With the TCR clonotype counts normalized, we wanted to determine if results from these analyses were concordant with results from a commercially available platform (Adaptive Biotechnologies) described by Carlson et al. [14]. To achieve this, we utilized gDNA generated from pancreatic ductal adenocarcinoma specimens described in Byrne et al. [18] containing 17 samples previously sequenced by Adaptive Biotechnologies' platform. Aliquots of samples were sequenced and processed using the protocol described above. We utilized in-house software (see Material and methods), tcR package, and VDJ tools to compute TCR repertoire metrics such as the diversity, clonality, and clonal distribution and refer to this pipeline as Open TCR Sequencing Protocol (OTSP). The 16 samples with enough DNA were sequenced in parallel and results were evaluated for concordance using Spearman correlation analyses for all combinations of: 1) Adaptive Biotech. platform sequences run by OTSP; 2) OTSP sequences run by OTSP; and 3) Adaptive Biotech. platform sequences Normalization reduces amplification bias. A NB normalization reduces amplification bias. Variation in ST counts within samples is apparent before normalization (red), however, the ST-to-ST differences within each sample are reduced to less than two-fold after normalization (cyan). Data derived from twenty ST-only samples described in the ST-only data sets. B Amplification bias of spleen genomic DNA of P14 and OT-1 TCR transgenic mice. To further evaluate normalization parameters, a 50:50 mixture of P14 and OT1 TCR transgenic monoclonal DNA was utilized to examine differences between transgenic TCR counts (red) that were reduced for all samples following normalization (cyan), as observed by the smaller deviations from ½ of the proportions of the dominant clonotype. Red color indicates values from clone counts before normalization, green color indicates values from normalized clone counts. Data derived from 12 50:50 mixture of P14 and OT-1 transgenic mice samples described in the Transgenic TCR data sets run by Adaptive Biotech. Figure 6A and B show concordance between results of OTSP sequences run by OTSP and Adaptive Biotech. sequences run by OTSP for Clonal index (r = 0.7) and Shannon diversity index (r = 0.8). The concordance for other combinations for Clonal Index and Shannon Diversity, as well as concordance between results of OTSP sequences run by OTSP and Adaptive Biotech. platform sequences run by Adaptive Biotech. for the frequency of hyperexpanded clones are shown in Supplementary Fig. 3A-E. Overall, results from the OTSP pipeline demonstrates a strong association with results from the Adaptive Biotech. TCR sequencing platform (p < 0.001 for all comparisons).
The descriptions and formulas related to Clonal index, Shannon diversity index and frequency of hyperexpanded clones are included in [19], which deploys the NB mean normalization methodology described here in a biological context.

Discussion
Measuring and monitoring adaptive dynamics in patient TCR repertoires could have a significant impact on response and resistance monitoring for patients receiving various forms of immunotherapy in the treatment of cancer or auto-immune diseases. To achieve the goal of capturing the diversity, quantifying the abundance of T-cell clones and performing longitudinal comparisons, TCR sequencing has emerged as an approach to monitor T cell responses to therapy and disease progression.
The OTSP pipeline described herein provides a transparent protocol enabling clonality metrics, including evaluating amplification bias specific to each primer pair, and is reproducible across samples. Count variability approximates the negative binomial distribution that can be exploited using estimated NB means. The NB distribution tells us the variation anticipated in ST counts is indicative of a controlled process. OTSP pipeline results were compared to data generated from the commercial platform by fixing 260 ST-specific scaling factors derived from ST-only measurements (without the presence of genomic DNA). We observed a high concordance between bulk clonality metrics across the two platforms. This observation indicates that the OTSP pipeline can be integrated across batches, samples and platforms, further improving utility of TCR clonality measurements, something not generally possible when using commercial platforms as ST-counts are typically not provided.
OTSP is open and freely available; we anticipate this will allow scaling up the number of measurements substantially. Since we only use computational normalization, rather than addressing differing primer efficiencies at the bench level, the OTSP methodology is also less labor-intensive than previous methods. Notably, OTSP avoids the primer iteration experiments needed to address amplification bias problems. Porting of the platform to new organisms or designs will require new calibration. Utilizing the designs presented here for mouse are likely to require minimal optimization.
Since PCR amplification bias is repeatable across samples we have demonstrated the possibility of conducting analyses without the addition of ST beyond the initial calibration. This is achieved by using ST-specific normalization scale factors obtained from independent, ST-only measurements. The idea of using 260 universal ST-specific scaling factors to address amplification bias in mouse models could be explored further, for if deployed this would substantially decrease cost of this methodology as ST are costly. An additional advantage stems from the fact that ST reduce sequencing depth due to competition between genomic DNA and the ST ( Supplementary  Fig. 4). We observed that when gDNA amount was kept constant at 600 ng, increasing concentrations of ST led to decreasing detectability of clonotypes, indicating a competition between the ST and clonotypes during the process. This is especially important as indicated by our results revealing that drop-outs can be frequent, even for the most abundant clonotypes ( Supplementary Fig. 5). For example, when the most frequent 0.3% clonotypes from Wild Type spleen tissue were used, we observed that only 119 distinct clonotypes were detected in the 5 replicates, with 63 clonotypes detected in 4 samples, 69 in 3, and 90 in 2, respectively. These drop-outs stem from the stochastic nature of the sampling and could be reduced by increasing the read coverage by not using ST.

Conclusions
Measuring and tracing changes in the abundance of clones is important for observing the immune response to different therapies such as cancer immunotherapy, leukemia monitoring and predicting patient outcomes.
A high concordance between bulk clonality metrics across Adaptive Biotech., one of the leading commercial companies in the field, and OTSP is observed. Our approach is a less laborious (as OTSP avoids the primer iteration experiments needed to address amplification bias problems), reproducible and an inexpensive alternative for understanding relative abundances of clonotypes. Utilization of STs for every experiment in a batch beyond the initial ST calibration may not be necessary and their utilization in every batch decreases the sequencing depth.
We propose the OTSP as an open-source, transparent protocol that efficiently corrects for amplification bias post sequencing for accurate, reproducible measurement of clonality metrics.

Material and methods
The aim of the methodology is modeling T Cell Clonotype repertoires using TCRβ CDR3 sequences utilizing NGS, multiplex PCR and an NB mean normalization strategy. Some of the experiments were performed in the context of genomic DNA derived from mice.

Mouse handling
To generate the Byrne et al. data set, wild-type C57BL/6 mice were purchased from The Jackson Laboratory and housed at the University of Pennsylvania. Animal protocols were reviewed and approved by the Institute of Animal Care and Use Committee at the University of Pennsylvania. Mice were euthanized in a CO2 chamber using a flow meter to ensure CO2 was displaced at a rate of 30-70% of the chamber volume per minute and maintained for at least 1 min after the loss of righting reflex is observed. Euthanasia was confirmed by bilateral thoracotomy. Animal handling information regarding tumor injections, drug prep and injection are included in Byrne et al. [18].
To generate transgenic TCR data sets, spleen tissue from P14 and OT1 TCR transgenic mice (Nolz lab) were obtained. The spleen tissue was homogenized, treated with RBC lysis buffer and the resulting single-cell suspension was pelleted. Genomic DNA was extracted from the cell pellet using DNeasy blood and tissue kit (Qiagen).
To generate all the other data using animals, wild-type C57BL/6 mice were purchased from The Jackson Laboratory and maintained within the UCSF or OHSU laboratory for animal care barrier facility according to IACUC procedures. To generate the ST dilution series data set with mammary tumor and the Mesothelioma data set, we used mammary tumors from mouse mammary tumor virus (MMTV)-Polyomavirus middle T (PyMT) transgenic FVB/N mice and 40L orthotopic mesothelioma tumors respectively. Mammary tumors were resected from day 95 mice. For 40L orthotopic mesothelioma tumors, 2 × 10 6 cancer cells were injected i.p. into wildtype male C57BL/6 that were 6-12 weeks of age. For both tumor models, all mice were euthanized at a predefined end-stage for tissue harvest by cardiac puncture followed by cervical dislocation. These mice were cardiac perfused, under anesthesia using 1-5% isoflurane, with 20 mL solution of heparin in PBS to clear tissues of residual blood followed by tissue harvest for further analysis including TCR sequencing. Tumor tissue was excised and flash frozen in liquid nitrogen and stored at -80 degree Celsius until further use for extracting genomic DNA for TCR sequencing. Murine SCC tumors were obtained from a previously published study, Medler et al. [19].

Multiplex primers and design of synthetic TCR templates
Multiplex PCR primers previously described by Faham et al. (US patent 8,628,972 B2), for the amplification of murine TCRβ genomic loci were utilized (Supplementary Table 1: primer sequences). The 20 Vβ segment specific primers amplify all the 21 functional Vβ segments, and the 13 Jβ specific primers amplify all the 13 functional Jβ segments. As previously described by Carlson et al., we designed 260 (20 V x 13 J) synthetic TCR templates (ST) to minimize amplification bias due to multiplexing with 20 Vβ forward and 13 Jβ reverse primers [14]. Briefly, ST are 200 bp long double stranded DNA segments that contain partial V segment and J segment sequences encompassing a set of internal barcodes for post-sequencing identification. The internal barcode region contains a 16 bp barcode specific for each VJ combination. This specific barcode is further flanked by a 9 bp barcode that is common for all ST. An equimolar mixture of the 260 ST is added to the genomic DNA samples during PCR as internal controls.

Amplification and deep sequencing of TCRβ genomic locus
Genomic DNA from freshly resected mouse peripheral blood, peritoneal orthotopic mesotheliomas [20], and spleen was isolated using the Qiagen DNeasy Blood and tissue kit. Utilizing the method described in Robins et al. [21], we performed a 2-stage PCR using genomic DNA for TCRβ deep-sequencing library preparation. The 1 st stage involved amplifying the gDNA and the ST using 35 cycles of multiplex PCR with 20 Vβ forward and 13 Jβ reverse primers using the Qiagen multiplex PCR kit. The multiplex PCR primers contain a common 5' overhang, allowing amplification by a single primer pair in the 2 nd stage PCR. Using 2.0% of the purified PCR product from stage 1 as template, a 2 nd stage PCR, including 8 cycles, was performed with universal and indexed Illumina adaptors. Of note, the indexed adaptors contained an 8-base index sequence, providing each sample with a unique sample barcode. Equal volumes of all samples were pooled. Each pool concentration, typically containing PCR mixtures from 70 samples, was measured with a 2200 TapeStation (Agilent), and the concentration determined by real time PCR using a StepOne Real Time Workstation (ABI/Thermo) with a commercial library quantification kit (Kapa Biosystems). Paired-end sequencing was performed with a 2 × 150 protocol using a Midoutput 300 sequencing kit on a NextSeq 500 (Illumina). Target clustering was ~160 million clusters per run. Following the run, base call files were converted to fastq format and demultiplexed by a separate barcode read using the most current version of Bcl2Fastq software (Illumina).

TCR data analysis pipeline
Fastq files were assessed for initial read quality using the FASTQC public tool [22], including the per-base quality scores. Quality paired-end sequences were combined using the PEAR (Paired-End reAd mergeR) algorithm [23]. Merged sequences were then separated into ST and non-ST sequences. ST sequences were identified by searching for the common flanking 9-bp internal barcodes allowing a one-nucleotide mismatch or indel. Sequences flagged as ST via this search were removed from downstream clonotype analyses. The individual ST sequences were distinguished and quantified by searching for the specific 16-bp barcode sequences unique to each ST, again allowing a one-nucleotide mismatch or indel ( Supplementary Fig. 6). Clonotypes were identified from purified (ST-removed) sequences utilizing the MiXCR pipeline [24], which is a two-step alignment and assembly process. First, reads were aligned to reference V, D, and J sequences, using the align module. Next, the assemble module grouped alignments into distinct clonotypes using a hierarchical clustering method based on sequence similarity and relative abundance. Finally, the export module exported alignments as well as assembled clones in tabular format. Raw clonotype counts were normalized using the NB mean normalization strategy described below. Normalized clonotype counts were exported in tabular format for use in downstream analysis. A number of TCR repertoire metrics, including clonality, maximum clonal frequency, and the Shannon diversity index were calculated. Quality control data was recorded in an overall summary table, for reference (Supplementary Material 1).

Data sets ST-only data sets
Twenty samples of an equimolar mixture of ST were sequenced. Two sets of ten samples of equimolar mixtures of ST at different concentrations were also sequenced at a later date. No genomic DNA was present in these samples.

Transgenic TCR data sets
A dilution series was created from spleen genomic DNA of P14 and OT-1 TCR transgenic mice. Three technical replicates of 300, 600, 900, and 1200 ng of DNA from P14, OT1, and a 50:50 mixture of P14 and OT1 DNA were sequenced along with an equimolar mixture of ST for a total of 36 samples. 4 mice were used to create the DNA from P14, another 4 mice were used to create the DNA from OT1 and 8 mice were used to create 50:50 mixture of P14 and OT1 DNA. In total, 16 mice were used to create this data set. Appropriate amplification of the transgenic clones was assessed ( Supplementary  Fig. 7).

ST dilution series data set
A dilution series was created from ST at six levels as below: Three technical replicates of 600 ng of murine blood DNA were added to all levels of dilutions and for no ST samples for a total of 21 samples and three technical replicates of 600 ng of mammary tumor murine DNA were added to all levels of dilutions and for no ST samples for a total of 21 samples. In total, 14 mice were used to create this data set.

WT spleen data set
Wild-type mouse spleen genomic DNA were sequenced for a total of five technical replicates. A single mouse was used to create this data set.

Byrne et al. data set [18]
gDNA from 17 murine pancreatic ductal adenocarcinoma specimens were previously sequenced by a commercially available TCR beta platform. Aliquots of these gDNA samples were obtained and sequenced along with an equimolar mixture of ST using the protocol described above. In total, 17 murine samples were used to create this previously published data set.

Mesothelioma data set
Peritoneal mesothelioma tumors derived from the 40L cell line [20] derived genomic DNA samples derived from 12 syngeneic mice were sequenced along with equimolar mixture of ST. In total, 12 mice were used to create this data set.
In total, 60 mice were used to create above datasets to assess methodological approaches to normalizing TCR repertoires. No groups of animals were compared to each other as the purpose of the study has no biological study endpoint. Therefore, no a priori sample size calculations performed.

Batch mean scaling factors
This method creates a scaling factor for each ST (and therefore for each primer pair) based on that ST's counts among all samples, relative to all ST counts within a batch. Given a matrix C ij of ST counts from a batch, where i = 1,…,260 labels ST and j = 1,…,n labels samples in the batch, we denote the batch mean of the counts for ST i by C i• = n −1 j C ij and the batch mean of all ST means by

Negative binomial means
The above idea of a scale factor is distribution free, but for its use in normalizing counts, would require a full set of ST in every sample. We explored the use of a ST-specific negative binomial model to dispense with the use of synthetic templates. Consider the set (C 1 , …., C n ) of counts for single, fixed ST across a batch of n replicate ST-only samples. A plausible model for these counts is the negative binomial (NB) distribution. We write C ∼ NB(m, d) for this distribution, where m > 0 is the mean parameter and d ≥ 0 is the (over-) dispersion parameter, and refer to [25] for an explicit formula for the NB probability mass function. For present purposes, C has expected value E(C) = m and variance var(C) = m + dm 2 . When d = 0, the negative binomial reduces to the Poisson distribution, for which E(C) = var(C), and thus the use of the term over-dispersion here is relative to the Poisson. Using the methods of generalized linear models [25], we can obtain the maximum likelihood estimates (MLE) m and d of m and d from a replicate set of ST such as (C 1 , …., C n ). In the notation of the previous paragraph, the MLE m i = C i• , that is, the MLE of the ith mean parameter m i of an NB fitted to (C ij ) is the arithmetic mean of the ith set of observed counts (assumed independent and identically distributed across j = 1,…,n with common mean m i and common dispersion parameter. Where no confusion will result in what follows, we will not distinguish the parameters (m i ) from their (maximum likelihood) estimates ( m i ) . These ST or primer-pair-specific means estimated from ST-only data can be used as scaling factors for normalization, even when there are no ST present in samples. Consistent with the notation in the previous paragraph, we write m • = (260) −1 i m i for the average of the 260 (estimated) mean parameters. The 260 NB mean scaling factors are (m i /m • ).

Normalization
To normalize a set of clonotype counts from a single sample, we first calculated the primer-pair totals (C i ), where C i denotes the total count of all clonotypes amplified with primer-pair i, where i = 1,…,260. We then normalize the 260 counts (C i ) using the batch scaling factors (SF i ) by dividing by the corresponding scaling factor: Similarly, we normalize the (C i ) using the (estimated) NB mean scaling factors (m i /m • ) by dividing: C ′′ i = (m • /m i )C i . After these primer-pair totals were normalized, the counts for distinct clonotypes sharing the same primer-pair were normalized: if one such clonotype accounts for a proportion p of the total count C corresponding to its primer pair, then it will be assigned a normalized value equal to the same proportion p of the normalized primer-pair total C' or C".
The same normalization could be used for ST counts if available. That is, divide the observed count of the fragments arising from primer pair i by the SF i or m i /m • for primer pair i for both ST and clonotype counts alike. Since the mean count of ST i will be proportional to m i , normalization should preserve the total count of ST, exactly for batch mean normalization, on average for NB normalization. As long as the observed clonotype counts exhibit the same relative over-and under-representation after amplification as that exhibited by the ST, any bias will be reduced by this normalization.