Skip to main content

An open protocol for modeling T Cell Clonotype repertoires using TCRβ CDR3 sequences


T cell receptor repertoires can be profiled using next generation sequencing (NGS) to measure and monitor adaptive dynamical changes in response to disease and other perturbations. Genomic DNA-based bulk sequencing is cost-effective but necessitates multiplex target amplification using multiple primer pairs with highly variable amplification efficiencies. Here, we utilize an equimolar primer mixture and propose a single statistical normalization step that efficiently corrects for amplification bias post sequencing. Using samples analyzed by both our open protocol and a commercial solution, we show high concordance between bulk clonality metrics. This approach is an inexpensive and open-source alternative to commercial solutions.


The receptors on the surface of T cells bind to an enormous array of antigens that play a pivotal role in shaping immune response during health and disease. The T cell receptor (TCR) is a heterodimer composed of one alpha and one beta chain which are encoded by the TCRα and TCRβ genes, respectively. To recognize an extremely large antigen space, the TCR genomic loci undergo somatic recombination of variable (V), diversity (D), and joining (J) gene segments, and generate a diverse repertoire of TCRs. The complementarity determining region 3 (CDR3) region present at the D segment of the recombined TCRβ gene is highly diverse in TCR beta chains. Therefore, surveying the recombined TCRβ gene or transcript as a proxy for overall TCR repertoire diversity has emerged as a rational approach to study TCR repertoire dynamics.

Over the last 10 years it became possible to obtain comprehensive profiles of TCR through an array of next generation sequencing (NGS) based approaches [1,2,3,4,5,6,7,8]. These vary based on the experiment type (bulk or single-cell), sample type (RNA or DNA), library preparation (multiplex PCR, bead-based enrichment, 5’RACE) and sequencing platform, each choice presenting a different, and fascinatingly interlocked trade-off. Single cell approaches compared to bulk can be very accurate and unbiased for high frequency clones, but have lower resolution for low frequency clones [9, 10]. They are also substantially more expensive and require intact cells. RNA-based approaches are affected by the variability in TCR RNA expression levels, but may better reflect diversity when the sample size is limited [8]. Genomic DNA (gDNA)-based approaches require either multiplex PCR or target enrichment during library preparation which introduces biases [11]. A recent comparison of these two approaches confirmed these issues for both RNA- and DNA-based methods but also found the methodological variability to be smaller than the biological variability [12]. New innovations are being introduced rapidly for all of these approaches but currently there is no established gold standard.

Multiplexed PCR-based bulk sequencing approaches using gDNA, however, have become the standard approach for translational and even clinical applications due to reasonable sample requirements and moderate costs [13]. This is reflected in the fact that all currently available commercial TCR sequencing products offer this as their major DNA option (Adaptive Biotechnologies [14], BGI [15], iRepertoire). Multiplex PCR refers to the usage of multiple forward primers specific for the V segments and multiple reverse primers specific for the J segments in combination during the initial amplification for target enrichment. Since each primer pair will have a different efficiency multiplexing will distort the relative abundances of the VDJ segment combinations [16]. Correcting for this amplification bias is a key challenge for accurate quantification. One approach [14] is using spiked-in oligomers (synthetic templates) for each primer pair to measure primer efficiencies and to carefully control the design of the primers and their concentrations accordingly. Assuming that there is no interaction between the efficiency of primer pairs, amplification bias is reduced by iteratively calibrating the primer concentrations to find the optimal primer mix, and then removing any remaining amplification bias computationally using spiked-in oligomer counts.

The proprietary setup described in [14] is currently available for human and murine samples exclusively through commercial kits. These are difficult and labor-intensive to adapt if requiring different settings, e.g. a different mammal or application. Furthermore, synthetic templates when added to all of the samples with the kits can substantially increase preparation and sequencing costs, as well as decrease coverage for clonotypes.

Here we demonstrate, that using a negative binomial mean normalization strategy, it is possible to normalize amplification bias within a reasonable error margin without balancing primer concentrations using synthetic templates. Furthermore, we report that amplification bias parameters are highly preserved between datasets. This observation indicates that one can use synthetic template normalization controls for a small subset of samples, which can then be used to calculate mean scaling factors per batch/experiment, thus substantially reducing the cost and coverage of TCR repertoire analysis.


Amplification bias due to multiplex PCR is reproducible

To amplify all possible TCR somatic recombination products, we performed a multiplex PCR with 20 different V-specific forward primers and 13 different J-specific reverse primers. Since the differences in efficiency of primer pairs can produce significant amplification bias in TCR clonotypes, we spiked-in 260 synthetic TCR templates (ST) in equimolar concentration as internal controls to the multiplex PCR reaction in order to measure and control bias (Fig. 1A). We hypothesized that the estimated mean counts of the 260 ST would scale proportionally across samples and experiments. To test this, we measured the distribution and variation of ST counts across 20 ST-only samples in the absence of genomic DNA (Fig. 1B). Within each sample, ST counts were converted to relative frequencies to reduce the effect of random sample-to-sample variation on the comparisons. For a given ST, deviation of the median relative frequency across all samples from 1/260 is a measure of the PCR amplification bias for the corresponding primer pair, while the spread of relative frequencies is an indication of the random sample-to-sample variation. We observed that the ST-to-ST variation was much larger than the sample-to-sample variation within an individual ST (~ 20-fold difference in respective median IQRs). These differences indicate that the observed differences in ST counts are primarily caused by amplification bias of different primer pairs. The ln (1/260) line indicates the (log) expected ST relative frequency in the absence of amplification bias. To demonstrate that the above observations apply in the presence of TCR clonotypes, we obtained ST counts from samples with genomic DNA extracted from P14 and OT-1 TCR transgenic mice where CD8 T cells primarily recognize OVA257-264 when presented by the MHC I molecule. A similar relationship was found between ST relative frequencies in the presence of TCR clonotypes across samples, though the sample-to-sample variation is noticeably larger (Supplementary Fig. 1).

Fig. 1
figure 1

Pipeline overview and ST proportion distributions. (A) Overview of the TCR sequencing analysis pipeline: A. gDNA extracted from freshly resected murine peripheral blood, peritoneal orthotopic mesotheliomas, and spleen tissues. B. Equimolar mixture of synthetic TCR templates (ST) were then added, and C. followed by addition of forward and reverse sequences of ST (top) and TCRβ (bottom). For ST, universal 9-bp barcode (grey), unique 16-bp barcode (purple). For TCRβ, V-region (red), VD/DJ-junctions (grey), D-region (purple), J-region (green). D. Samples amplified with multiplex PCR followed by second-stage barcoding PCR. E. Samples were then pooled for sequencing. F. ST and TCRβ were then separated using universal barcodes, and G. ST was quantified using unique barcodes. H. TCRβ clonotypes were quantified with the MiXCR tool suite. I. Negative binomial normalization was used to remove amplification bias. J. Scaling factors were applied to counts, and K. then used to normalize counts for diversity analyses. Figure created with (B) Plot showing stability for relative frequencies of ST counts within individual samples, and amplification bias based on the reproducibility of the multiplex PCR. Median IQR of ST-to-ST variation was ~20-fold greater than the median IQR of experiment-to-experiment relative frequencies of ST counts. Target values aim to be at the ln (1/260) line in the absence of amplification bias. Data derived from twenty ST-only samples described in the ST-only data sets

A negative binomial model fits the data

The fit of count data on 260 ST from each of 20 ST-only samples to the negative binomial (NB) with ST-specific means and a common dispersion parameter d can be informally assessed by examining an empirical variance (v) vs mean (m) plot with the line v = m + dm2 displayed, all with log–log scales.

At the high end, we expect an approximate linear relationship between log mean and log variance for the ST counts, with a slope of about 2, since log v ≈ log d + 2log m for large m. In Fig. 2, we took d = 0.125, as this was the median value of the dispersion estimates found by fitting separate NB distributions to the 260 sets of 20 ST counts.

Fig. 2
figure 2

A negative binomial model fits the data. Observed mean-variance relationships fit as compared to the negative binomial (NB) model. Dots are observed values derived from 260 ST count distributions with a separately fitted dispersion parameter. The red line is the relationship predicted by the NB model with fixed d (0.125). Data derived from twenty ST-only samples described in the ST-only data sets

The fact that the ST counts fit NB distributions with approximately similar overdispersion parameters reassures us the variation we are seeing is in some sense natural and that the system is in control [17].

Scaling factors are relatively stable across experiments

We fit the NB model to three different sets of ST-only observations at different equimolar concentrations, coming from two different batches: (1) a set of 20 samples from one batch, (2) two sets of 10 samples from another batch. To demonstrate that scaling factors were relatively stable across ST-only samples, we compared the sets of 260 \(({m}_{i}/{m}_{\bullet})\) values based on 20 ST-only samples, to those based on the sets of 10 ST-only samples. We observed the \(({m}_{i}/{m}_{\bullet})\) values to be highly correlated on a log–log scale (Pearson r = 0.83 and 0.97, p-value < 2 × 10–16 for both comparisons) (Fig. 3A-B), indicating that scaling factors are relatively stable across experiments.

Fig. 3
figure 3

Stability of scaling factors. A, B Scatter plots (log-log scales) comparing the sets of 260 \(({m}_{i}/{m}_{\bullet})\) values based on 20 ST-only samples to those based on the two sets of 10 ST-only samples (described in the Data sets section). The \(({m}_{i}/{m}_{\bullet})\) values are observed to be highly correlated across different batches and samples. (Pearson r = 0.97 and 0.84, p-value < 2 × 10–16 for both comparisons)

We then computed estimates of the (mi) by pooling data across these sets, adjusting for the concentration differences, calling the results the combined estimates. Pooling also deals with errors introduced to the ST-only counts by processing samples in different batches along with different samples. The combined estimates of the (mi) are therefore used for normalization below (Supplementary Appendix 1).

Normalization considerably reduces amplification bias

We normalized the 20 ST-only measurements with the combined estimates of the (mi). The spread of the ST counts for 20 samples was considerably reduced after normalization (Fig. 4A). A similar comparison of ST counts for 20 samples, in the presence of genomic DNA derived from mesothelioma tumors, revealed that the reduction in spread of ST normalized counts was present, although less pronounced (Supplementary Fig. 2). To validate the normalization procedure further, we assessed the observed ratio of monoclonal counts before and after normalization utilizing the 50:50 mixture of P14 and OT-1 TCR transgenic monoclonal DNA. After normalization, the differences between the transgenic TCR counts were substantially reduced for all samples, as represented by the smaller deviations of the proportion of the dominant clonotype from ½ following normalization (Fig. 4B).

Fig. 4
figure 4

Normalization reduces amplification bias. A NB normalization reduces amplification bias. Variation in ST counts within samples is apparent before normalization (red), however, the ST-to-ST differences within each sample are reduced to less than two-fold after normalization (cyan). Data derived from twenty ST-only samples described in the ST-only data sets. B Amplification bias of spleen genomic DNA of P14 and OT-1 TCR transgenic mice. To further evaluate normalization parameters, a 50:50 mixture of P14 and OT1 TCR transgenic monoclonal DNA was utilized to examine differences between transgenic TCR counts (red) that were reduced for all samples following normalization (cyan), as observed by the smaller deviations from ½ of the proportions of the dominant clonotype. Red color indicates values from clone counts before normalization, green color indicates values from normalized clone counts. Data derived from 12 50:50 mixture of P14 and OT-1 transgenic mice samples described in the Transgenic TCR data sets

Amplification bias reduction benefits from the dependence of primer pairs

A question from data presented in Fig. 4A and B that arose had to do with determining how great a reduction in the spread of the 260 ST counts would be possible, given the variation, even in the absence of amplification bias. A theoretical analysis is presented in Supplementary Appendix 2 under the assumption that counts from equimolar concentrations of the 260 ST are independent NB distributions with the same mean and dispersion parameters gives a lower bound. Seeing that counts from the 20 ST-only samples were well approximated by NB distributions with the same dispersion parameters (albeit with quite different means), our conclusion was that the different ST counts were likely not independent. Further, this result supports the normalization scheme, as the dependence aids reduction of the amplification bias below the level that would be expected under independence. Since each primer pair shares one primer with 32 other primer pairs, it is not surprising that the different ST counts are not independent; indeed, patterns of dependence in counts using chi-squared statistics are observed (Fig. 5). The interactions revealed as patches of red and blue colors demonstrate that groups of V primers exhibit positive or negative dependence together with groups of J primers, that is, they interact to become over or under-represented in groups.

Fig. 5
figure 5

Cluster analysis revealing dependence between forward and reverse primers. A heatmap of interaction terms between forward and reverse primers reveal widespread and reproducible deviations from expected efficiencies under independence. Interaction terms were calculated as signed Pearson residuals (O-E)/√E where O is the observed ST count and E is the expected ST count under independence, calculated as E = row total x column total / grand total. Blue and red colors indicate positive and negative deviation from independence between forward and reverse primers respectively, and white indicates no deviation. Data derived from twenty ST-only samples described in the ST-only data sets

In their experimental setup, Carlson et al., using an ANOVA based approach, concluded that primer pairs can be treated independently, and employed primer iteration experiments to find the optimal primer mix [14]. Based on our normalization approach, however, the non-random, non-zero interaction terms revealed by chi-squared statistics substantially complicate preparation of an optimal primer mix through primer iteration experiments, and indicate a limit on the extent to which amplification bias can be addressed experimentally. We note that Carlson et al. also employed a second, computational normalization step, which we believe is primarily due to this limit.

A negative binomial model supports downstream analyses of T cell repertoire dynamics

With the TCR clonotype counts normalized, we wanted to determine if results from these analyses were concordant with results from a commercially available platform (Adaptive Biotechnologies) described by Carlson et al. [14]. To achieve this, we utilized gDNA generated from pancreatic ductal adenocarcinoma specimens described in Byrne et al. [18] containing 17 samples previously sequenced by Adaptive Biotechnologies’ platform. Aliquots of samples were sequenced and processed using the protocol described above. We utilized in-house software (see Material and methods), tcR package, and VDJ tools to compute TCR repertoire metrics such as the diversity, clonality, and clonal distribution and refer to this pipeline as Open TCR Sequencing Protocol (OTSP). The 16 samples with enough DNA were sequenced in parallel and results were evaluated for concordance using Spearman correlation analyses for all combinations of: 1) Adaptive Biotech. platform sequences run by OTSP; 2) OTSP sequences run by OTSP; and 3) Adaptive Biotech. platform sequences run by Adaptive Biotech. Figure 6A and B show concordance between results of OTSP sequences run by OTSP and Adaptive Biotech. sequences run by OTSP for Clonal index (r = 0.7) and Shannon diversity index (r = 0.8). The concordance for other combinations for Clonal Index and Shannon Diversity, as well as concordance between results of OTSP sequences run by OTSP and Adaptive Biotech. platform sequences run by Adaptive Biotech. for the frequency of hyperexpanded clones are shown in Supplementary Fig. 3A-E. Overall, results from the OTSP pipeline demonstrates a strong association with results from the Adaptive Biotech. TCR sequencing platform (p < 0.001 for all comparisons).

Fig. 6
figure 6

Concordance analysis of T cell repertoire metrics. Concordance analysis comparing commercial and in-house pipelines where samples were evaluated based on the results of OTSP sequences run by OTSP and the Adaptive Biotech platform sequences run by OTSP for Clonal Index (A) and Shannon Diversity Index (B). Spearman r = 0.8 for Clonal Index and r = 0.7 for Shannon Diversity Index. (p < 0.001 for both comparisons). Data derived from samples described in the Byrne et al. data set

The descriptions and formulas related to Clonal index, Shannon diversity index and frequency of hyperexpanded clones are included in [19], which deploys the NB mean normalization methodology described here in a biological context.


Measuring and monitoring adaptive dynamics in patient TCR repertoires could have a significant impact on response and resistance monitoring for patients receiving various forms of immunotherapy in the treatment of cancer or auto-immune diseases. To achieve the goal of capturing the diversity, quantifying the abundance of T-cell clones and performing longitudinal comparisons, TCR sequencing has emerged as an approach to monitor T cell responses to therapy and disease progression.

The OTSP pipeline described herein provides a transparent protocol enabling clonality metrics, including evaluating amplification bias specific to each primer pair, and is reproducible across samples. Count variability approximates the negative binomial distribution that can be exploited using estimated NB means. The NB distribution tells us the variation anticipated in ST counts is indicative of a controlled process.

OTSP pipeline results were compared to data generated from the commercial platform by fixing 260 ST-specific scaling factors derived from ST-only measurements (without the presence of genomic DNA). We observed a high concordance between bulk clonality metrics across the two platforms. This observation indicates that the OTSP pipeline can be integrated across batches, samples and platforms, further improving utility of TCR clonality measurements, something not generally possible when using commercial platforms as ST-counts are typically not provided.

OTSP is open and freely available; we anticipate this will allow scaling up the number of measurements substantially. Since we only use computational normalization, rather than addressing differing primer efficiencies at the bench level, the OTSP methodology is also less labor-intensive than previous methods. Notably, OTSP avoids the primer iteration experiments needed to address amplification bias problems. Porting of the platform to new organisms or designs will require new calibration. Utilizing the designs presented here for mouse are likely to require minimal optimization.

Since PCR amplification bias is repeatable across samples we have demonstrated the possibility of conducting analyses without the addition of ST beyond the initial calibration. This is achieved by using ST-specific normalization scale factors obtained from independent, ST-only measurements. The idea of using 260 universal ST-specific scaling factors to address amplification bias in mouse models could be explored further, for if deployed this would substantially decrease cost of this methodology as ST are costly. An additional advantage stems from the fact that ST reduce sequencing depth due to competition between genomic DNA and the ST (Supplementary Fig. 4). We observed that when gDNA amount was kept constant at 600 ng, increasing concentrations of ST led to decreasing detectability of clonotypes, indicating a competition between the ST and clonotypes during the process. This is especially important as indicated by our results revealing that drop-outs can be frequent, even for the most abundant clonotypes (Supplementary Fig. 5). For example, when the most frequent 0.3% clonotypes from Wild Type spleen tissue were used, we observed that only 119 distinct clonotypes were detected in the 5 replicates, with 63 clonotypes detected in 4 samples, 69 in 3, and 90 in 2, respectively. These drop-outs stem from the stochastic nature of the sampling and could be reduced by increasing the read coverage by not using ST.


Measuring and tracing changes in the abundance of clones is important for observing the immune response to different therapies such as cancer immunotherapy, leukemia monitoring and predicting patient outcomes.

A high concordance between bulk clonality metrics across Adaptive Biotech., one of the leading commercial companies in the field, and OTSP is observed. Our approach is a less laborious (as OTSP avoids the primer iteration experiments needed to address amplification bias problems), reproducible and an inexpensive alternative for understanding relative abundances of clonotypes. Utilization of STs for every experiment in a batch beyond the initial ST calibration may not be necessary and their utilization in every batch decreases the sequencing depth.

We propose the OTSP as an open-source, transparent protocol that efficiently corrects for amplification bias post sequencing for accurate, reproducible measurement of clonality metrics.

Material and methods

The aim of the methodology is modeling T Cell Clonotype repertoires using TCRβ CDR3 sequences utilizing NGS, multiplex PCR and an NB mean normalization strategy. Some of the experiments were performed in the context of genomic DNA derived from mice.

Mouse handling

To generate the Byrne et al. data set, wild-type C57BL/6 mice were purchased from The Jackson Laboratory and housed at the University of Pennsylvania. Animal protocols were reviewed and approved by the Institute of Animal Care and Use Committee at the University of Pennsylvania. Mice were euthanized in a CO2 chamber using a flow meter to ensure CO2 was displaced at a rate of 30–70% of the chamber volume per minute and maintained for at least 1 min after the loss of righting reflex is observed. Euthanasia was confirmed by bilateral thoracotomy. Animal handling information regarding tumor injections, drug prep and injection are included in Byrne et al. [18].

To generate transgenic TCR data sets, spleen tissue from P14 and OT1 TCR transgenic mice (Nolz lab) were obtained. The spleen tissue was homogenized, treated with RBC lysis buffer and the resulting single-cell suspension was pelleted. Genomic DNA was extracted from the cell pellet using DNeasy blood and tissue kit (Qiagen).

To generate all the other data using animals, wild-type C57BL/6 mice were purchased from The Jackson Laboratory and maintained within the UCSF or OHSU laboratory for animal care barrier facility according to IACUC procedures. To generate the ST dilution series data set with mammary tumor and the Mesothelioma data set, we used mammary tumors from mouse mammary tumor virus (MMTV)- Polyomavirus middle T (PyMT) transgenic FVB/N mice and 40L orthotopic mesothelioma tumors respectively. Mammary tumors were resected from day 95 mice. For 40L orthotopic mesothelioma tumors, 2 × 106 cancer cells were injected i.p. into wild-type male C57BL/6 that were 6–12 weeks of age. For both tumor models, all mice were euthanized at a pre-defined end-stage for tissue harvest by cardiac puncture followed by cervical dislocation. These mice were cardiac perfused, under anesthesia using 1–5% isoflurane, with 20 mL solution of heparin in PBS to clear tissues of residual blood followed by tissue harvest for further analysis including TCR sequencing. Tumor tissue was excised and flash frozen in liquid nitrogen and stored at -80 degree Celsius until further use for extracting genomic DNA for TCR sequencing. Murine SCC tumors were obtained from a previously published study, Medler et al. [19].

Multiplex primers and design of synthetic TCR templates

Multiplex PCR primers previously described by Faham et al. (US patent 8,628,972 B2), for the amplification of murine TCRβ genomic loci were utilized (Supplementary Table 1: primer sequences). The 20 Vβ segment specific primers amplify all the 21 functional Vβ segments, and the 13 Jβ specific primers amplify all the 13 functional Jβ segments. As previously described by Carlson et al., we designed 260 (20 V x 13 J) synthetic TCR templates (ST) to minimize amplification bias due to multiplexing with 20 Vβ forward and 13 Jβ reverse primers [14]. Briefly, ST are 200 bp long double stranded DNA segments that contain partial V segment and J segment sequences encompassing a set of internal barcodes for post-sequencing identification. The internal barcode region contains a 16 bp barcode specific for each VJ combination. This specific barcode is further flanked by a 9 bp barcode that is common for all ST. An equimolar mixture of the 260 ST is added to the genomic DNA samples during PCR as internal controls.

Amplification and deep sequencing of TCRβ genomic locus

Genomic DNA from freshly resected mouse peripheral blood, peritoneal orthotopic mesotheliomas [20], and spleen was isolated using the Qiagen DNeasy Blood and tissue kit. Utilizing the method described in Robins et al. [21], we performed a 2-stage PCR using genomic DNA for TCRβ deep-sequencing library preparation. The 1st stage involved amplifying the gDNA and the ST using 35 cycles of multiplex PCR with 20 Vβ forward and 13 Jβ reverse primers using the Qiagen multiplex PCR kit. The multiplex PCR primers contain a common 5’ overhang, allowing amplification by a single primer pair in the 2nd stage PCR. Using 2.0% of the purified PCR product from stage 1 as template, a 2nd stage PCR, including 8 cycles, was performed with universal and indexed Illumina adaptors. Of note, the indexed adaptors contained an 8-base index sequence, providing each sample with a unique sample barcode. Equal volumes of all samples were pooled. Each pool concentration, typically containing PCR mixtures from 70 samples, was measured with a 2200 TapeStation (Agilent), and the concentration determined by real time PCR using a StepOne Real Time Workstation (ABI/Thermo) with a commercial library quantification kit (Kapa Biosystems). Paired-end sequencing was performed with a 2 × 150 protocol using a Midoutput 300 sequencing kit on a NextSeq 500 (Illumina). Target clustering was ~160 million clusters per run. Following the run, base call files were converted to fastq format and demultiplexed by a separate barcode read using the most current version of Bcl2Fastq software (Illumina).

TCR data analysis pipeline

Fastq files were assessed for initial read quality using the FASTQC public tool [22], including the per-base quality scores. Quality paired-end sequences were combined using the PEAR (Paired-End reAd mergeR) algorithm [23]. Merged sequences were then separated into ST and non-ST sequences. ST sequences were identified by searching for the common flanking 9-bp internal barcodes allowing a one-nucleotide mismatch or indel. Sequences flagged as ST via this search were removed from downstream clonotype analyses. The individual ST sequences were distinguished and quantified by searching for the specific 16-bp barcode sequences unique to each ST, again allowing a one-nucleotide mismatch or indel (Supplementary Fig. 6). Clonotypes were identified from purified (ST-removed) sequences utilizing the MiXCR pipeline [24], which is a two-step alignment and assembly process. First, reads were aligned to reference V, D, and J sequences, using the align module. Next, the assemble module grouped alignments into distinct clonotypes using a hierarchical clustering method based on sequence similarity and relative abundance. Finally, the export module exported alignments as well as assembled clones in tabular format. Raw clonotype counts were normalized using the NB mean normalization strategy described below. Normalized clonotype counts were exported in tabular format for use in downstream analysis. A number of TCR repertoire metrics, including clonality, maximum clonal frequency, and the Shannon diversity index were calculated. Quality control data was recorded in an overall summary table, for reference (Supplementary Material 1).

Data sets

ST-only data sets

Twenty samples of an equimolar mixture of ST were sequenced. Two sets of ten samples of equimolar mixtures of ST at different concentrations were also sequenced at a later date. No genomic DNA was present in these samples.

Transgenic TCR data sets

A dilution series was created from spleen genomic DNA of P14 and OT-1 TCR transgenic mice. Three technical replicates of 300, 600, 900, and 1200 ng of DNA from P14, OT1, and a 50:50 mixture of P14 and OT1 DNA were sequenced along with an equimolar mixture of ST for a total of 36 samples. 4 mice were used to create the DNA from P14, another 4 mice were used to create the DNA from OT1 and 8 mice were used to create 50:50 mixture of P14 and OT1 DNA. In total, 16 mice were used to create this data set. Appropriate amplification of the transgenic clones was assessed (Supplementary Fig. 7).

ST dilution series data set

A dilution series was created from ST at six levels as below:

* Stock: 3.2 ng/ul (equimolar mixture of all 260 spikes)

Dil1 = 0.1 of stock

Dil2 = 0.1 of Dil1

Dil3 = 0.01 of Dil1

Dil4 = 0.01 of Dil3

Dil5 = 0.001 of Dil3

Dil6 = 0.001 of Dil5

Three technical replicates of 600 ng of murine blood DNA were added to all levels of dilutions and for no ST samples for a total of 21 samples and three technical replicates of 600 ng of mammary tumor murine DNA were added to all levels of dilutions and for no ST samples for a total of 21 samples. In total, 14 mice were used to create this data set.

WT spleen data set

Wild-type mouse spleen genomic DNA were sequenced for a total of five technical replicates. A single mouse was used to create this data set.

Byrne et al. data set [18]

gDNA from 17 murine pancreatic ductal adenocarcinoma specimens were previously sequenced by a commercially available TCR beta platform. Aliquots of these gDNA samples were obtained and sequenced along with an equimolar mixture of ST using the protocol described above. In total, 17 murine samples were used to create this previously published data set.

Mesothelioma data set

Peritoneal mesothelioma tumors derived from the 40L cell line [20] derived genomic DNA samples derived from 12 syngeneic mice were sequenced along with equimolar mixture of ST. In total, 12 mice were used to create this data set.

In total, 60 mice were used to create above datasets to assess methodological approaches to normalizing TCR repertoires. No groups of animals were compared to each other as the purpose of the study has no biological study endpoint. Therefore, no a priori sample size calculations performed.

Batch mean scaling factors

This method creates a scaling factor for each ST (and therefore for each primer pair) based on that ST’s counts among all samples, relative to all ST counts within a batch. Given a matrix \(\left({C}_{ij}\right)\) of ST counts from a batch, where i = 1,…,260 labels ST and j = 1,…,n labels samples in the batch, we denote the batch mean of the counts for ST i by \({C}_{i\bullet}={n}^{-1}{\sum }_{j}{C}_{ij}\) and the batch mean of all ST means by \({C}_{\bullet\bullet}={(260)}^{-1}{\sum }_{i}{C}_{i\bullet}={(260n)}^{-1}{\sum }_{i,j}{C}_{ij}\). The scaling factor (SF) for ST i in that batch is \({SF}_{i}= {C}_{i\bullet}/{C}_{\bullet\bullet}\).

Negative binomial means

The above idea of a scale factor is distribution free, but for its use in normalizing counts, would require a full set of ST in every sample. We explored the use of a ST-specific negative binomial model to dispense with the use of synthetic templates. Consider the set (C1, …., Cn) of counts for single, fixed ST across a batch of n replicate ST-only samples. A plausible model for these counts is the negative binomial (NB) distribution. We write \(C \sim NB\left(m,d\right)\) for this distribution, where m > 0 is the mean parameter and d ≥ 0 is the (over-) dispersion parameter, and refer to [25] for an explicit formula for the NB probability mass function. For present purposes, C has expected value \({\varvec{E}}\left(C\right)=m\) and variance \({\varvec{v}}{\varvec{a}}{\varvec{r}}\left(C\right)=m+d{m}^{2}\). When d = 0, the negative binomial reduces to the Poisson distribution, for which \({\varvec{E}}\left(C\right)={\varvec{v}}{\varvec{a}}{\varvec{r}}(C)\), and thus the use of the term over-dispersion here is relative to the Poisson. Using the methods of generalized linear models [25], we can obtain the maximum likelihood estimates (MLE) \(\widehat{m} \mathrm{\,and\,} \widehat{d}\) of m and d from a replicate set of ST such as (C1, …., Cn). In the notation of the previous paragraph, the MLE \({\widehat{m}}_{i}={C}_{i\bullet}\), that is, the MLE of the ith mean parameter \({m}_{i}\) of an NB fitted to (Cij) is the arithmetic mean of the ith set of observed counts (assumed independent and identically distributed across j = 1,…,n with common mean \({m}_{i}\) and common dispersion parameter. Where no confusion will result in what follows, we will not distinguish the parameters (mi) from their (maximum likelihood) estimates \(({\widehat{m}}_{i})\). These ST or primer-pair-specific means estimated from ST-only data can be used as scaling factors for normalization, even when there are no ST present in samples. Consistent with the notation in the previous paragraph, we write \({m}_{\bullet}={(260)}^{-1}{\sum }_{i}{m}_{i}\) for the average of the 260 (estimated) mean parameters. The 260 NB mean scaling factors are \({(m}_{i}/{m}_{\bullet})\).


To normalize a set of clonotype counts from a single sample, we first calculated the primer-pair totals (Ci), where Ci denotes the total count of all clonotypes amplified with primer-pair i, where i = 1,…,260. We then normalize the 260 counts (Ci) using the batch scaling factors (SFi) by dividing by the corresponding scaling factor: \({C}_{i}^{^{\prime}}={({SF}_{i})}^{-1}{C}_{i}=({C}_{\bullet\bullet}/{C}_{i\bullet}) {C}_{i}\). Similarly, we normalize the (Ci) using the (estimated) NB mean scaling factors \({(m}_{i}/{m}_{\bullet})\) by dividing: \({C}_{i}^{\prime\prime}=({m}_{\bullet}/{m}_{i}) {C}_{i}\). After these primer-pair totals were normalized, the counts for distinct clonotypes sharing the same primer-pair were normalized: if one such clonotype accounts for a proportion p of the total count C corresponding to its primer pair, then it will be assigned a normalized value equal to the same proportion p of the normalized primer-pair total C’ or C”.

The same normalization could be used for ST counts if available. That is, divide the observed count of the fragments arising from primer pair i by the SFi or \({m}_{i}/{m}_{\bullet}\) for primer pair i for both ST and clonotype counts alike. Since the mean count of ST i will be proportional to \({m}_{i}\), normalization should preserve the total count of ST, exactly for batch mean normalization, on average for NB normalization. As long as the observed clonotype counts exhibit the same relative over- and under-representation after amplification as that exhibited by the ST, any bias will be reduced by this normalization.

Availability of data and materials

The source code for the analysis is available at

The datasets used and/or analyzed during the study are available from the corresponding author on reasonable request.



T Cell Receptor


Next Generation Sequencing


Complementarity Determining Region 3


Synthetic Template


Paired-End reAd mergeR


Negative Binomial


Open TCR Sequencing Protocol


Scaling Factor


Maximum Likelihood Estimation


  1. Baum PD, Venturi V, Price DA. Wrestling with the repertoire: the promise and perils of next generation sequencing for antigen receptors. Eur J Immunol. 2012;42(11):2834–9.

    Article  CAS  PubMed  Google Scholar 

  2. Benichou J, Ben-Hamo R, Louzoun Y, Efroni S. Rep-Seq: uncovering the immunological repertoire through next-generation sequencing. Immunology. 2012;135(3):183–91.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Calis JJ, Rosenberg BR. Characterizing immune repertoires by high throughput sequencing: strategies and applications. Trends Immunol. 2014;35(12):581–90.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Han Y, Li H, Guan Y, Huang J. Immune repertoire: a potential biomarker and therapeutic for hepatocellular carcinoma. Cancer Lett. 2016;379(2):206–12.

    Article  CAS  PubMed  Google Scholar 

  5. Hou XL, Wang L, Ding YL, Xie Q, Diao HY. Current status and recent advances of next generation sequencing techniques in immunological repertoire. Genes Immun. 2016;17(3):153–64.

    Article  CAS  PubMed  Google Scholar 

  6. Nikolich-Zugich J, Slifka MK, Messaoudi I. The many important facets of T-cell repertoire diversity. Nat Rev Immunol. 2004;4(2):123–32.

    Article  CAS  PubMed  Google Scholar 

  7. Six A, Mariotti-Ferrandiz ME, Chaara W, Magadan S, Pham HP, Lefranc MP, et al. The past, present, and future of immune repertoire biology - the rise of next-generation repertoire analysis. Front Immunol. 2013;4:413.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Woodsworth DJ, Castellarin M, Holt RA. Sequence analysis of T-cell repertoires in health and disease. Genome Med. 2013;5(10):98.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Han A, Glanville J, Hansmann L, Davis MM. Linking T-cell receptor sequence to functional phenotype at the single-cell level. Nat Biotechnol. 2014;32(7):684–92.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Redmond D, Poran A, Elemento O. Single-cell TCRseq: paired recovery of entire T-cell alpha and beta chain transcripts in T-cell receptors from single-cell RNAseq. Genome Med. 2016;8(1):80.

    Article  PubMed  PubMed Central  Google Scholar 

  11. Dziubianau M, Hecht J, Kuchenbecker L, Sattler A, Stervbo U, Rodelsperger C, et al. TCR repertoire analysis by next generation sequencing allows complex differential diagnosis of T cell-related pathology. Am J Transplant Off J Am Soc Transplant Am Soc Transplant Surg. 2013;13(11):2842–54.

    Article  CAS  Google Scholar 

  12. Liu X, Zhang W, Zeng X, Zhang R, Du Y, Hong X, et al. Systematic comparative evaluation of methods for investigating the TCRbeta repertoire. PLoS One. 2016;11(3):e0152464.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Rosati E, Dowds CM, Liaskou E, Henriksen EKK, Karlsen TH, Franke A. Overview of methodologies for T-cell receptor repertoire analysis. BMC Biotechnol. 2017;17(1):61.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Carlson CS, Emerson RO, Sherwood AM, Desmarais C, Chung MW, Parsons JM, et al. Using synthetic templates to design an unbiased multiplex PCR assay. Nat Commun. 2013;4:2680.

    Article  PubMed  Google Scholar 

  15. Zhang W, Du Y, Su Z, Wang C, Zeng X, Zhang R, et al. IMonitor: a robust pipeline for TCR and BCR repertoire analysis. Genetics. 2015;201(2):459–72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Okino ST, Kong M, Sarras H, Wang Y. Evaluation of bias associated with high-multiplex, target-specific pre-amplification. Biomol Detect Quantif. 2016;6:13–21.

    Article  CAS  PubMed  Google Scholar 

  17. Shewhart WA, Deming WE. Statistical method from the viewpoint of quality control. Mineola: Dover Publications; 1986.

  18. Byrne KT, Vonderheide RH. CD40 stimulation obviates innate sensors and drives T cell immunity in cancer. Cell Rep. 2016;15(12):2719–32.

    Article  CAS  PubMed Central  Google Scholar 

  19. Medler TR, Murugan D, Horton W, Kumar S, Cotechini T, Forsyth AM, et al. Complement C5a fosters squamous carcinogenesis and limits T cell response to chemotherapy. Cancer Cell. 2018;34(4):561-78.e6.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Goodglick LA, Vaslet CA, Messier NJ, Kane AB. Growth factor responses and protooncogene expression of murine mesothelial cell lines derived from asbestos-induced mesotheliomas. Toxicol Pathol. 1997;25(6):565–73.

    Article  CAS  PubMed  Google Scholar 

  21. Robins HS, Campregher PV, Srivastava SK, Wacher A, Turtle CJ, Kahsai O, et al. Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood. 2009;114(19):4099–107.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Andrews S. FASTQC. A quality control tool for high throughput sequence data. FASTQC A quality control tool for high throughput sequence data. Accessed 21 Dec 2017.

  23. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics. 2014;30(5):614–20.

    Article  CAS  PubMed  Google Scholar 

  24. Bolotin DA, Poslavsky S, Mitrophanov I, Shugay M, Mamedov IZ, Putintseva EV, et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods. 2015;12(5):380–1.

    Article  CAS  PubMed  Google Scholar 

  25. McCullagh P, Nelder J. Generalized linear models II. London: Chapman and Hall; 1989.

    Book  Google Scholar 

Download references


The authors thank the Knight Cancer Institute (NCI P30 CA069533), Bioinformatics, and OHSU Massively Parallel Sequencing shared resources. We are grateful to members of the Coussens Lab, Spellman Lab and Speed Lab for critical discussions, to Meghan Lavoie for technical assistance, and Justin Tibbitts and Teresa Beechwood for research regulatory oversight and animal husbandry.


LMC acknowledges support from the NIH/NCI (CA130980, CA155331, CA163123), a DOD BCRP Era of Hope Scholar Expansion Award (W81XWH-08-PRMRP-IIRA), the Susan G. Komen Foundation (KG110560), and the Brenden-Colson Center for Pancreatic Health. AM, RHV, and LMC acknowledge support from a Stand-Up-To-Cancer Lustgarten Foundation Pancreatic Cancer Convergence Dream Team Translational Research Grant (SU2C-AACR-DT14-14), as well as R01 CA217176 (to RHV), and the American Cancer Society (125403-PF-14-135-01-LIB) to KTB.

Author information

Authors and Affiliations



BG analyzed and interpreted all of the ST and clonotype count data described under “Data sets”, and built the statistical models and methods which were used to address the central amplification bias problem resulting from the utilization of TCR sequencing with multiplex PCR. WH processed raw sequencing fastq files and deployed MIXCR software to achieve ST and clonotype counts. DM performed lab experiments to generate ST-only data sets, Transgenic TCR data sets, ST dilution series data set, WT spleen data set and Mesothelioma data set described under “Data sets”. BZ contributed to our understanding of the count dynamics and the platforms by analyzing several datasets. PL supervised WH, and contributed to the initial set up of the fastq file conversion and deployment of software to achieve counts. SK generated the Mesothelioma data set samples. KTB and RHV generated and shared the Byrne et al. data set [18]. AAM outlined the aim of the project and supervised BG, PL and WH. MM built the multi-disciplinary effort setup to achieve statistical methods and analytically reviewed the ms. PTS supervised and mentored BG and reviewed analyses and provided feedback. LMC provided the ultimate goal and resources, built collaborations across different labs and supervised DM. TPS advised BG, WH and BZ on data processing, analysis, interpretation and modeling and devised the NB normalization. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Burcu Gurun, Paul T. Spellman, Lisa M. Coussens or Terence P. Speed.

Ethics declarations

Ethics approval and consent to participate

All animal experiments were performed in compliance with the National Institutes of Health guidelines and were approved by the Institutional Animal Care and Use Committees (IACUC) of the University of California, San Francisco and Oregon Health & Science University.

Consent for publication

Not applicable.

Competing interests

R.L. Vonderheide is an inventor on a licensed patent relating to cancer cellular immunotherapy and cancer vaccines, and receives royalties for a licensed research-only monoclonal antibody. He is a member of the Lustgarten Therapeutics Advisory working group. R.L. Vonderheide reports having received consulting fees or honoraria from Medimmune and Verastem; and research funding from Fibrogen, Janssen, and Lilly.

L.M. Coussens is a paid consultant for Cell Signaling Technologies, AbbVie Inc., and Shasqi Inc., received reagent and/or research support from Plexxikon Inc., Pharmacyclics, Inc., Acerta Pharma, LLC, Deciphera Pharmaceuticals, LLC, Genentech, Inc., Roche Glycart AG, Syndax Pharmaceuticals Inc., Innate Pharma, NanoString Technologies, and Cell Signaling Technologies, is a member of the Scientific Advisory Boards of Syndax Pharmaceuticals, Carisma Therapeutics, Zymeworks, Inc, Verseau Therapeutics, Cytomix Therapeutics, Inc., Hibercell., Inc., Alkermes, Inc., and Kineta Inc, and is a member of the Lustgarten Therapeutics Advisory working group.

P.T. Spellman is a paid consultant from Natera and Foundation Medicine Inc.

A.A. Margolin is a Venture Partner at Khosla Ventures and Chief Executive Officer at NextVivo, Inc. and was previously Chief of Data Science at Sema4, Inc.

No potential conflicts of interest were disclosed by the other authors.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplementary Figure 1.

ST Count Distribution in presence of DNA. Supplementary Figure 2. Normalization reduces spread in presence of DNA. Supplementary Figure 3. Concordance analysis of T cell repertoire metrics. Supplementary Figure 4. Competition between gDNA and ST during TCR sequencing. Supplementary Figure 5. Reproducibility analysis: Drop-outs are frequent even for the top clones. Supplementary Table 1. Primer Sequences. Supplementary Figure 6. TCR sequencing pipeline schema. Supplementary Figure 7. Monoclonal amplification check.

Additional file 2.

Additional file 3.

Additional file 4.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Gurun, B., Horton, W., Murugan, D. et al. An open protocol for modeling T Cell Clonotype repertoires using TCRβ CDR3 sequences. BMC Genomics 24, 349 (2023).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: