## Abstract

The TCR α-chain is assembled by somatic recombination of variable (V) and joining (J) gene segments at the CD4^{+}CD8^{+} stage of development. In this study, we present the first analytical model for deletional rearrangement and show that it is consistent with almost all available data on V_{α}J_{α} use in mice and humans. A key feature of the model is that both “local” and “express service” models of rearrangement can be obtained by varying a single parameter that describes the number of gene segments accessible at a time. We find that the window is much larger for V_{α} segments than J_{α} segments, which reconciles seemingly conflicting data for the former. Implications for the properties of the repertoire as a whole and experiments that seek to probe them are discussed. Special considerations for allelic inclusion are treated in the Appendices.

To provide protection from pathogens, T lymphocytes must react to an enormous variety of foreign molecules. The specificity of a clone for Ag is determined by the TCRs on its surface, which are typically heterodimers of α- and β-chains. The diversity of the TCR repertoire derives in large part from the fact that both α- and β-chains are generated during intrathymic development by somatic recombination of gene segments encoding the constant and variable portions of these molecules. The latter include variable (V), diversity (D), and joining (J) gene segments in the case of β-chains and V and J gene segments in the case of α-chains.

Obtaining a quantitative understanding of how physical factors impact rearrangement of TCR loci is important for assessing the diversity of specificities in the TCR repertoire. Loci with D segments can only rearrange each chromosome once because all but the used D segment are deleted by the primary rearrangement. In contrast, those without D can rearrange repeatedly; here, we focus on α loci. In mice, there are 104 V_{α} segments and 61 J_{α} segments; in humans, the number of J_{α} segments is about the same, but there are only about half as many V_{α} segments. When a rearrangement brings specific V_{α} and J_{α} segments together, intervening gene segments are deleted rather than inverted due to the orientation of the recombination signal sequences (1).

There is general agreement that J_{α} segments are used in an essentially sequential manner (termed “local service”), starting at the 5′ end (proximal to the V_{α} segments) and proceeding to the 3′ end (2, 3, 4, 5, 6, 7, 8). There is no such consensus concerning the V_{α} segments. One group reported preferential use of 3′ (J_{α} proximal) over 5′ (J_{α} distal) V_{α} segments (8), but another argued for nonsequential use (“express service”) of V_{α} segments based on the fact that there is only a loose correlation between V_{α}/V_{α} pairs in cells with productive rearrangements on both chromosomes (6). In the present paper, we develop a mathematical model that reconciles these seemingly conflicting observations. We estimate the V_{α} and J_{α} window sizes (*W*_{V} and *W*_{J}, respectively) from the average separations of segments on different alleles in selected cells and then show that, without further adjustment, these parameters yield good agreement with independent experimental data on V_{α} and J_{α} use in the selected TCR repertoire. We find that the window is much larger for V_{α} segments than for J_{α} segments (37 segments compared with 13 segments, respectively), which accounts for quasisequential use of V_{α} gene segments (8) with only loose correlation between alleles (6). Beyond shedding light on mechanisms of rearrangement, the model is useful for extrapolating statistics on the TCR repertoire from data obtained with limited numbers of V_{α}-specific mAbs, as demonstrated in the Appendices.

## Materials and Methods

We derived an expression for the number of possible ways each V_{α}J_{α} gene configuration can be generated and then used it to estimate the probabilities of observing in the periphery: 1) V_{α} and J_{α} segments, irrespective of the segment with which they are paired; 2) V_{α}J_{α} pairs; and 3) V_{α}/V_{α} and J_{α}/J_{α} pairs in dual TCR cells.

### Counting paths to V_{α}J_{α} gene configurations

As mentioned in the Introduction, we can treat V_{α} and J_{α} rearrangements with the same expressions by encoding the degree to which each set is used sequentially by a parameter (*W*) that describes the number of accessible gene segments. In other words, we assumed that rearrangements at any time can be made to the *W* most proximal gene segments remaining; this scheme is discussed in further detail below. A small value of *W* corresponds to more sequential use, and a large one to more random use.

For clarity, we numbered the V_{α} and J_{α} segments separately according to their initial (prerearrangement) positions starting from the most proximal ones (9) and refer to the ordered list of (V_{α} or J_{α}) segments sampled in multiple rearrangements of a chromosome as a “path” (Fig. 1). The specific goal of this section is to determine *g*(*n,k,W*), the number of paths of length *k* (i.e., those corresponding to *k* rearrangements) that end in segment *n* subject to the window constraint described above.

To this end, we determined the number of unrestricted such paths and then removed the contribution from those that violate the window constraint. We began by noting that the segment to which the last rearrangement is made is fixed, and the number of ways of choosing the remaining increasing sequence of *k* − 1 segments from the *n* − 1 prior ones was

$$\binom{n-1}{k-1}. \tag{1}$$

From these unrestricted paths, we subtracted the number of combinations with immediately sequential segments separated by more than *W* (“forbidden jumps”). For this purpose, we defined the quantity

$$f(n,k,W,m) = \binom{k}{m}\binom{n-mW-1}{k-1}, \tag{2}$$

where *m* is the number of forbidden jumps in a path. The first binomial coefficient gives the number of ways of placing the *m* forbidden jumps among the *k* rearrangements. The second gives the number of ways of choosing the gene segments for the first *k* − 1 rearrangements given that certain jumps are required to be of length greater than *W*. In the event that *n* − *mW* < 1, we take the second factor on the right side to be 0 here and below.

It is important to note that the quantity *f*(*n*,*k*,*W*,*m*) is not in itself the number of paths with *m* forbidden jumps because it also counts those with more than *m* forbidden jumps repeatedly. In particular, a path with *q* (*q* ≥ *m*) forbidden jumps contributes *q*!/[*m*!(*q* − *m*)!] times to *f*(*n,k,W,m*) because there are that many ways of choosing the minimum of *m* jumps specified to violate the window constraint. To obtain only the number of allowed paths, begin by considering *f*(*n,k,W*,0). By comparing Eqs. 1 and 2, it can be seen that *f*(*n,k,W*,0) is the number of unrestricted paths. It is thus necessary to subtract *f*(*n,k,W*,1) from *f*(*n,k,W*,0). Then paths with one forbidden jump will no longer be counted. However, doing so overcompensates with regard to the paths with more than one forbidden jump because they are counted *q*!/[1!(*q* − 1)!] = *q* times in *f*(*n,k,W*,1). Adding back *f*(*n,k,W*,2) overcompensates for paths with more than two forbidden jumps in the opposite direction, and so it is necessary to continue subtracting and adding terms with increasing numbers of forbidden jumps until the maximum possible (*k*) is reached. Generalizing, in this alternating sum of *f*(*n,k,W,m*) over *m*, the number of times paths with *q* ≥ 1 are counted is

$$\sum_{m=0}^{q} (-1)^{m} \frac{q!}{m!\,(q-m)!} = (1-1)^{q} = 0, \tag{3}$$

where the first equality derives from the binomial expansion. In other words, in an alternating sum of *f*(*n*,*k*,*W*,*m*) over *m*, contributions from paths with forbidden jumps cancel. Thus, the number of paths that satisfy the window constraint is

$$g(n,k,W) = \sum_{m=0}^{k} (-1)^{m} f(n,k,W,m). \tag{4}$$

Eq. 4 is the main result of this paper, and all subsequent expressions are derived from it.
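The counting argument can be checked numerically. The following Python sketch assumes that the alternating sum takes the form *g*(*n*,*k*,*W*) = Σ_{m} (−1)^{m} C(*k*,*m*) C(*n* − *mW* − 1, *k* − 1), with the second factor set to 0 when *n* − *mW* < 1, and compares it with a brute-force enumeration of paths. The function names `count_paths` and `count_paths_brute` are illustrative only and do not come from the original work.

```python
from itertools import combinations
from math import comb


def count_paths(n: int, k: int, w: int) -> int:
    """Inclusion-exclusion count of paths of k rearrangements ending at segment n."""
    total = 0
    for m in range(k + 1):
        top = n - m * w - 1
        if top < 0:  # n - m*W < 1: this and all later terms are taken to be 0
            break
        total += (-1) ** m * comb(k, m) * comb(top, k - 1)
    return total


def count_paths_brute(n: int, k: int, w: int) -> int:
    """Enumerate increasing paths ending at n; keep those whose jumps are all <= w."""
    count = 0
    for earlier in combinations(range(1, n), k - 1):
        path = (0,) + earlier + (n,)  # position 0 marks the unrearranged locus
        if all(b - a <= w for a, b in zip(path, path[1:])):
            count += 1
    return count


if __name__ == "__main__":
    for n, k, w in [(3, 2, 1), (3, 2, 2), (10, 3, 4)]:
        print(n, k, w, count_paths(n, k, w), count_paths_brute(n, k, w))
```

The brute-force version scales poorly with *n* and *k* and is meant only as a check on small cases; the closed form is what makes scanning over window sizes practical.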

### Probability distributions

To use Eq. 4 to estimate the probabilities of observing particular V_{α} and J_{α} segments in various contexts, we assume that the likelihoods of making productive rearrangements (*P*_{r}) and being selected (*P*_{s}) are constants. Clearly, the latter does in fact vary for specific gene segments, but here we are concerned with the overall statistics of the repertoire; moreover, we effectively average over different β-chains and CDR sequences. Given *P*_{r} and *P*_{s} as well as the window sizes, we combine these variables into case-dependent aggregate probabilities for attempting to generate a particular gene configuration and then being or not being selected (*a* and *b*, respectively).

#### V_{α} or J_{α} probability distributions.

Here, we consider the probability of observing a single V_{α} or J_{α} segment (indexed *n*), irrespective of the segment with which it is paired [*P*(*n*)]. The likelihood of making a productive rearrangement and then being selected is the product *P*_{r}*P*_{s}, and its complement is 1 − *P*_{r}*P*_{s}. We normalize these expressions by the window size for *a* and *b* because there is an equal chance of picking any accessible gene segment:

$$a = \frac{P_{r}P_{s}}{W}, \qquad b = \frac{1 - P_{r}P_{s}}{W}. \tag{5}$$

To obtain the probability of interest, we perform a sum over the numbers of rearrangements (*k* ≤ *N*_{r}, where *N*_{r} is the maximum number of rearrangements, which acts as a surrogate for time in the thymus) weighted according to the number of paths:

$$P(n) = \frac{1}{Q}\sum_{k=1}^{N_{r}} a\,b^{k-1}\,g(n,k,W), \tag{6}$$

where *Q* is a normalization factor determined by summing over all possible values of *n*. The factor *b*^{k−1} arises from the fact that to be selected in *k* rearrangements, a clone must fail to be selected in *k* − 1 previous attempts.
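As an illustration of how the single-segment distribution might be evaluated, the sketch below assumes *a* = *P*_{r}*P*_{s}/*W*, *b* = (1 − *P*_{r}*P*_{s})/*W*, and a sum over *k* = 1, …, *N*_{r} of *a* *b*^{k−1} *g*(*n*,*k*,*W*), normalized over all segments. The inclusive upper limit on *k* and the function names are assumptions made for this example.

```python
from math import comb


def count_paths(n: int, k: int, w: int) -> int:
    """Number of paths of k rearrangements ending at segment n with jumps <= w."""
    total = 0
    for m in range(k + 1):
        top = n - m * w - 1
        if top < 0:
            break
        total += (-1) ** m * comb(k, m) * comb(top, k - 1)
    return total


def segment_distribution(n_segments, w, p_r=0.3, p_s=0.03, n_r=5):
    """Return [P(1), ..., P(n_segments)] for one set of gene segments."""
    a = p_r * p_s / w          # attempt a segment, rearrange productively, be selected
    b = (1.0 - p_r * p_s) / w  # attempt a segment and fail to be selected
    weights = [
        sum(a * b ** (k - 1) * count_paths(n, k, w) for k in range(1, n_r + 1))
        for n in range(1, n_segments + 1)
    ]
    q = sum(weights)  # normalization factor Q
    return [x / q for x in weights]


if __name__ == "__main__":
    dist = segment_distribution(61, 13)
    print([round(p, 4) for p in dist[:15]])
```

With *W* = 1 the distribution is nonzero only for the first *N*_{r} segments, which is the strictly sequential limit described in the text.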

#### V_{α}J_{α} pair distributions.

In the case that we want the probability of observing V_{α} segment *n*_{V} with J_{α} segment *n*_{J} [*P*(*n*_{V},*n*_{J})], it is necessary to normalize the aggregate probabilities *a* and *b* instead by the number of possible pairs (*W*_{V}*W*_{J}):

$$a = \frac{P_{r}P_{s}}{W_{V}W_{J}}, \qquad b = \frac{1 - P_{r}P_{s}}{W_{V}W_{J}}, \tag{7}$$

where *W*_{V} and *W*_{J} are the sizes of the windows of accessible V_{α} and J_{α} segments. As above, we sum over the number of rearrangements weighted by the numbers of paths to *n*_{V} and *n*_{J}:

$$P(n_{V},n_{J}) = \frac{1}{Q}\sum_{k=1}^{N_{r}} a\,b^{k-1}\,g(n_{V},k,W_{V})\,g(n_{J},k,W_{J}), \tag{8}$$

where again *Q* is a normalization factor, but, in this case, it is computed by considering all possible pairs of V_{α} and J_{α} segments.

#### V_{α}/V_{α} and J_{α}/J_{α} distributions in dual TCR cells.

Statistical data are available for clones with two productively rearranged α loci (6, 10), so it is of interest to determine the likelihood of pairing like types of segments on different alleles. Although the expression for this joint probability is similar to that for a particular V_{α}J_{α} pair, *a* and *b* must be adjusted for dual TCR cells. Assuming for convenience that both alleles are rearranged simultaneously but independently (see Appendix and Ref. 11 for discussions of this simplification), the probability of productively rearranging both chromosomes and then being positively selected is *P*_{r}^{2}*P*_{s}. This product determines *a*. The aggregate probability *b* accounts for clones that edit their gene configurations. It is thus necessary to subtract the contributions from both selected dual and single TCR cells: *P*_{r}^{2}*P*_{s} and *P*_{r}(1 − *P*_{r})*P*_{s}, respectively (the possibility of cell death is addressed below). Normalizing by the number of gene segment pairs from the two sets of interest:

$$a = \frac{P_{r}^{2}P_{s}}{W^{2}}, \qquad b = \frac{1 - P_{r}^{2}P_{s} - P_{r}(1-P_{r})P_{s}}{W^{2}}. \tag{9}$$

The probability for selecting a cell with segment *n*_{1} from the first chromosome and segment *n*_{2} from the second is

$$P(n_{1},n_{2}) = \frac{1}{Q}\sum_{k=1}^{N_{r}} a\,b^{k-1}\,g(n_{1},k,W)\,g(n_{2},k,W), \tag{10}$$

which differs from Eq. 8 in that both counting factors use the same window size, and the normalization factor *Q* is adjusted accordingly.

### Evaluation of simplifications

The model described above makes no reference to underlying molecular details and is consistent with any mechanism that only allows a certain number of sequential gene segments to be accessible at a given time. To make the model mathematically tractable, additional simplifications were made. In particular, we assume that rearrangements can always access the *W* most proximal remaining segments. However, data for the Ig H chain locus suggest that the windows of accessible gene segments are predetermined (12). In this case, the number of segments to which rearrangements can be made varies because the ends of the windows are restricted to fixed points along the locus.

To determine how varying the mechanism for making gene segments accessible impacts the results, we performed stochastic simulations for a model with a sliding window of constant size (as in the mathematical derivations above) and one with fixed ends but variable size (as mentioned immediately above). The behavior of the latter depends on the number of rearrangements between shifts in the window position, and we tried several values. Overall, the results from the two models are similar (Fig. 2). The main difference is that, when the window moves infrequently in the case of predetermined regions of accessible gene segments, segments toward the ends of the windows are used to a somewhat greater degree. Indeed, the abrupt changes in use across the boundaries of the windows are similar in shape to distributions observed for distal J_{α} segments in cells that are forced to edit rather than die due to transgenic expression of Bcl-x_{L} (7).
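A stochastic simulation of the sliding-window variant could be organized along the following lines. This is a minimal sketch for a single allele, not the simulation code used for the figures: the function name, the fixed random seed, and the rule that the window shrinks when fewer than *W* segments remain are all assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_sliding_window(n_segments, w, p_r=0.3, p_s=0.03, n_r=5,
                            n_cells=100_000, seed=0):
    """Return the relative use of each segment among selected cells."""
    rng = random.Random(seed)
    usage = Counter()
    for _ in range(n_cells):
        position = 0  # index of the last segment used; 0 means germline
        for _ in range(n_r):
            remaining = n_segments - position
            if remaining == 0:  # locus exhausted, no further editing possible
                break
            # rearrange to one of the w most proximal remaining segments
            position += rng.randint(1, min(w, remaining))
            # productive rearrangement followed by positive selection
            if rng.random() < p_r * p_s:
                usage[position] += 1
                break
    total = sum(usage.values())
    return {n: c / total for n, c in sorted(usage.items())}


if __name__ == "__main__":
    freqs = simulate_sliding_window(61, 13)
    for segment, freq in list(freqs.items())[:10]:
        print(segment, round(freq, 4))
```

Because only about 4% of simulated cells are selected with the default parameters, a large `n_cells` is needed for smooth histograms. A fixed-window variant would replace the `rng.randint` call with a draw restricted to predetermined boundaries.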

For simplicity, we also explicitly considered only one allele in deriving the first two probability distributions above. However, positive selection following functional rearrangement of a second chromosome terminates rearrangement of the chromosome of interest. Because the alleles are otherwise independent, we can easily account for both by interpreting *N*_{r} as the number of rearrangements per allele rather than the total. Computer simulations confirmed that, as long as the normalization is treated consistently (see Appendix B), considering one allele with *N*_{r} rearrangements and two alleles with 2*N*_{r} rearrangements yielded identical results to within simulation error (data not shown).

In the derivations above, cell death (due to autoreactivity or neglect) is not considered explicitly for clarity. Accounting for this phenomenon requires subtracting from the numerators of the aggregate probability *b* in Eqs. 5, 7, and 9 products of the form *P*_{r}*P*_{d} where *P*_{d} is the probability of cell death. As mentioned with regard to *P*_{s}, treating *P*_{d} as a constant is expected to be adequate because our focus is on the statistics of the repertoire rather than specific αβ TCR heterodimers.

Lastly, the T early α promoter situated upstream of the J_{α} segments appears to target primary rearrangements to the 5′ end of that locus, but there is also evidence that a second *cis*-regulatory element initiates rearrangements at a point further downstream (13). This possibility can be incorporated into the model by using in place of *g*(*n,k,W*) the weighted average given by Eq. 11, where *P*_{u} is the probability of initiating rearrangement at the upstream targeting element, *W*_{d} is the number of gene segments in the downstream window, and *n*_{d} is the index of the first gene in the downstream window. The downstream promoter targets rearrangements to J_{α}49-J_{α}45 (13), which corresponds to *W*_{d} = 5 and *n*_{d} = 13. In Eq. 11, the first term, which is weighted by *P*_{u}, accounts for the fraction of rearrangements to gene segment *n* starting from the V-proximal end of the J_{α} locus as in the original model. The second term, which is weighted by 1 − *P*_{u}, adjusts for the primary rearrangements to the downstream window. The function *h*(*n,k*) counts these targeting events; it is 1 if *k* = 1 and *n* is inside the downstream window and 0 otherwise. The sum counts paths that start in the downstream window and end at gene segment *n*. The arguments to the function *g* in this case can be understood as follows. The first, *n* − *n*_{d} − *i* − 1, is the number of gene segments between the gene segment of interest (*n*) and the initiation point for rearrangement (the *i*-th gene segment of the downstream window), which corresponds to shifting the gene segment indices to count from the initiation point. The second argument, *k* − 1, is the number of secondary rearrangements, which reflects the fact that the primary rearrangement is already determined. The third argument is simply the window size, which we take to be the same for rearrangements starting from either *cis*-regulatory element. For this same reason, one factor of *W/W*_{d} is necessary to correct for the size of the primary rearrangement window, which enters through the composite probabilities *a* and *b* in Eqs. 5–10.

Based on the areas of the peaks in Fig. 3 of Ref. 13, we estimate *P*_{u} to be ∼0.7. With this choice, we recalculated the curves in Fig. 3 (data not shown) and found that the modified model yields *W*_{J} = 10 gene segments, which is somewhat smaller than our original estimate (more generally, *W*_{J} increases with *P*_{u} until the original model is recovered for *P*_{u} = 1). As a result, secondary rearrangements tend to be shifted upstream slightly; primary rearrangements are shifted downstream slightly due to the second *cis*-regulatory element. Similar agreement with the data is obtained overall. Specifically, the predictions for V_{α}20S1 in Table I and V_{α}6 in Fig. 4*a* are improved (compare with *inset*), whereas those for V_{α}19 in Table I and Fig. 4*b* are a bit poorer (data not shown).

### Choice of parameters

There are five parameters in the model: the V_{α} and J_{α} window sizes (*W*_{V} and *W*_{J}), the probability that rearrangements are productive (*P*_{r}), the probability of positive selection following productive rearrangements (*P*_{s}), and the maximum number of rearrangements per allele (*N*_{r}). To estimate values for *W*_{V} and *W*_{J} from a relatively small amount of data, we assume values for *P*_{r} and *P*_{s}. We take *P*_{r} to be 0.3; this choice is somewhat less than the one-in-three chance that a rearrangement is in-frame to account for pseudogenes and the possibility of generating stop codons (14). There is little information from experiments to guide the choice of *P*_{s}, the probability of selection following productive rearrangement. We expect it to be small because <5% of thymocytes are positively selected (Ref. 15, and references therein). Based on this fact, we take *P*_{s} to be 0.03, which yields an overall selection probability of [1 − (1 − 0.3 × 0.03)^{5}] × 100% = 4.4% given the choices of *P*_{r} and *N*_{r} (discussed below). Nearly identical results were obtained with *P*_{r}*P*_{s} values ranging over an order of magnitude.
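The arithmetic behind the quoted overall selection probability can be reproduced in a few lines of Python. The variable names are chosen for this example, and the calculation assumes, as in the text, that each of the *N*_{r} attempts succeeds independently with probability *P*_{r}*P*_{s}.

```python
p_r = 0.3   # probability that a rearrangement is productive
p_s = 0.03  # probability of positive selection after a productive rearrangement
n_r = 5     # maximum number of rearrangements per allele

per_attempt = p_r * p_s
# probability of being selected in at least one of n_r attempts
overall = 1.0 - (1.0 - per_attempt) ** n_r

print(f"{overall:.1%}")  # prints 4.4%
```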

Data from mice incapable of reintroducing recombination activating gene (RAG) at the double-positive stage suggest that ∼35% of the T cell repertoire is formed by primary α-chain rearrangements (16) (but see Fig. 5 and associated discussion below). Assuming constant *P*_{r} and *P*_{s} and that the number of gene segments is not limiting, it is straightforward to derive a theoretical expression for this percentage (*P*_{1°}). This statistic is not very sensitive to the values of *P*_{r} and *P*_{s}, but it does depend strongly on *N*_{r} since it is in the exponent. Reasonable values for *P*_{1°} in the range 20–34% are obtained with 4 ≤ *N*_{r} ≤ 6. We take *N*_{r} = 5, which is consistent with other estimates (15).

## Results

In our model for TCRα rearrangement, the degree to which a set of gene segments (V or J) is used sequentially is encoded in a single parameter that describes the size of the window of accessible gene segments (Fig. 1). Using this idea, we derive in *Materials and Methods* an analytic expression for the number of possible rearrangement “paths” leading to a selected gene configuration (Eq. 4), as well as probabilities for observing specific V_{α} and J_{α} gene segments and their combinations on selected cells (Eqs. 5–10). Here, we estimate the V_{α} and J_{α} window sizes (*W*_{V} and *W*_{J}, respectively) from the average separations of segments on different alleles in selected cells and then show that, without further adjustment, these parameters yield good agreement with independent experimental data on V_{α} and J_{α} use in the overall TCR repertoire.

### Separation of gene segments on different alleles

To estimate the numbers of V_{α} and J_{α} segments available for rearrangement at any given time, we calculate the average separation between like types of segments on different alleles in selected cells as a function of window size (Fig. 3). Consistent with intuition, the separation between segments on different chromosomes increases monotonically with *W*. In other words, the correlation between alleles decreases as use becomes less sequential.
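The dependence of the average separation on window size can be sketched as follows. The code assumes that the joint weight for segments *n*_{1} and *n*_{2} on the two alleles is Σ_{k} *a* *b*^{k−1} *g*(*n*_{1},*k*,*W*) *g*(*n*_{2},*k*,*W*), with *a* = *P*_{r}^{2}*P*_{s}/*W*^{2} and *b* = [1 − *P*_{r}^{2}*P*_{s} − *P*_{r}(1 − *P*_{r})*P*_{s}]/*W*^{2}, following the description of dual TCR cells in *Materials and Methods*. The function names are illustrative, and the values produced are not claimed to reproduce the published curves.

```python
from math import comb


def count_paths(n: int, k: int, w: int) -> int:
    """Number of paths of k rearrangements ending at segment n with jumps <= w."""
    total = 0
    for m in range(k + 1):
        top = n - m * w - 1
        if top < 0:
            break
        total += (-1) ** m * comb(k, m) * comb(top, k - 1)
    return total


def mean_separation(n_segments, w, p_r=0.3, p_s=0.03, n_r=5):
    """Average |n1 - n2| between like segments on the two alleles of selected cells."""
    a = p_r ** 2 * p_s / w ** 2
    b = (1.0 - p_r ** 2 * p_s - p_r * (1.0 - p_r) * p_s) / w ** 2
    weighted_sep = 0.0
    total = 0.0
    for k in range(1, n_r + 1):
        prefactor = a * b ** (k - 1)
        paths = [count_paths(n, k, w) for n in range(1, n_segments + 1)]
        for i, g1 in enumerate(paths):
            if g1 == 0:
                continue
            for j, g2 in enumerate(paths):
                weight = prefactor * g1 * g2
                weighted_sep += weight * abs(i - j)
                total += weight  # accumulates the normalization factor Q
    return weighted_sep / total


if __name__ == "__main__":
    for w in (1, 5, 10, 13, 20):
        print(w, round(mean_separation(61, w), 2))
```

Scanning `w` and comparing the output with a measured separation is one way a window size could be read off, in the spirit of the procedure described above.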

For the 61 mouse J_{α} genes, the experimentally measured average separation is 7.1 (SD 6.7), and, for the 58 human V_{α} genes, it is 13.8 (SD 9.3) (6). Given these data, we can read the V_{α} and J_{α} window sizes off Fig. 3. These averages and SDs correspond to *W*_{J} = 12 or 15 and *W*_{V} = 38 or 31, respectively; the fact that the averages and SDs yield relatively close estimates for the window sizes suggests that the large SD values observed are inherent to the rearrangement process rather than due to experimental uncertainty.

From the ranges above, we chose *W*_{J} = 13 and *W*_{V} = 37 because the mean generally converges more quickly than the second moment of the distribution and the raw data are quite limited. These values confirm the idea that rearrangement of the J_{α} segments is less random (more sequential) than that of the V_{α} segments. The value *W*_{J} = 13 is also consistent with the observation that mice that lack the T early α promoter are unable to use the 10 most proximal J_{α} segments (5), which suggests a window size of about that number of gene segments.

Due to the limited amount of data, we assume that our values of *W*_{J} and *W*_{V} are common to mice and humans, which appears to be justified given the remarkably good agreement with experiment that we obtain below. Taking the window sizes to be the same in mice and humans also reconciles seemingly conflicting data on V_{α} segment use (M. Krangel, unpublished observation). Because 37 gene segments represent roughly 65% of the human but only 35% of the murine V_{α} segments, the former appear to be used randomly while the latter appear to be used sequentially.

### J_{α} distributions for particular V_{α} genes

We now fix the five parameters in the model at the values estimated above (*P*_{r} = 0.3, *P*_{s} = 0.03, *N*_{r} = 5, *W*_{J} = 13, and *W*_{V} = 37) and compare calculated V_{α}J_{α} pair frequencies with measured ones (8, 17, 18, 19). We begin by considering the data from Ref. 8 for J_{α} segments paired with V_{α}6 (located at the J-proximal end of the V_{α} locus) and V_{α}19 (at the distal end). For these extreme V_{α} segments, the model and experimentally observed probabilities agree well (Fig. 4). The estimated frequency for V_{α}6-J_{α}48 pairs is very sensitive to whether targeting by the *cis*-regulatory element at J_{α}49 is considered because almost all of these pairs derive from primary rearrangements and J_{α}48 falls in the downstream window but not the upstream one. The model predictions compare favorably with the experimental data for V_{α}19 and V_{α}20S1 (located at the proximal end of the locus) studied by Huang and Kanagawa (17) as well (Table I). Again, including the downstream initiation site improves the estimates for pair frequencies involving the proximal V_{α} segment, V_{α}20S1 (data not shown).

Although the same qualitative trends were observed in Ref. 18, the model cannot reproduce the reported J_{α} use in detail because the distributions are not unimodal. However, it is important to note that these data are putatively for individual members of the V_{α}2 superfamily, which are difficult to identify conclusively, as noted by Huang and Kanagawa (17). Additional data for this superfamily were obtained recently for mice with only a single functional J_{α} segment and control animals (19). In this case, reasonable agreement is obtained with calculated frequencies for the model in which the ends of the windows are restricted to fixed points along the locus (see Fig. 2), but the model exhibits a greater bias toward J_{α}-proximal V_{α} segments than was observed (data not shown).

### Regulation of secondary rearrangements

The quasisequential scheme on which our model is based is consistent with the results of knockout experiments directed at elucidating the factors that regulate secondary rearrangements. In mice that are unable to reinduce RAG expression at the double-positive stage because they lack a necessary regulatory element, residual RAG from rearrangement of TCRβ at the double-negative stage catalyzes only limited rearrangement of the TCRα locus. J_{α} use is restricted to the 5′ (proximal) segments in T cells from these mice (16). To mimic these experiments, we performed simulations of our model in which there was a large probability of losing residual RAG and stopping rearrangement following each round of rearrangement. We found that probabilities between 60 and 80% give good agreement with the experimental data (Fig. 5). Although there are data suggesting that the half-life of RAG is short (∼10 min), they were obtained for the destruction of RAG at the G_{1}-S phase of the cell cycle (20); degradation following the formation of the β-chain could be slower. In any event, some RAG must be present in these cells to account for the observed α-chain rearrangements.

## Discussion

In this study, we present a model for TCR gene segment use in which the degree to which rearrangement is sequential is determined by the size of a window of accessible gene segments. The model is based only on this notion and the fact that intervening gene segments are deleted when V_{α} and J_{α} are brought together. We deduced the window sizes from data on the correlations between the two V_{α} or J_{α} genes used in selected cells and showed that, without further adjustment, these parameters yield good agreement between the model and almost all available data on TCRα gene segment use. Although the model cannot reveal the detailed molecular mechanism, it strongly indicates that V_{α} use is quasisequential, which reconciles seemingly conflicting data for mice and humans.

The model is exactly solvable and provides the first expressions that can be used to extract information directly from data on gene segment use in lymphocytes. Previous theoretical studies were limited to simulations (numeric “experiments”) (21, 22) and focused on the use of L chain segments in B cells, in particular J_{κ}. There are only four functional such gene segments, which are used with a slight bias toward the two more proximal segments. By either assigning probabilities to each of the four (21) or varying the ratio of the likelihoods of choosing each gene segment and the one immediately upstream (22), it was found that quasisequential use best explains the available J_{κ} data as well.

One consequence of the mechanism identified is that the most distal V_{α} segments are incapable of pairing with the most proximal J_{α} segments and vice versa, consistent with Refs. 23 and 24. Although we calculate that 70% of mouse V_{α}J_{α} and 58% of human V_{α}J_{α} pairs are expressed at frequencies within an order of magnitude of the uniform distribution (with the differences arising from the fact that the V_{α} locus in humans is roughly half the size of that in mice), ∼4% of V_{α}J_{α} pairs are incapable of forming in both species. Thus, the use of specific V_{α}J_{α} pairings can vary dramatically depending on their chromosome locations, and care must be used in extrapolating statistics from experiments specific to particular TCR, as discussed in Appendix A.

In general, as rearrangement becomes increasingly sequential, there is a tradeoff between diversity and cell conservation in the thymus. Consequently, it is natural to ask whether the diversity of the TCR repertoire is limited significantly by the quasisequential nature of rearrangement. If rearrangement were totally uniform, every V_{α}J_{α} pair would be present in the repertoire with probability 1/(*N*_{V}*N*_{J}), where *N*_{V} and *N*_{J} are the total numbers of V_{α} and J_{α} genes. To compare different mechanisms, we used our model to calculate the SD of pairing frequencies for all possible values of *W*_{V} and *W*_{J} (Fig. 6). A lower SD corresponds to a more uniform distribution. Interestingly, the window sizes estimated from the experimental data (Fig. 3) fall close to the bottom of the very shallow basin around the minimum (Fig. 6). Large SDs are obtained for either a perfectly sequential model (*W*_{V} = 1 and *W*_{J} = 1) or a random deletional one (*W*_{V} = 104 and *W*_{J} = 61) because gene segments from the proximal and distal ends, respectively, tend to be used disproportionately. These calculations lead us to speculate that the quasisequential rearrangement mechanism that we identified evolved to maximize the diversity of the repertoire.

## Acknowledgments

We thank Martin Weigert for helpful discussions and critical reading of the manuscript and Barry Sleckman for providing data on V_{α}2 use in advance of publication.

## Disclosures

The authors have no financial conflict of interest.

## Appendix A

### Corrections for phenotypic allelic inclusion frequencies

Roughly one-quarter of mature αβ T lymphocytes have productive rearrangements at both their TCR α-chain gene loci (genotypic allelic inclusion) (10), but not all of these cells express two Ag receptors on their surfaces (25, 26, 27, 28, 29). Dual TCR cells can thus serve as windows to the mechanisms that regulate T cell surface expression in general and, in turn, how events in the lives of T cells can lead to variations in molecular populations underlying autoimmunity (11). To determine the factors that influence phenotypic allelic exclusion, it is important to be able to quantitate its extent accurately. Unfortunately, only a few mAbs for specific V_{α} protein segments are available, so it is necessary to extrapolate from limited FACS data to estimate the total fraction of mature T cells that express two TCR on their surface. In this Appendix, we use our model for TCR V_{α}J_{α} use together with a brief additional counting argument to improve means for interpreting such experiments.

#### Correction for overcounting

To determine the extent of phenotypic allelic inclusion in a population of T lymphocytes, FACS is used to count the number of cells that bind reagents specific for two different V_{α} protein segments. Typically, these data are then used to calculate a fraction *f*_{ij} (Eq. A1), where *N*_{i} is the number of cells that are V_{α}*i*^{+}, *N*_{ij} is the number of cells that are V_{α}*i*^{+}/V_{α}*j*^{+}, and *N* is the total number of cells. Although often quoted as such (26, 27, 28, 29), *f*_{ij} is not the frequency of phenotypic allelic inclusion in the total population (*F*). Rather, it is an approximation for the fraction of V_{α}*i*^{+} and V_{α}*j*^{+} cells that display two receptors at the cell surface (in other words, *f*_{ij} ≈ *f*_{i} and *f*_{j}, respectively). To relate *f*_{ij} to *F*, it is necessary to avoid double-counting cells that express two different V_{α} protein segments.

The experiments sort cells only according to V_{α}, which effectively averages over J_{α}, CDR, and the β-chain. Consequently, it is not unreasonable to assume that *f*_{ij} ≈ *f*_{i} ≈ *f*_{j} ≈ *f* is essentially the same for all V_{α} (discussed below). Then, denoting the number of V_{α}*i*^{+} cells that express two TCR by *d*_{i}, we can write for each V_{α} protein segment an equation of the form *fN*_{i} = *d*_{i}. Summing over segments,

$$f\sum_{i} N_{i} = \sum_{i} d_{i}. \tag{A2}$$

The sums above count dual TCR cells with different V_{α} twice since *N*_{ij} contributes to *N*_{i}, *N*_{j}, *d*_{i}, and *d*_{j}. Denoting the total number of dual TCR cells by *d* and the number of such cells with the same V_{α} on both alleles by *d*_{s},

$$\sum_{i} N_{i} = N + d - d_{s}, \qquad \sum_{i} d_{i} = 2d - d_{s}. \tag{A3}$$

Substituting the expressions in Eq. A3 into Eq. A2 and solving for *F* = *d/N*, we find

$$F = \frac{f + \left[(1-f)\,d_{s}/N\right]}{2-f} \approx \frac{f}{2-f}. \tag{A4}$$

The second (approximate) equality follows from the fact that *d*_{s} is expected to be much smaller than *N* (using the model described in the main text, *d*_{s}/*N* ≈ 0.003), which allows neglect of the term in square brackets. Eq. A4 thus provides a practical means of estimating the fraction of mature T cells that express two TCR on their surfaces from a measurable quantity (*f* ≈ *f*_{ij}, but see below).
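To make the overcounting correction concrete, the short Python sketch below evaluates the relation obtained by combining *f* Σ*N*_{i} = Σ*d*_{i} with the counting identities Σ*N*_{i} = *N* + *d* − *d*_{s} and Σ*d*_{i} = 2*d* − *d*_{s}, which gives *F* = [*f* + (1 − *f*)*d*_{s}/*N*]/(2 − *f*). These identities are inferred here from the counting argument in the text (each cell is assumed to express at least one detectable V_{α}), and the function name and the input value *f* = 0.2 are hypothetical.

```python
def inclusion_fraction(f: float, ds_over_n: float = 0.0) -> float:
    """Fraction F of all cells expressing two TCR, given the per-V-alpha fraction f.

    ds_over_n is d_s/N, the fraction of cells carrying the same V-alpha
    on both alleles; setting it to 0 gives the approximation F = f / (2 - f).
    """
    return (f + (1.0 - f) * ds_over_n) / (2.0 - f)


if __name__ == "__main__":
    f = 0.2  # hypothetical measured fraction, for illustration only
    approx = inclusion_fraction(f)
    exact = inclusion_fraction(f, ds_over_n=0.003)  # d_s/N value quoted in the text
    print(round(approx, 4), round(exact, 4))
```

For small *f* the correction approaches a factor of one-half, reflecting the fact that each dual TCR cell with two different V_{α} segments is counted twice in the sums.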

#### Correction for biases in gene rearrangement

V_{α} gene segments paired in dual TCR cells are weakly correlated (6). As discussed in the main text, such biases in the repertoire come from quasisequential use of the two sets of V_{α} gene segments at the same rate. In particular, application of Eq. 10 shows that the fraction of dual TCR cells observed depends on the separation of the two V segments detected (Fig. A1). In many cases, *f*_{ij} as computed from Eq. A1 will be a poor approximation for *f*. Here, we show how our model of TCR V_{α}J_{α} rearrangement can be used to mitigate the bias that quasisequential deletional rearrangement introduces to interpretation of measured *f*_{ij}.

To this end, we use the model with the parameters given in the main text to calculate the fraction of dual TCR cells as a function of V_{α} gene segment position and the calculated value of *f*_{ij}:

$$f_{i}^{\mathrm{calc}} = \frac{\sum_{j} P(i,j)}{P(i)}, \qquad f_{ij}^{\mathrm{calc}} = \frac{P(i,j)}{P(i)\,P(j)} \tag{A5}$$

where *P*(*i*,*j*) is given by Eq. 10 and *P*(*i*) is computed as explained in Appendix B. Note that *P*(*i*) includes both single and dual TCR cells. We then scale the measured *f*_{ij} by the ratio of the average value in the model (〈*f*_{i}^{calc}〉, where the average is weighted by the number of V_{α}*i*^{+} cells) to *f*_{ij}^{calc} to obtain a corrected fraction for use in Eq. A4:

$$f_{i} \approx f_{ij}\,\frac{\langle f_{i}^{\mathrm{calc}} \rangle}{f_{ij}^{\mathrm{calc}}}. \tag{A6}$$

Basing this correction on our model for TCR V_{α}J_{α} rearrangement implicitly assumes that the extent of phenotypic allelic inclusion is directly proportional to the extent of genotypic allelic inclusion. Although this cannot be wholly the case, it is not an unreasonable approximation given, as discussed above, the effective averaging over J_{α}, CDR, and the β-chain.
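The scaling step can be sketched numerically. The example below is illustrative only and rests on three assumptions: *f*_{i}^{calc} = Σ_{j}*P*(*i*,*j*)/*P*(*i*), *f*_{ij}^{calc} = *P*(*i*,*j*)/[*P*(*i*)*P*(*j*)], and a correction factor 〈*f*_{i}^{calc}〉/*f*_{ij}^{calc}. The three-segment probabilities are invented for the example and are not output of the rearrangement model; the function name is hypothetical.

```python
def correction_factor(pair, marginal, i, j):
    """Ratio <f_i^calc> / f_ij^calc used to rescale a measured f_ij.

    pair[a][b]  : model probability that a cell displays both segments a and b
    marginal[a] : model probability that a cell displays segment a at all
                  (single and dual TCR cells together)
    """
    n = len(marginal)
    # Fraction of segment-a cells that are dual, summed over partners b != a.
    f_calc = [sum(pair[a][b] for b in range(n) if b != a) / marginal[a]
              for a in range(n)]
    # Average weighted by how many cells use each segment.
    f_mean = sum(m * f for m, f in zip(marginal, f_calc)) / sum(marginal)
    f_ij_calc = pair[i][j] / (marginal[i] * marginal[j])
    return f_mean / f_ij_calc


# Hypothetical three-segment model (not values from the paper).
pair = [[0.00, 0.08, 0.02],
        [0.08, 0.00, 0.05],
        [0.02, 0.05, 0.00]]
marginal = [0.50, 0.38, 0.27]

factor = correction_factor(pair, marginal, 0, 2)
f_corrected = 0.10 * factor  # 0.10 stands in for a measured f_ij
print(f"correction factor = {factor:.2f}, corrected f = {f_corrected:.3f}")
```

A factor greater than 1 indicates a pair of segments that is under-represented among dual TCR cells relative to the model average, so the raw *f*_{ij} underestimates *f*.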

One feature of Eq. A6 that might be confusing to the reader is that *f*_{i} is a property of one gene segment, but *f*_{ij} is clearly a property of two. However, *f*_{ij} as defined by Eq. A1 (26, 27, 28, 29) is an estimate for *f*_{i} ≈ *f*_{j}. Eq. A6 is used to correct for cases when this is a poor estimate due to the separation between gene segments used in the experiment. It is worth noting that the error in *f* could thus be reduced by instead staining for one V_{α} protein segment and CD3 as in Ref. 25. Because the ratio of total TCR to CD3 is roughly constant, cells with a low ratio of CD3 to the V_{α} protein segment studied express only that V_{α}, while those with a high ratio express another V_{α} as well.

The discussion immediately above leads to a related point. Given the dependence of *f*_{ij} on the separation in position of gene segments V_{α}*i* and V_{α}*j*, it is natural to wonder whether the assumption above that *f*_{ij} ≈ *f* is essentially the same for all V_{α} protein segments is appropriate. In deriving Eq. A4, the *f* used represents the actual fraction of dual TCR cells as a function of V_{α} gene segment position, not the fraction estimated from a specific pair of V_{α}*i* and V_{α}*j*. Due to averaging over V_{α}*j*, *f*_{i} calculated with Eq. A5 is much flatter than *P*(*i*,*j*) calculated with Eq. 10 (Fig. A1, *inset*). The assumption that *f*_{i} is essentially the same for all V_{α} is thus reasonable. Moreover, it improves as the number of rearrangements becomes larger, although a kink persists at the transition from primary to secondary rearrangement.

#### Total fraction of phenotypic dual TCR cells

We use the corrections in Eqs. A4 and A6 to re-evaluate previously published data. The reagents used were for the V_{α}2, V_{α}8, and V_{α}11 gene families. In these cases, *P*(*i*,*j*) is the probability that any gene segment from family *i* is paired with any member of family *j*, and *P*(*i*) is the probability that any member of gene family *i* is used. Generally speaking, these gene families are distributed throughout the chromosome; the V_{α}2 family is slightly biased toward the 3′ end, whereas the V_{α}11 family is slightly biased toward the 5′ end. The V_{α}8 family is nearly evenly distributed. Thus, it is to be expected that V_{α}2/V_{α}8 and V_{α}8/V_{α}11 experiments need little correction, while V_{α}2/V_{α}11 experiments significantly underestimate the fraction of dual TCR cells. In agreement with this expectation, the calculated correction factors for experiments with V_{α}2/V_{α}8, V_{α}8/V_{α}11, and V_{α}2/V_{α}11 are 1.02, 1.03, and 1.86, respectively.

Because the experiments overestimate the fraction of dual TCR cells by nearly a factor of 2 due to overcounting, the data from V_{α}2/V_{α}11 experiments remain nearly the same after applying both corrections. However, since no genotypic bias was introduced in V_{α}2/V_{α}8 and V_{α}8/V_{α}11 experiments, there was significant overestimation of the fraction of dual TCR cells (Table AI). The separation of V_{α}2 and V_{α}11 and the consequent biases in their statistics are likely to account for the fact that the same authors generally found a higher fraction of dual TCR cells in V_{α}2/V_{α}8 and V_{α}8/V_{α}11 experiments, with the exception of two aberrant data points (18–21% in Ref. 28; 31.0% in Ref. 27, the latter of which may have been due to gating on CD8). Thus, the corrected statistics suggest that the overall rate of phenotypic allelic inclusion is 2–11%, which supports the idea that posttranslational control mechanisms regulate TCR surface expression (see Refs. 11, 30, and 31 for discussion).
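To see how the two corrections combine, the following sketch applies the correction factors quoted above to a single hypothetical measured value of *f*_{ij}. It assumes the approximate form *F* ≈ *f*/(2 − *f*) for the double-counting correction, with the *d*_{s} term neglected. The measured value of 0.10 is invented for illustration and is not taken from the cited experiments.

```python
# Correction factors quoted in the text for each reagent pair.
CORRECTION = {
    ("Va2", "Va8"): 1.02,
    ("Va8", "Va11"): 1.03,
    ("Va2", "Va11"): 1.86,
}


def corrected_total_fraction(f_ij_measured, factor):
    """Apply the rearrangement-bias factor, then the double-counting step."""
    f = f_ij_measured * factor  # bias-corrected per-segment fraction
    return f / (2.0 - f)        # assumed F ~ f / (2 - f), d_s term neglected


F_MEASURED = 0.10  # hypothetical measured f_ij, used for every pair

results = {pair: corrected_total_fraction(F_MEASURED, c)
           for pair, c in CORRECTION.items()}

for (a, b), total in results.items():
    print(f"{a}/{b}: measured f_ij = {F_MEASURED:.3f} -> F = {total:.3f}")
```

With these inputs the V_{α}2/V_{α}11 value changes little, because the two corrections nearly cancel, whereas the other two pairs are roughly halved.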

## Appendix 2

### Normalization for dual TCR cells

We explicitly considered only one allele in deriving Eq. 6. However, it is important to treat both chromosomes to normalize the probabilities in Eq. A5 consistently. Specifically, in Eq. A5, for the probability of a cell expressing a receptor which uses the *m*^{th} gene segment regardless of whether it is a single or dual TCR cell, we have The first term represents paths in which both alleles have rearranged to the *m*^{th} gene segment and at least one of them is in-frame. The second term represents paths in which only one of the two alleles has rearranged to the *m*^{th} gene segment, in which case that allele must be in-frame and the other can be either in- or out-of-frame. The aggregate probabilities are thus with *b* given by Eq. 9. Finally, we set the normalization *Q* such that

## Footnotes

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1 A.W. was supported by a National Science Foundation graduate research fellowship.

2 Address correspondence and reprint requests to Dr. Aaron R. Dinner, The University of Chicago, Gordon Center for Integrative Science, 929 E. 57th Street, Chicago, IL 60637. E-mail address: dinner{at}uchicago.edu

3 Abbreviation used in this paper: RAG, recombination activating gene.

- Received March 24, 2006.
- Accepted June 29, 2006.

- Copyright © 2006 by The American Association of Immunologists