Abstract
Analysis of somatic mutations in V regions of Ig genes is important for understanding various biological processes. It is customary to estimate Ag selection on Ig genes by assessment of replacement (R) as opposed to silent (S) mutations in the complementarydetermining regions and S as opposed to R mutations in the framework regions. In the past such an evaluation was performed using a binomial distribution model equation, which is inappropriate for Ig genes in which mutations have four different distribution possibilities (R and S mutations in the complementarydetermining region and/or framework regions of the gene). In the present work, we propose a multinomial distribution model for assessment of Ag selection. Sidebyside application of multinomial and binomial models on 86 previously established Ig sequences disclosed 8 discrepancies, leading to opposite statistical conclusions about Ag selection. We suggest the use of the multinomial model for all future analysis of Ag selection.
Functional Ig genes are created by an ordered process of gene rearrangement. The large diversity of the primary Ab repertoire, which is independent of prior exposure to Ag, is achieved by combinatorial permutation of the Ig heavy and light chain V, D, and J segments and by addition or deletion of short coding sequences at the VD and DJ joints in heavy chains and VJ joints in the light chains. Following Ag encounter, the affinity of the Ab for the Ag increases during a process of affinity maturation (1). Affinity maturation results from a combination of somatic hypermutation of the rearranged V segments and Ag selection of mutants with improved binding properties. This process leads to preferential accumulation of replacement (R)^{3} as opposed to silent (S) mutations in the complementarydetermining regions (CDRs), which form the Ag binding sites. Concomitantly, S as opposed to R mutations, tend to cluster in the framework regions (FRs), which are required to maintain structural integrity. Initial estimates of Ag selection assumed a random pattern of R and S mutations and assumed they would localize to a region proportional to the relative size of the CDR and FR (2). Therefore, in the case of Ig heavy chain V region, mutations would localize three times more frequently to the FRs than to the CDRs and a CDR:FR ratio >0.3 would indicate Ag selection (3). Subsequently, Shlomchik et al. (4) proposed the use of a binomial distribution model for assessment of Ag selection. However, this method failed to consider the intrinsic properties of the CDR and the FR. The codon compositions of the CDR and the FR have mutational biases, since the CDRs generally consist of codons which are more susceptible to R mutations than those in the FRs. To account for the inherent susceptibility of the CDR and FR to R mutations, Chang and Casali (5) calculated the relative tendencies of the V regions of individual Ig germline genes to accumulate R mutations (Rf), and used these Rf values to estimate the expected frequency of R and S mutations, in particular CDR and FR, for a given total number of mutations. They used a binomial distribution model proposed by Shlomchik et al. (4) to determine the probability that a particular number of R mutations occurred by chance. This model, by definition, is applicable to variables that have only two distribution possibilities, whereas the total Ig mutations have four different distribution possibilities (R and S mutations in CDR and/or FR of the gene), thus requiring application of a multinomial distribution model (6). Moreover, previous methods failed to account for all the statistical possibilities to obtain a certain observed number of R mutations. Their equation consists of a single binomial probability whereas the correct version should consist of a sum of all the binomial probabilities, which include the observed value.
In the present work, we propose a new method for estimation of Ag selection pressure on Ig genes that corrects the pitfalls mentioned above and apply this method to previously published Ig gene sequences.
Materials and Methods
Multinomial distribution model for estimation of excess or scarcity of R mutations in the Ig gene
The probability that an excess or scarcity of R mutations in V_{H} CDR or FR occurred by chance was calculated by a multinomial distribution model (6). The total number of mutations in each V_{H} gene is denoted by n = r_{1} + s_{1} + r_{2} + s_{2}, in which r_{1} and r_{2} are R mutations in the FR and CDR, respectively, and correspondingly, s_{1} and s_{2} are S mutations in the FR and CDR. The theoretical probabilities for r_{1}, s_{1}, r_{2}, and s_{2} mutations are denoted by p_{1}, q_{1}, p_{2}, and q_{2}, respectively. These probabilities were calculated using the following equations: p_{1} = Rf_{FR} × Lr_{FR}; q_{1} = (1− Rf_{FR}) × Lr_{FR}; p_{2} = Rf_{CDR} × (1− Lr_{FR}); and q_{2} = (1− Rf_{CDR}) × (1− Lr_{FR}) in which Lr_{FR} is a relative size of the FR, and Rf_{FR} and Rf_{CDR} are the inherent susceptibility to R mutations of the FRs and CDRs, respectively. Rf_{FR} and Rf_{CDR} were calculated for each of the identified human germline genes and were based on the chance of the occurrence in each codon of an amino acid replacement given any single nucleotide change not resulting in a termination codon.
The probability of observing r_{1} or fewer R mutations in FRs is given by the multinomial tail probability: The sum is taken over values of k ranging from 0 to r_{1} and all combinations of S_{1}, R_{2}, S_{2} such that k + S_{1} + R_{2} + S_{2} = n. To compute the P value of an observed number r_{1}, it is customary to split the probability at r_{1}: P = P(R_{1} < r_{1}) + 0.5 × P(R_{1} = r_{1}). It should be noted that P(R_{1} = r_{1}) = P(R_{1} ≤ r_{1}) − P(R_{1} ≤ r_{1} −1), and P(R_{1} < r_{1}) = P(R_{1} ≤ r_{1}) − P(R_{1} = r_{1}).
The probability of observing r_{2} or more R mutations in CDRs is similarly computed using the following equation: And the P value is computed using the formula: P(R_{2} > r_{2}) + 0.5 × P(R_{2} = r_{2}). For both FR and CDR, we used onesided P values.
There is an approximate method for computing P values for this problem. We calculate the expected number of R mutations in the FR: E = p_{1} × n. Then we compute the standardized deviation: Under the usual Poisson model, this quantity should have approximately a standard normal distribution and can be compared with a normal table. However, we found that this approximation can be quite poor, and hence do not recommend it.
Assessment of the equation
To assess the applicability of this equation and to compare the results obtained by this method to those previously reported by application of the Chang and Casali equation (5), we evaluated Ag selection pressure on 7 autoantibodies evaluated by Chang and Casali (5), 24 autoantibodies and lymphomaderived V_{H} genes randomly selected from previously published articles (7, 8, 9, 10, 11, 12), and 55 V_{H} gene sequences derived from diffuse large B cell lymphoma cases established in our laboratory (13). For this comparison, we used the Rf_{FR} and Rf_{CDR} values implied in these articles. Recalculation of these values resulted in slightly different Rf values for some of the V_{H} genes. The discrepancies most probably result from the use of slightly different germline sequences before the final sequence of the V_{H} gene locus was established and from the use of sequences in which there are polymorphic variations. For the future calculation of the Rf values, we suggest using our JAVA applet, available at http://wwwstat.stanford.edu/imuunoglobin, which calculates the Rf values for imported germline sequences.
Results
A total of 86 V_{H} gene sequences were analyzed using the multinomial distribution model equation for the presence of Ag selection as demonstrated by the conservation of the FR sequence and/or excess of R mutations in CDR (Table I⇓). These results were compared with the results obtained by the binomial distribution equation suggested by Chang and Casali (5). A total of eight discrepancies leading to opposite statistical conclusions were observed. These included six V_{H} gene sequences in which an excess of R mutations in the CDR (five sequences) or a scarcity of R mutations in the FR (one sequence) were suggested by the binomial distribution equation, but rejected by the multinomial distribution model equation. In an another two V_{H} gene sequences, evidence for scarcity of R mutations in the FR was obtained using the multinomial but not by the binomial distribution model equation. In the majority of the remaining V_{H} gene sequences, the P values obtained using the two equations differed in magnitude but did not lead to discrepant statistical conclusions. The similarity cannot be explained mathematically, but is quite fortuitous. There is no guarantee in general that the binomial formula will give a good approximation. The multinomial distribution model equation suggested an excess and/or a scarcity of R mutations in the CDR and FR, respectively, in 13 of the 14 V_{H} gene sequences derived from highaffinity autoantibodies. By contrast, the binomial distribution suggested Ag selection in only 11 of these sequences. One of the autoantibody sequences did not fulfill the statistical criteria for Ag selection by either of the analytical models.
Discussion
Analysis of somatic mutations in V regions of Ig genes is important for studying the evolution of the Ab response, for assessment of the molecular features of autoantibodies, and for the investigation of lymphoma pathogenesis. Analysis of mutations in V_{H} genes can provide insights regarding the role of Ag before or during lymphoma clonal outgrowth. In the absence of Agpositive or negative selective pressure on Ig V regions, a random mutational process would result in an even distribution of R and S mutations throughout the coding sequence. However, Agselected Abs demonstrated a higher frequency of R mutations in CDRs than in FRs, whereas preservation of a functional Ig molecule is associated with a higher frequency of S mutations and scarcity of R mutations in FRs. In the past, such an evaluation was performed using the binomial distribution model equation, as suggested by Shlomchik et al. (4) and further modified by Chang and Casali (5). The necessity for such an evaluation and its wide usage are demonstrated by the fact that the Chang and Casali article was cited 130 times (The Web of Science^{SM} on the Internet). However, their formula is incorrect and application of the binomial distribution model to variables (mutations) that have more than two distribution possibilities is incorrect. It should be viewed as the application of an improper statistical method for the data analysis, similar, for example, to the use of parametric statistical methods for the analysis of the nonparametric variables. Moreover, Chang and Casali (5) considered in their equation a single binomial probability, while correct statistical analysis should consist of a sum of all the observed probabilities, as is proposed in the new method presented herein. Consequently, incorrect biological conclusions may have been reached, as indeed had happened in eight tested V_{H} gene sequences (Table I⇑). Fortuitously, the magnitude of the observed difference between the two statistical methods in the present study was relatively small. However, we would argue that a proper statistical method should require the application of a multinomial distribution equation for all future estimates of Ag selection.
In the present work, we propose a new statistical method for estimation of Ag selection. It corrects the pitfalls present in the previous method while still taking into account the inherent susceptibility of the codons of the CDR and the FR to R mutations. One consideration not addressed here is the known propensity for certain positions to mutate—hot spots (14). Our equation assumes that mutations in V_{H} genes occur randomly, thus disregarding the possible contribution from intrinsic biases in the hypermutation mechanisms due to the presence of mutational hot spots. Since the hot spots are located in CDRs but not in FRs, the assumption that mutations in FRs occur randomly is absolutely correct. Regarding the CDRs, to consider mutational hot spots, one would need to know all the hot spots in each V_{H} gene sequence and their relative propensity to undergo mutations in comparison to each remaining nonhot spot codon in the sequence. Consideration of these hot spots may require a custom equation for each V_{H} gene sequence, thus precluding its wide applicability. Until the data required to construct the custom equation for each germline V_{H} gene sequence exists, our model can provide good approximation of the Ag selection on Ig genes.
In conclusion, we suggest the use of the multinomial model for all future analysis of Ag selection. The investigators should compare the tested Ig gene sequence to the most similar germline sequence, with particular attention to the presence of known polymorphic variants. The JAVA applet for computing the multinomial P values and Rf values of CDRs and FRs is available at http://wwwstat.stanford.edu/immunoglobin. Usage of this applet will allow uniform analysis of Ig sequences and prevent possible errors that may occur while calculating the Rf values.
Footnotes

↵1 This study was supported by Grants CA33399 and CA34233 from the U.S. Public Health ServiceNational Institutes of Health. I.S.L. is a Harold Dobbs Oncology Fellow. R.L. is an American Cancer Society Clinical Research Professor.

↵2 Address correspondence and reprint requests to Dr. Ronald Levy, Division of Oncology CCSR 1126, Stanford University School of Medicine, Stanford, CA 943055306. Email address: levy{at}leland.stanford.edu

↵3 Abbreviations used in this paper: R, replacement; S, silent; CDR, complementarydetermining region; FR, framework region.
 Received December 16, 1999.
 Accepted August 9, 2000.
 Copyright © 2000 by The American Association of Immunologists