Abstract
We have compared the microsequence specificity of mutations introduced during somatic hypermutation (SH) and those introduced meiotically during neutral evolution. We have minimized the effects of selection by studying nonproductive (hence unselected) Ig V region genes for somatic mutations and processed pseudogenes for meiotic mutations. We find that the two sets of patterns are very similar: the mutabilities of nucleotide triplets are positively correlated between the somatic and meiotic sets. The major differences that do exist fall into three distinct categories: 1) The mutability is sharply higher at CG dinucleotides under meiotic but not somatic mutation. 2) The complementary triplets AGC and GCT are much more mutable under somatic than under meiotic mutation. 3) Triplets of the form WAN (W = T or A) are uniformly more mutable under somatic than under meiotic mutation. Nevertheless, the relative mutabilities both within this set and within the SAN (S = G or C) triplets are highly correlated with those under meiotic mutation. We also find that the somatic triplet specificity is strongly symmetric under strand exchange for A/T triplets as well as for G/C triplets in spite of the strong predominance of A over T mutations. Thus, we suggest that somatic mutation has at least two distinct components: one that specifically targets AGC/GCT triplets and another that acts as true catalysis of meiotic mutation.
During the first 2 wk of infection, the primary Ig repertoire is diversified by a hypermutation mechanism that introduces mutations at a rate ∼6 orders of magnitude above background (1). Although some properties of somatic hypermutation (SH)^{3} have been well characterized, the mechanism by which Ig DNA is modified remains unknown and the molecules involved unidentified. Many different models have been proposed, including those involving gene conversion (2), reverse transcription (3), asymmetric errorprone replication (4, 5), errorprone repair (6), transcriptioncoupled repair (7), and strandbreak repair (8) but none has yet proven convincing. Recent attempts to implicate specific gene products known to be involved in DNA metabolism using knockout mice have produced largely negative results (9, 10, 11, 12, 13) or have shown small effects (14, 15, 16). Similarly, studies involving human patients with identified DNA metabolism deficiencies (17, 18, 19) had negative results (for review, see 20).
Examination of the mutations introduced during SH has led to the formulation of complicated models involving multiple targeting mechanisms, including different mutators for AT and GC bp and multiple stages of processing (8, 16, 21, 22, 23, 24, 25, 26).
It has been recognized that SH exhibits microsequence dependence in both its targeting (27) and spectra (25). Similar microsequence dependence of mutation frequency and spectra has been shown to occur during neutral evolution (28, 29). The purpose of the present study was to investigate the relationships between the mechanisms underlying the accumulation of mutations during germline evolution and those accumulated during SH by comparing the characteristics of mutation targeting and spectra under meiotic mutation and under SH. A previous study (30) found differences in the T:A to C:G transition frequency and in the mutability^{4} of G between SH and meiotic processes and thus concluded that the mechanism introducing somatic mutations is different from that responsible for germline evolution. We have shown previously that the spectra of SH and meiotic mutation are different (25). We are here undertaking a more comprehensive study that might reveal similarities undetectable in previous studies and further characterize the differences.
Materials and Methods
DNA sequence data: SH
We collected a data set comprised of 1721 mutations accumulated in nonfunctionally rearranged human Ig genes, murine 3′ Ig Vflanking region DNA, and murine J–C intron DNA (31, 32, 33, 34, 35, 36, 37, 38). In all cases, the germline sequence is known; mutations were identified by comparison of each sequence with its corresponding germline sequence. Insertions and deletions were not treated in our analysis. Further details regarding this sequence collection can be found elsewhere (25, 31).
DNA sequence data: meiotic mutations
We collected a set of processed human pseudogenes by searching GenBank, release 111.0. Processed pseudogenes result from reverse transcription of mRNA from functional genes and the integration of the reverse transcribed DNA into new chromosomal positions. These pseudogenes are usually integrated far from the parent gene and are therefore not transcribed and do not participate in gene conversion events (28, 39, 40). We then used a locally built version of the BLASTALL algorithm from National Center for Biotechnology Information to search the primate DNA database for sequences with homology to the processed pseudogenes. Only the pseudogenes for which the functional ortholog was unambiguously identified were kept for further analysis. When multiple pseudogenes of the same gene were available, we only used one in the analysis. We searched GenBank (using the BLAST program) for an ortholog of each gene in a species other than Homo sapiens. The accession of numbers of the genes in the final data set are given in Table I⇓. Each group of two functional genes and a processed pseudogene we subjected to sequence alignment using the ClustalW program (http://www2.ebi.ac.uk/clustalw). From the obtained alignments, we inferred the state in the ancestor of the human gene and processed pseudogene at each nucleotide position according to the following rules (41): wherever the two human genes agreed, we assumed that they carry the ancestral state; where they did not agree, we turned to the second ortholog. If this ortholog agreed with any of the human genes, the ancestral state was assumed to be the one carried by two of the three genes. If the nucleotide was different in all three genes, we declared the ancestral state ambiguous and excluded that nucleotide position from the analysis. We also discarded positions where an insertion or deletion was identified in any of the three genes.
Having identified the ancestral state, we then traversed the alignment and counted the number of occurrences of each of the 64 nucleotide triplets in the ancestral gene, as well as the number of instances in which the pseudogene carried a mutation at the central nucleotide of a triplet.
A given number of mutations in a triplet in a given pseudogene is the result of its intrinsic propensity to mutate as well as the divergence time between the gene and the pseudogene. A pseudogene may have a high mutation count because it contains highly mutable triplets or because it is very old. To account for these factors, we determined the relative age of the genes and adjusted the total triplet count in each pseudogene by the relative age of the pseudogene (see below).
The pooled mutation and adjusted total counts were used in the study of strand symmetry of the mutational mechanism and of the potential relation between triplet targeting in somatic and meiotic mutation. There were 2,261 mutations in 53,479 triplets.
Statistical models and methods
Our analyses are based on models for the acquisition of mutations in which the mutability of a given nucleotide depends on the microsequence motif that contains it. We consider two motif sizes: singlets and triplets. Models based on singlets account only for the identity of the target nucleotide itself, i.e., whether it is A, G, C, or T. Models based on triplets account for the identity of the target nucleotide and its immediate neighbors. In other words, we consider the mutability of XYZ where the target nucleotide Y is flanked by nucleotide X (5′) and nucleotide Z (3′).
Every nucleotide in the database is characterized by three factors: the type of mutation to which it has been exposed (somatic or meiotic), the sequence in which it is located, and the motif in which it is found. Each nucleotide, therefore, has probability p_{ijk} of being mutated, where the indices i, j, and k identify the mutational set, sequence number within the set, and motif, respectively. This probability is modeled as: where θ_{ij} is the effective time of exposure to mutation, or age, of the jth sequence in the ith set and μ_{ik} is the mutability of the kth motif under the mutational process i. Although the times θ are not of interest to us, it is necessary to include them in the model for consistent comparison among sequences from different sources and for consistent pooling of data from diverse sources. We denote the total nucleotide count in class (i, j, k) by n_{ijk} and the number of mutations among those by m_{ijk}. Our analyses are based on the likelihood model given by Specific hypotheses within the context of this model are expressed as constraints on the mutabilities μ. For example, the hypothesis that the mutabilities under meiotic and somatic processes are the same is expressed as μ_{1k} = μ_{2k}. The parameters θ and μ were estimated by maximizing the log likelihood, Eq. 1, subject to the constraints for the hypothesis under consideration and to the identifiability constraint on θ: ∑_{jk}θ_{ij}n_{ijk} = ∑_{jk}n_{ijk}, for both i. This constraint ensures that the mean “time of exposure” is normalized between sets.
Analyses using contingency tables or correlation tests (where counts over all sequences in a set are needed) were performed using pooled counts derived from the likelihood model and adjusted as follows. The total counts (mutated plus unmutated) for each motif, denoted ñ_{i · k}, are adjusted for consistent estimation: ñ_{i · k} = ∑_{j}θ̂_{ij}n_{ijk}, where θ̂_{ij} is the maximum likelihood estimate for the effective time of exposure, θ_{ij}.
We applied correlation tests designed to infer the correlation coefficient among the binomial parameters (proportions or probabilities) that underlie our count data. The data themselves also have binomial sampling variability, which is not correlated. Therefore, the task is somewhat more complicated than an ordinary (Pearson) correlation test, which, in addition, assumes normality and equality of variances. We have used two types of estimators: those that are designed to diminish the bias induced by the presence of binomial sampling by accounting for the excess variance and those that do not make this correction. The results of hypothesis testing, where the null hypothesis is that the correlation coefficient is zero, do not depend on this choice, but the numerical value of the estimated correlation coefficient does. All estimators use the fact that the triplets with greater total counts provide more reliable estimates of the underlying binomial parameter and must be weighted more heavily than those with few total counts. See Appendix for the formula defining the estimators.
We carried out the hypothesis testing on these estimators by randomly permuting the triplet labels on one of the sets in the paired data and reporting (as p) the quantile of the real estimated correlation coefficient among the estimators obtained using the permuted data.
Results
Mutation targeting: complementation symmetry
To investigate the presence of strand bias in the mechanisms responsible for introducing mutations, we compared the mutabilities of motifs with those of their complements. The firstorder model, in which mutability depends on the identity of the base itself but not on its neighbors, shows that the somatic set is highly asymmetric, with mutability at A almost twice that at T (Table II⇓). The G:C ratio is not nearly as high as that for A:T but is also significantly different from 1. The meiotic set does not show any evidence of complementation asymmetry. This result holds even when we exclude from the computation the sites that span CG dinucleotides.
For the model in which both neighbors influence the mutability, we performed correlation tests comparing the mutability of a given triplet with that of its complement. Triplets were classified according to their central bases and analyzed separately to remove the clear asymmetry of the singlenucleotide rates.
We find that the correlations between triplets and their complements are extremely high under SH (Fig. 1⇓) but not meiotic mutation. Tests of the correlation coefficient bear this out (Table III⇓). Note, however, that if we include triplets spanning CG dinucleotides in the calculation of correlation coefficients for the germline set, we obtain a significant correlation for this set as well. We obtain similar results when we account for the binomial variance, although the values of the correlation coefficients are (as expected) higher: r = 0.83 (p < 10^{−4}) for the somatic set with AGC/GCT excluded, and r = 0.74 (p = 0.12) for the meiotic set with CG dinucleotidecontaining triplets excluded. The correlation becomes significant for the meiotic set as well if we include these motifs.
Mutation targeting: meiotic/somatic comparison
To compare the microsequence mutability patterns in meiotic and somatic processes, we computed the loglikelihood differences between two models: one is the fully parameterized model in which the mutability for each triplet in each of the two sets is separately estimated, for a total of 128 mutability parameters (plus age parameters; see Materials and Methods). In the second model, all triplet mutabilities are assumed to be identical between the somatic and meiotic sets. The age parameters are still assumed independent and take up any differences in overall mutation rate.
Each nucleotide triplet contributes a term to the loglikelihood difference; the larger the term, the more poorly the assumption of equality between somatic and meiotic data sets accommodates that triplet (Fig. 2⇓). We find that almost threefourths of the loglikelihood difference is due to the following triplets (or motifs): triplets containing CG dinucleotides, AGC, and its complement GCT, and triplets of the form WAN, where W is T or A, N is any nucleotide. We estimated the contributions of each of these classes by amending the model to recognize the appropriate number of triplet classes. For example, to estimate the contribution of CG dinucleotides, the amended model recognizes two classes of triplets: those containing CG dinucleotides and those that do not. All of the triplets within a class are constrained to have the same ratio of somatic mutability to meiotic mutability. Each of the above classes therefore uses 1 df. The increase in log likelihood produced by the serial inclusion of each of these classes is: NCG/CGN, 115.5; AGC/GCT, 40.8; WAN, 49.7, out of a total likelihood difference (largest minus smallest model) of 291.3 (63 df). In sum, these 3 df (of 63) account for 206 of the total 291.3 loglikelihood difference.
Scatterplot comparisons of triplet mutabilities between somatic and meiotic data sets largely corroborate these results and provide additional insights; the correlation between somatic and meiotic mutabilities stands out quite clearly (Fig. 3⇓). For central nucleotide A, when triplets are grouped as above with WAN and SAN (S = G or C), the withingroups correlation stands out strongly. The observed patterns are further confirmed by computation of the linear correlation coefficients (Table IV⇓). These were performed both for the complete data sets and as modified by the above considerations to remove the effects of those triplets that are clearly involved in processes unique to one set or the other and without taking into account binomial sampling variance. If we account for the binomial sampling, the estimated correlation coefficients become higher, but the p values are similar: r = 0.73(0.0004) and r = 0.55(0.01), depending on whether we do or we do not divide the triplets with A as the central nucleotide into WAN and SAN classes. Inspection of Fig. 3⇓ also reveals that, consistent with our findings of complementation symmetry, the triplets NTW, complementary to WAN, also have mutability higher than the triplets NTS. The effect is not as marked for T as it is for A, but this may be due to the smaller number of mutations at T.
Mutation spectrum: complementation symmetry
We tested the complementation symmetry of the mutation spectrum conditioned only on the identity of the mutating base. For both the somatic and meiotic data, we constructed 2 × 2 × 3 contingency tables with mutating base classified as purine/pyrimidine and weak/strong, and resulting nucleotide as the transition partner, complement or transition partner’s complement (31), and tested for independence of the purine/pyrimidine classification and the resulting nucleotide (complementation symmetry). Both χ^{2} tests failed to provide any evidence for departures from complementation symmetry (meiotic: χ^{2} = 7.53, if we do not include mutations at CG dinucleotides and 8.20 if we do; somatic: χ^{2} = 6.14; none of these values is significant at the 0.05 level).
The microsequence dependence of the spectrum under somatic hypermutation is symmetric: the estimated common correlation coefficient for the rate of transitions and of transversions to the complement of the mutating base between a triplet and its complement is r = 0.43 (p = 0.001). This result also holds if we do not include the triplets that span CG dinucleotides; these triplets are extremely rare and their mutation counts are also very low. For the meiotic set, the estimated correlation coefficient with CG dinucleotides excluded is r = 0.23 (p = 0.12). Similar to what we observed in mutational targeting, if we include CG dinucleotides, the spectrum becomes symmetric in the meiotic case as well (r = 0.36, p = 0.003).
Mutation spectrum: meiotic/somatic comparison
When represented in terms relative to the mutating base, the mutation spectrum is strikingly consistent regardless of which base is mutating, for both meiotic and somatic processes (Fig. 4⇓). The spectra are not the same between somatic and meiotic processes however (Fig. 4⇓). Direct test of the spectrum conditional on the mutating base only shows very strong differences between meiotic and somatic mutation (χ^{2} = 14.42 (A), 35.68 (G), 22.02 (T), and 7.82 (C); with the exception of C, all other values are significant at the 0.01 level).
The correlation coefficient between somatic and meiotic sets (computed as the combined triplet correlations as above for the symmetry tests) is not significantly different from zero (r = −0.03, p = 0.78).
Discussion
We compared the characteristics of mutations introduced by SH to those introduced meiotically. To ensure that observed characteristics are due to the mutational process itself, we have minimized the effects of selection by choosing, where possible, DNA sequences that are not subject to selection. The SH data are from nonproductively rearranged Ig V genes and from introns flanking rearranged V genes. For the meiotic mutations, we have used processed pseudogenes. For these, we do not completely eliminate selection, since there is uncertainty in assignment of observed nucleotide differences to the pseudogene or its ortholog. We attempt to minimize this uncertainty by considering the state of each nucleotide site in a second ortholog, from a species other than human.
A marked asymmetry between the mutability under SH of thymidine and that of adenine has been noted previously and taken as evidence for strand bias of the hypermutation mechanism (42). We also find a higher mutability at A than at T and that this asymmetry is much greater than any singlet asymmetry under meiotic mutation. But we also find that when this overall mutability difference is factored out, the microsequence specificity at A is very similar to that at T (Fig. 1⇑ and Table III⇑). Similar findings have been reported (23, 24) and used to justify the conclusion that both strands are targeted by SH and that two mechanisms, one strandunbiased mutating G and C and the other strandbiased acting on A and T, operate. We find, however, that the triplet mutabilities are surprisingly complementation symmetric for both A/T and G/C mutations. In fact, once the singlenucleotide mutabilities have been taken into account, the triplet symmetry is evident for SH. The triplet symmetry appears in meiotic mutation depends strongly on whether the triplets that span CG dinucleotides are included in the calculation of the correlation coefficient. Thus, although we also conclude that there are two distinct components of SH targeting, we find that they share similar strand symmetry.
With certain welldefined exceptions, the sequence specificity of mutational targeting underlying meiotic and somatic mutations are significantly correlated. This is quite remarkable since the time scales over which these changes have accrued differ by about 7 orders of magnitude (about 1 mo for SH and on the order of a million years for meiotic mutation). This would be expected if mutations under SH are introduced by catalytic enhancement of the processes responsible for meiotic mutations. Thus, if a major proportion of mutations introduced during evolution occur at strand breaks, then SH hastens the introduction of these breaks, but they are introduced in the same places. In this sense, the reaction resembles true catalysis.
The differences in the triplet mutabilities between somatic and meiotic mutation are largely attributable to three effects: 1) The mutability of triplets containing CG dinucleotides is much higher under meiotic mutation than under SH. The mutability of CG dinucleotides is a wellunderstood consequence of the methylation of such dinucleotides (43). This excess mutability has been seen in studies of pseudogeneortholog pairs (29) and in surveys of genetic lesions associated with human genetic disease (44). 2) The mutability of the triplet AGC and its complement GCT is considerably higher under SH than under meiotic mutation. This is the wellknown serine hot spot (27, 36). 3) The mutability of triplets of the form WAN is higher under SH. The mutabilities of the triplets within each of the two subsets (WAN, SAN) are correlated with those in the meiotic data set. Although the pattern is weaker for T mutating, the complementary triplets NTW also segregate at higher mutability from the triplets NTS and both sets are correlated with the meiotic mutabilities. The overarching similarities between somatic and meiotic mutation targeting, punctuated sharply by specific differences suggests that two components are involved in the targeting: a “background” mechanism that has recruited and modified components of the DNA repair machinery, and a mechanism, perhaps novel, specific to AGC/GCT triplets (see below).
We also investigated the relationships between the mutation spectra under somatic and meiotic mutation. It was previously suggested that the two processes may be related because both result in an excess of transitions over transversions (22). We find, however, that the proportion of transitions is significantly smaller under SH. The effect of this is that the rate of replacement mutations is higher under SH and, consequently, so is the net rate of diversification. Both of these effects are consistent with diversification under SH being advantageous whereas mutations under meiotic mutation presumably are merely unavoidable.
We have previously shown that the mutation spectrum under SH is microsequence dependent: what a nucleotide mutates to is influenced by what its neighbors are (25). We compared this spectrum to that previously inferred from a set of meiotic mutations and found no correlations. That meiotic data set, however, combined information from triplets and their complements; furthermore, the mutations were inferred by a somewhat different process than the one we use here. The more comprehensive comparison here confirms the previous result: although there are significant effects of neighboring nucleotides on the mutation spectrum in both meiotic and somatic processes, the triplet dependencies are uncorrelated.
The following model is consistent with the findings thus far, though it is certainly not uniquely so. An initial lesion is created in the dsDNA. The targeting at this point is symmetric: sense strand XAZ is affected just as frequently as sense strand ZT̄X̄. This occurs naturally if the lesion is a doublestrand break, consistent with the findings of Sale and Neuberger (8). In fact, the complementation symmetry of targeting even suggests a staggered cut. In a blunt cut, the complementary nucleotides are not in equivalent states: one is 3′ of the break and the other is 5′ of it. A staggered cut that also breaks the base pairing leaves the two nucleotides both 5′ or both 3′ of the break, though now on opposite sides of it. Furthermore, both are unpaired and overhanging. Note that now the apparent strand asymmetry can now be viewed as the asymmetry between the DNA 5′ and 3′ of the break. The probability that religation is mutagenic now depends on which side of the break the purine is on, with the probability of mutagenic repair higher if the purine is on the plus strand. This would result if, for example, purines are more susceptible to excision when overhanging and gaps in the plus strand (or 5′ of the doublestranded break) are less likely to be repaired correctly.
Several studies have found reduced mutation rates in mismatch repairdeficient mice (11, 14, 16) and relative enhancement of mutations at the AGC/GCT hot spots (16) or at G and C bases (13, 15). Rada et al. (16) inferred from this observation that the mutator has two components, one that is dependent on the mismatch repair protein MSH2 and another that is MSH2 independent. We concur and suggest that MSH2 is responsible for introducing lesions as described above and leaves the signature of catalytically enhanced meiotic mutation. A second component, as yet unidentified, is targeted specifically at AGC/GCT triplets or at the palindomic quadruplet AGCT (L. G. Cowell and T. B. Kepler, manuscript in preparation), which contains both triplet motifs, and introduces lesions preferentially at these sites. One candidate for the unknown molecule is a modified sitespecific methylase. Other groups have hypothesized the presence of a twocomponent mutator (21, 22, 23, 24), consistent with the observation that G and C are mutated more frequently in the murine cell line 1881 (26) and the Burkitt lymphoma line Ramos (8). Furthermore, the G · Ctargeting component is argued to have arisen first (or been coopted first by SH) (22), consistent with the observations that AGC/GCT or G and C are preferentially targeted in shark (45) and Xenopus (46).
The identity of the molecules involved in somatic hypermutation will surely be revealed soon, but even after their names are known, it will remain to learn how they do what they do. For this task, careful analysis of the mutation patterns will be essential.
Appendix 1
Estimator for the correlation coefficient among binomial proportions
The model underlying the data analysis is that of two sets of mutabilities which are linearly correlated and which give rise to binomial (count) data. The task is to estimate the linear correlation coefficient. The difficulty is that the binomial sampling variability is independent (i.e., uncorrelated); it is only the indirectly observed mutabilities that are correlated. The estimation is as follows.
The adjusted counts for each motif k are designated by n_{ik} where i = 1, 2 is the group index (somatic or meiotic; triplet or complement), and k designates the motif. Similarly, m_{ik} denotes the number of mutated occurrences of motif k in group i. For each of the four nucleotides, the number of triplets is denoted by K. The dot denotes summation over the respective coefficient.
The estimators for the correlation coefficients are computed as: where and
Acknowledgments
We thank Claudia Berek and Latham Claflin for sharing unpublished data.
Footnotes

↵1 Partially supported by National Science Foundation Award MCB 9357637 (to T.B.K.) and by National Institutes of Health Grant AI28433 (to A.P. and M.O.). Portions of the work were done under the auspices of the U.S. Department of Energy. During much of this work T.B.K. and L.G.C. were in the Biomathematics Program, Department of Statistics, North Carolina State University, Raleigh, NC.

↵2 Address correspondence and reprint requests to Dr. Thomas B. Kepler, The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501. Email address: KEPLER{at}SANTAFE.EDU

↵3 Abbreviation used in this paper: SH, somatic hypermutation.

↵4 We use the term “mutability” rather than “mutation rate” to emphasize its role as a property of the DNA sequence itself.
 Received December 8, 1999.
 Copyright © 2001 by The American Association of Immunologists