Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM 87545;
Department of Immunology, Duke University Medical Center, Durham, NC 27710; and
The Santa Fe Institute, Santa Fe, NM 87501
| Abstract |
|---|
| Introduction |
|---|
6 orders of magnitude above background (1). Although some
properties of somatic hypermutation (SH)³ have been well
characterized, the mechanism by which Ig DNA is modified remains
unknown and the molecules involved unidentified. Many different models
have been proposed, including those involving gene conversion
(2), reverse transcription (3), asymmetric
error-prone replication (4, 5), error-prone repair
(6), transcription-coupled repair (7), and
strand-break repair (8) but none has yet proven
convincing. Recent attempts to implicate specific gene products known
to be involved in DNA metabolism using knockout mice have produced
largely negative results (9, 10, 11, 12, 13) or have shown small
effects (14, 15, 16). Similarly, studies involving human
patients with identified DNA metabolism deficiencies
(17, 18, 19) had negative results (for review, see 20). Examination of the mutations introduced during SH has led to the formulation of complicated models involving multiple targeting mechanisms, including different mutators for A-T and G-C bp and multiple stages of processing (8, 16, 21, 22, 23, 24, 25, 26).
It has been recognized that SH exhibits microsequence dependence in both its targeting (27) and its spectra (25). Similar microsequence dependence of mutation frequency and spectra has been shown to occur during neutral evolution (28, 29). The purpose of the present study was to investigate the relationship between the mechanisms underlying the accumulation of mutations during germline evolution and those operating during SH by comparing the characteristics of mutation targeting and spectra under meiotic mutation and under SH. A previous study (30) found differences in the T:A to C:G transition frequency and in the mutability⁴ of G between SH and meiotic processes and concluded that the mechanism introducing somatic mutations is different from that responsible for germline evolution. We have shown previously that the spectra of SH and meiotic mutation are different (25). Here we undertake a more comprehensive study that might reveal similarities undetectable in previous studies and further characterize the differences.
| Materials and Methods |
|---|
DNA sequence data: somatic mutations
We collected a data set comprising 1,721 mutations accumulated in nonfunctionally rearranged human Ig genes, murine 3' Ig V-flanking region DNA, and murine JC intron DNA (31, 32, 33, 34, 35, 36, 37, 38). In all cases, the germline sequence is known; mutations were identified by comparison of each sequence with its corresponding germline sequence. Insertions and deletions were not considered in our analysis. Further details regarding this sequence collection can be found elsewhere (25, 31).
DNA sequence data: meiotic mutations
We collected a set of processed human pseudogenes by searching
GenBank, release 111.0. Processed pseudogenes result from reverse
transcription of mRNA from functional genes and the integration of the
reverse-transcribed DNA into new chromosomal positions. These
pseudogenes are usually integrated far from the parent gene and are
therefore not transcribed and do not participate in gene conversion
events (28, 39, 40). We then used a locally built version
of the BLASTALL algorithm from National Center for Biotechnology
Information to search the primate DNA database for sequences with
homology to the processed pseudogenes. Only the pseudogenes for which
the functional ortholog was unambiguously identified were kept for
further analysis. When multiple pseudogenes of the same gene were
available, we only used one in the analysis. We searched GenBank (using
the BLAST program) for an ortholog of each gene in a species other than
Homo sapiens. The accession numbers of the genes in the
final data set are given in Table I. Each
group of two functional genes and a processed pseudogene was subjected
to sequence alignment using the ClustalW program
(http://www2.ebi.ac.uk/clustalw). From the resulting alignments, we
inferred the state in the ancestor of the human gene and processed
pseudogene at each nucleotide position according to the following rules
(41): wherever the two human sequences agreed, we assumed that
they carry the ancestral state; where they did not agree, we turned to
the second ortholog. If this ortholog agreed with either of the human
sequences, the ancestral state was assumed to be the one carried by two of
the three sequences. If the nucleotide was different in all three sequences, we
declared the ancestral state ambiguous and excluded that nucleotide
position from the analysis. We also discarded positions where an
insertion or deletion was identified in any of the three sequences.
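The majority rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of `-` as the gap character, and the idea of scoring only pseudogene-lineage changes are hypothetical choices, not details taken from the original analysis.

```python
from typing import List, Optional, Tuple

GAP = "-"  # assumed gap symbol in the aligned sequences


def infer_ancestral_state(gene: str, pseudogene: str, outgroup: str) -> Optional[str]:
    """Infer the ancestral nucleotide at one aligned column.

    Returns None when the column must be excluded (gap or three-way disagreement).
    """
    if GAP in (gene, pseudogene, outgroup):
        return None                      # indel in any of the three sequences
    if gene == pseudogene:
        return gene                      # the two human sequences agree
    if outgroup in (gene, pseudogene):
        return outgroup                  # two of the three sequences agree
    return None                          # all three differ: ambiguous


def pseudogene_mutations(gene_aln: str, pseudo_aln: str, out_aln: str) -> List[Tuple[int, str, str]]:
    """List (position, ancestral, derived) for changes on the pseudogene lineage."""
    mutations = []
    for pos, (g, p, o) in enumerate(zip(gene_aln, pseudo_aln, out_aln)):
        ancestral = infer_ancestral_state(g, p, o)
        if ancestral is not None and p != ancestral:
            mutations.append((pos, ancestral, p))
    return mutations
```

Columns where the functional gene, rather than the pseudogene, differs from the inferred ancestor are simply not reported by `pseudogene_mutations`; whether such columns should be handled differently is left open here.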
A given number of mutations in a triplet in a given pseudogene is the result of its intrinsic propensity to mutate as well as the divergence time between the gene and the pseudogene. A pseudogene may have a high mutation count because it contains highly mutable triplets or because it is very old. To account for these factors, we determined the relative age of the genes and adjusted the total triplet count in each pseudogene by the relative age of the pseudogene (see below).
The pooled mutation and adjusted total counts were used in the study of strand symmetry of the mutational mechanism and of the potential relation between triplet targeting in somatic and meiotic mutation. There were 2,261 mutations in 53,479 triplets.
Statistical models and methods
Our analyses are based on models for the acquisition of mutations in which the mutability of a given nucleotide depends on the microsequence motif that contains it. We consider two motif sizes: singlets and triplets. Models based on singlets account only for the identity of the target nucleotide itself, i.e., whether it is A, G, C, or T. Models based on triplets account for the identity of the target nucleotide and its immediate neighbors. In other words, we consider the mutability of XYZ where the target nucleotide Y is flanked by nucleotide X (5') and nucleotide Z (3').
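As a minimal sketch of the two motif sizes, the following Python counts singlets and triplets XYZ in a sequence, with the triplet indexed by its central (target) nucleotide Y. The function names are hypothetical, and skipping motifs that contain non-ACGT symbols is an assumption made for the illustration.

```python
from collections import Counter

BASES = set("ACGT")


def singlet_counts(seq: str) -> Counter:
    """Count each target nucleotide on its own (first-order model)."""
    return Counter(b for b in seq.upper() if b in BASES)


def triplet_counts(seq: str) -> Counter:
    """Count triplets XYZ, where Y is the target and X, Z are its 5' and 3' neighbors."""
    seq = seq.upper()
    counts = Counter()
    for i in range(1, len(seq) - 1):     # terminal bases lack one neighbor
        triplet = seq[i - 1:i + 2]
        if set(triplet) <= BASES:        # skip ambiguous symbols such as N
            counts[triplet] += 1
    return counts
```

Note that the first and last nucleotides of a sequence contribute to the singlet counts but cannot be targets in the triplet counts, because one flanking neighbor is missing.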
Every nucleotide in the database is characterized by three factors: the type of mutation to which it has been exposed (somatic or meiotic), the sequence in which it is located, and the motif in which it is found. Each nucleotide, therefore, has probability $p_{ijk}$ of being mutated, where the indices $i$, $j$, and $k$ identify the mutational set, sequence number within the set, and motif, respectively. This probability is modeled as:

$$p_{ijk} = \tau_{ij}\,\mu_{ik} \tag{1}$$

where $\tau_{ij}$ is the effective time of exposure to mutation, or age, of the $j$th sequence in the $i$th set and $\mu_{ik}$ is the mutability of the $k$th motif under the mutational process $i$. Although the times $\tau_{ij}$ are not of interest to us, it is necessary to include them in the model for consistent comparison among sequences from different sources and for consistent pooling of data from diverse sources. We denote the total nucleotide count in class $(i, j, k)$ by $n_{ijk}$ and the number of mutations among those by $m_{ijk}$. Our analyses are based on the likelihood model given by

$$L(\tau, \mu) = \prod_{i,j,k} \binom{n_{ijk}}{m_{ijk}}\, p_{ijk}^{\,m_{ijk}} \left(1 - p_{ijk}\right)^{n_{ijk} - m_{ijk}} \tag{2}$$

The parameters $\tau$ and $\mu$ were estimated by maximizing the log likelihood (Eq. 2) subject to the constraint $\sum_{j,k} \tau_{ij} n_{ijk} = \sum_{j,k} n_{ijk}$ for both $i$. This constraint ensures that the mean "time of exposure" is normalized between sets.
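The following Python sketch shows how such a likelihood might be evaluated for one mutational set. It assumes the multiplicative form p_ijk = τ_ij µ_ik and a binomial likelihood for the counts; both the function names and the nested-list data layout (`n[j][k]`, `m[j][k]`) are hypothetical, and the constant binomial coefficients are dropped because they do not affect the maximization.

```python
import math
from typing import List, Sequence


def log_likelihood(tau: Sequence[float], mu: Sequence[float],
                   n: Sequence[Sequence[int]], m: Sequence[Sequence[int]]) -> float:
    """Binomial log likelihood (up to a constant) for one set, assuming p = tau_j * mu_k."""
    total = 0.0
    for j, tau_j in enumerate(tau):
        for k, mu_k in enumerate(mu):
            p = tau_j * mu_k
            if not 0.0 < p < 1.0:
                return -math.inf         # parameters outside the valid range
            total += m[j][k] * math.log(p)
            total += (n[j][k] - m[j][k]) * math.log(1.0 - p)
    return total


def normalize_tau(tau: Sequence[float], n: Sequence[Sequence[int]]) -> List[float]:
    """Rescale tau so that sum_jk tau_j * n_jk equals sum_jk n_jk."""
    weighted = sum(t * sum(row) for t, row in zip(tau, n))
    total = sum(sum(row) for row in n)
    return [t * total / weighted for t in tau]
```

Under the assumed multiplicative form, rescaling τ by a constant can be compensated by dividing µ by the same constant, which is why a normalization of this kind is needed to make the two parameter sets identifiable.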
Analyses using contingency tables or correlation tests (where counts over all sequences in a set are needed) were performed using pooled counts derived from the likelihood model and adjusted as follows. The total counts (mutated plus unmutated) for each motif, denoted $\tilde{n}_{i \cdot k}$, are adjusted for consistent estimation: $\tilde{n}_{i \cdot k} = \sum_j \hat{\tau}_{ij} n_{ijk}$, where $\hat{\tau}_{ij}$ is the maximum likelihood estimate for the effective time of exposure, $\tau_{ij}$.
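The pooling step can be illustrated as follows for a single mutational set: total counts are weighted by the estimated exposure time of each sequence before summing over sequences, while mutation counts are summed directly. The function name and the list-of-lists layout are assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple


def adjusted_pooled_counts(tau_hat: Sequence[float],
                           n: Sequence[Sequence[int]],
                           m: Sequence[Sequence[int]]) -> Tuple[List[float], List[int]]:
    """Pool counts over sequences j for each motif k.

    n[j][k] and m[j][k] are total and mutated counts; tau_hat[j] is the
    estimated exposure time of sequence j.
    """
    n_seqs, n_motifs = len(n), len(n[0])
    n_tilde = [sum(tau_hat[j] * n[j][k] for j in range(n_seqs)) for k in range(n_motifs)]
    m_pooled = [sum(m[j][k] for j in range(n_seqs)) for k in range(n_motifs)]
    return n_tilde, m_pooled
```

With these quantities, the ratio of pooled mutations to adjusted total count for a motif gives a pooled mutability estimate in which older sequences do not inflate the apparent mutability of the motifs they happen to contain.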
We applied correlation tests designed to infer the correlation coefficient among the binomial parameters (proportions or probabilities) that underlie our count data. The data themselves also have binomial sampling variability, which is not correlated. Therefore, the task is somewhat more complicated than an ordinary (Pearson) correlation test, which, in addition, assumes normality and equality of variances. We have used two types of estimators: those that are designed to diminish the bias induced by the presence of binomial sampling by accounting for the excess variance and those that do not make this correction. The results of hypothesis testing, where the null hypothesis is that the correlation coefficient is zero, do not depend on this choice, but the numerical value of the estimated correlation coefficient does. All estimators use the fact that the triplets with greater total counts provide more reliable estimates of the underlying binomial parameter and must be weighted more heavily than those with few total counts. See Appendix for the formula defining the estimators.
We carried out the hypothesis testing on these estimators by randomly permuting the triplet labels on one of the sets in the paired data and reporting (as p) the quantile of the real estimated correlation coefficient among the estimators obtained using the permuted data.
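A permutation test of this kind might look like the sketch below. The weighted Pearson correlation used here is a stand-in chosen for illustration, not the estimator defined in the Appendix: the weights `n1*n2/(n1+n2)` are an assumption, no correction for binomial sampling variance is made, and the one-sided p value is simply the fraction of permutations whose correlation is at least as large as the observed one.

```python
import random
from typing import Sequence, Tuple


def weighted_correlation(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / (vx * vy) ** 0.5


def mutability_correlation(m1, n1, m2, n2) -> float:
    """Correlate observed proportions m/n, weighting motifs with larger counts more."""
    x = [m / n for m, n in zip(m1, n1)]
    y = [m / n for m, n in zip(m2, n2)]
    w = [a * b / (a + b) for a, b in zip(n1, n2)]   # assumed weighting scheme
    return weighted_correlation(x, y, w)


def permutation_p_value(m1, n1, m2, n2, n_perm: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Permute the motif labels of the second set and locate the observed value."""
    rng = random.Random(seed)
    observed = mutability_correlation(m1, n1, m2, n2)
    pairs = list(zip(m2, n2))            # keep each (m, n) pair together
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pairs)
        pm, pn = zip(*pairs)
        if mutability_correlation(m1, n1, pm, pn) >= observed:
            hits += 1
    return observed, hits / n_perm
```

Shuffling the (mutated, total) pairs together preserves each motif's own sampling precision while breaking the pairing between the two sets, which is the null hypothesis being tested.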
| Results |
|---|
To investigate the presence of strand bias in the mechanisms
responsible for introducing mutations, we compared the mutabilities of
motifs with those of their complements. The first-order model, in which
mutability depends on the identity of the base itself but not on its
neighbors, shows that the somatic set is highly asymmetric, with
mutability at A almost twice that at T (Table II). The G:C ratio is not nearly as high
as that for A:T but is also significantly different from 1. The meiotic
set does not show any evidence of complementation asymmetry. This
result holds even when we exclude from the computation the sites that
span CG dinucleotides.
We find that the correlations between triplets and their complements are extremely high under SH (Fig. 1) but not under meiotic mutation. Tests of the correlation coefficient bear this out (Table III). Note, however, that if we include triplets spanning CG dinucleotides in the calculation of correlation coefficients for the germline set, we obtain a significant correlation for this set as well. We obtain similar results when we account for the binomial variance, although the values of the correlation coefficients are (as expected) higher: r = 0.83 (p < 10⁻⁴) for the somatic set with AGC/GCT excluded, and r = 0.74 (p = 0.12) for the meiotic set with CG dinucleotide-containing triplets excluded. The correlation becomes significant for the meiotic set as well if we include these motifs.
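Pairing each triplet with its complement on the opposite strand can be sketched as follows. The sketch assumes that "complement" means the reverse complement read 5' to 3' (so that AGC pairs with GCT); the function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(motif: str) -> str:
    """Return the motif as read 5' to 3' on the opposite strand."""
    return "".join(COMPLEMENT[b] for b in reversed(motif))


def complement_pairs() -> List[Tuple[str, str]]:
    """Group the 64 triplets into pairs (triplet, reverse complement)."""
    pairs, seen = [], set()
    for triplet in ("".join(p) for p in product("ACGT", repeat=3)):
        if triplet not in seen:
            partner = reverse_complement(triplet)
            pairs.append((triplet, partner))
            seen.update((triplet, partner))
    return pairs
```

Because the central base of a triplet is never its own complement, no triplet pairs with itself, and the 64 triplets fall into 32 pairs whose mutabilities can then be compared.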
To compare the microsequence mutability patterns in meiotic and somatic processes, we computed the log-likelihood differences between two models: one is the fully parameterized model in which the mutability for each triplet in each of the two sets is separately estimated, for a total of 128 mutability parameters (plus age parameters; see Materials and Methods). In the second model, all triplet mutabilities are assumed to be identical between the somatic and meiotic sets. The age parameters are still assumed independent and take up any differences in overall mutation rate.
Each nucleotide triplet contributes a term to the log-likelihood
difference; the larger the term, the more poorly the assumption of
equality between somatic and meiotic data sets accommodates that
triplet (Fig. 2). We find that almost
three-fourths of the log-likelihood difference is due to the following
triplets (or motifs): triplets containing CG dinucleotides, AGC, and
its complement GCT, and triplets of the form WAN, where W is T or A, N
is any nucleotide. We estimated the contributions of each of these
classes by amending the model to recognize the appropriate number of
triplet classes. For example, to estimate the contribution of CG
dinucleotides, the amended model recognizes two classes of triplets:
those containing CG dinucleotides and those that do not. All of the
triplets within a class are constrained to have the same ratio of
somatic mutability to meiotic mutability. Each of the above classes
therefore uses 1 df. The increase in log likelihood produced by the
serial inclusion of each of these classes is: NCG/CGN, 115.5; AGC/GCT,
40.8; WAN, 49.7, out of a total likelihood difference (largest minus
smallest model) of 291.3 (63 df). In sum, these 3 df (of 63) account
for 206 of the total 291.3 log-likelihood difference.
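The three triplet classes and the accounting of the log-likelihood gains can be checked with a short sketch. The classifier below is a hypothetical helper written for illustration; the numerical values are those quoted in the text.

```python
from collections import Counter
from itertools import product


def triplet_class(triplet: str) -> str:
    """Assign a triplet to one of the classes named in the text."""
    if "CG" in triplet:
        return "NCG/CGN"                 # spans a CG dinucleotide
    if triplet in ("AGC", "GCT"):
        return "AGC/GCT"
    if triplet[0] in "AT" and triplet[1] == "A":
        return "WAN"                     # W is A or T, target is A, N is any base
    return "other"


class_sizes = Counter(triplet_class("".join(p)) for p in product("ACGT", repeat=3))

# Log-likelihood gains reported for the serial inclusion of each class.
gains = {"NCG/CGN": 115.5, "AGC/GCT": 40.8, "WAN": 49.7}
explained = sum(gains.values())
fraction = explained / 291.3             # share of the total difference (63 df)
```

Here `class_sizes` shows that the three classes cover 18 of the 64 triplets, and `fraction` comes to roughly 0.71, in line with the statement that almost three-fourths of the difference is explained by 3 of the 63 df.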
We tested the complementation symmetry of the mutation spectrum conditioned only on the identity of the mutating base. For both the somatic and meiotic data, we constructed 2 × 2 × 3 contingency tables with the mutating base classified as purine/pyrimidine and weak/strong, and the resulting nucleotide as the transition partner, the complement, or the transition partner's complement (31), and tested for independence of the purine/pyrimidine classification and the resulting nucleotide (complementation symmetry). Both χ² tests failed to provide any evidence for departures from complementation symmetry (meiotic: χ² = 7.53 if we do not include mutations at CG dinucleotides and 8.20 if we do; somatic: χ² = 6.14; none of these values is significant at the 0.05 level).
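For readers unfamiliar with the test statistic, the sketch below computes a Pearson χ² statistic for a simple two-way table. It is a simplification: the analysis described above uses 2 × 2 × 3 tables, whereas this hypothetical helper handles a single two-way layer (for example, purine/pyrimidine by resulting nucleotide), and any counts passed to it in the examples are made up for illustration.

```python
from typing import Sequence, Tuple


def pearson_chi_square(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total   # under independence
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df
```

A table whose rows are proportional to each other gives a statistic of zero, which corresponds to perfect complementation symmetry in this simplified setting.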
The microsequence dependence of the spectrum under somatic hypermutation is symmetric: the estimated common correlation coefficient for the rate of transitions and of transversions to the complement of the mutating base between a triplet and its complement is r = 0.43 (p = 0.001). This result also holds if we do not include the triplets that span CG dinucleotides; these triplets are extremely rare and their mutation counts are also very low. For the meiotic set, the estimated correlation coefficient with CG dinucleotides excluded is r = 0.23 (p = 0.12). Similar to what we observed in mutational targeting, if we include CG dinucleotides, the spectrum becomes symmetric in the meiotic case as well (r = 0.36, p = 0.003).
Mutation spectrum: meiotic/somatic comparison
When represented in terms relative to the mutating base, the mutation spectrum is strikingly consistent regardless of which base is mutating, for both meiotic and somatic processes (Fig. 4). The spectra are not the same between somatic and meiotic processes, however (Fig. 4). A direct test of the spectrum conditional on the mutating base only shows very strong differences between meiotic and somatic mutation (χ² = 14.42 (A), 35.68 (G), 22.02 (T), and 7.82 (C); with the exception of C, all values are significant at the 0.01 level).
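Expressing a substitution "relative to the mutating base" can be sketched as a small classifier that maps each change onto one of three outcomes: the transition partner, the complement, or the transition partner's complement. The function and dictionary names are hypothetical.

```python
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def relative_outcome(original: str, mutated: str) -> str:
    """Classify a substitution relative to the base that mutated."""
    if original == mutated:
        raise ValueError("not a substitution")
    if mutated == TRANSITION_PARTNER[original]:
        return "transition"
    if mutated == COMPLEMENT[original]:
        return "complement"
    # The only remaining base is the complement of the transition partner.
    return "transition partner's complement"
```

For example, A to G is a transition, A to T is a transversion to the complement, and A to C is a transversion to the complement of the transition partner G. Tabulating these three outcomes separately for each mutating base gives spectra that can be compared across bases and between data sets.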
| Discussion |
|---|
A marked asymmetry between the mutability under SH of thymidine and
that of adenine has been noted previously and taken as evidence for
strand bias of the hypermutation mechanism (42). We also
find a higher mutability at A than at T and that this asymmetry is much
greater than any singlet asymmetry under meiotic mutation. But we also
find that when this overall mutability difference is factored out, the
microsequence specificity at A is very similar to that at T (Fig. 1 and
Table III). Similar findings have been reported (23, 24)
and used to justify the conclusion that both strands are targeted by SH
and that two mechanisms, one strand-unbiased mutating G and C and the
other strand-biased acting on A and T, operate. We find, however, that
the triplet mutabilities are surprisingly complementation symmetric for
both A/T and G/C mutations. In fact, once the single-nucleotide
mutabilities have been taken into account, the triplet symmetry is
evident for SH. Whether triplet symmetry appears in meiotic mutation
depends strongly on whether the triplets that span CG dinucleotides are
included in the calculation of the correlation coefficient. Thus,
although we also conclude that there are two distinct components of SH
targeting, we find that they share similar strand symmetry.
With certain well-defined exceptions, the sequence specificities of mutational targeting underlying meiotic and somatic mutation are significantly correlated. This is quite remarkable because the time scales over which these changes have accrued differ by about 7 orders of magnitude (about 1 mo for SH and on the order of a million years for meiotic mutation). This would be expected if mutations under SH are introduced by catalytic enhancement of the processes responsible for meiotic mutations. Thus, if a major proportion of mutations introduced during evolution occur at strand breaks, then SH hastens the introduction of these breaks, but they are introduced in the same places. In this sense, the reaction resembles true catalysis.
The differences in the triplet mutabilities between somatic and meiotic mutation are largely attributable to three effects: 1) The mutability of triplets containing CG dinucleotides is much higher under meiotic mutation than under SH. The high mutability of CG dinucleotides is a well-understood consequence of the methylation of such dinucleotides (43). This excess mutability has been seen in studies of pseudogene-ortholog pairs (29) and in surveys of genetic lesions associated with human genetic disease (44). 2) The mutability of the triplet AGC and its complement GCT is considerably higher under SH than under meiotic mutation. This is the well-known serine hot spot (27, 36). 3) The mutability of triplets of the form WAN is higher under SH. The mutabilities of the triplets within each of the two subsets (WAN and SAN, where S is G or C) are correlated with those in the meiotic data set. Although the pattern is weaker for T mutating, the complementary triplets NTW also segregate at higher mutability from the triplets NTS, and both sets are correlated with the meiotic mutabilities. The overarching similarities between somatic and meiotic mutation targeting, punctuated sharply by specific differences, suggest that two components are involved in the targeting: a "background" mechanism that has recruited and modified components of the DNA repair machinery, and a mechanism, perhaps novel, specific to AGC/GCT triplets (see below).
We also investigated the relationships between the mutation spectra under somatic and meiotic mutation. It was previously suggested that the two processes may be related because both result in an excess of transitions over transversions (22). We find, however, that the proportion of transitions is significantly smaller under SH. The effect of this is that the rate of replacement mutations is higher under SH and, consequently, so is the net rate of diversification. Both of these effects are consistent with diversification under SH being advantageous whereas mutations under meiotic mutation presumably are merely unavoidable.
We have previously shown that the mutation spectrum under SH is microsequence dependent: what a nucleotide mutates to is influenced by what its neighbors are (25). We compared this spectrum to that previously inferred from a set of meiotic mutations and found no correlations. That meiotic data set, however, combined information from triplets and their complements; furthermore, the mutations were inferred by a somewhat different process than the one we use here. The more comprehensive comparison here confirms the previous result: although there are significant effects of neighboring nucleotides on the mutation spectrum in both meiotic and somatic processes, the triplet dependencies are uncorrelated.
The following model is consistent with the findings thus far, though it
is certainly not uniquely so. An initial lesion is created in the
dsDNA. The targeting at this point is symmetric: sense strand
XAZ is affected just as frequently as sense strand
Z′TX′, where a prime denotes the complementary nucleotide. This occurs naturally if the lesion is a
double-strand break, consistent with the findings of Sale and Neuberger
(8). In fact, the complementation symmetry of targeting
even suggests a staggered cut. In a blunt cut, the complementary
nucleotides are not in equivalent states: one is 3' of the break and
the other is 5' of it. A staggered cut that also breaks the base
pairing leaves the two nucleotides both 5' or both 3' of the break,
though now on opposite sides of it. Furthermore, both are unpaired and
overhanging. Note that the apparent strand asymmetry can now be
viewed as the asymmetry between the DNA 5' and 3' of the break. The
probability that religation is mutagenic now depends on which side
of the break the purine is on, with the probability of mutagenic repair
higher if the purine is on the plus strand. This would result if, for
example, purines are more susceptible to excision when overhanging and
gaps in the plus strand (or 5' of the double-stranded break) are less
likely to be repaired correctly.
Several studies have found reduced mutation rates in mismatch repair-deficient mice (11, 14, 16) and relative enhancement of mutations at the AGC/GCT hot spots (16) or at G and C bases (13, 15). Rada et al. (16) inferred from this observation that the mutator has two components, one that is dependent on the mismatch repair protein MSH-2 and another that is MSH-2 independent. We concur and suggest that MSH-2 is responsible for introducing lesions as described above and leaves the signature of catalytically enhanced meiotic mutation. A second component, as yet unidentified, is targeted specifically at AGC/GCT triplets or at the palindromic quadruplet AGCT (L. G. Cowell and T. B. Kepler, manuscript in preparation), which contains both triplet motifs, and introduces lesions preferentially at these sites. One candidate for the unknown molecule is a modified site-specific methylase. Other groups have hypothesized the presence of a two-component mutator (21, 22, 23, 24), consistent with the observation that G and C are mutated more frequently in the murine cell line 18-81 (26) and the Burkitt lymphoma line Ramos (8). Furthermore, the G · C-targeting component is argued to have arisen first (or been co-opted first by SH) (22), consistent with the observations that AGC/GCT or G and C are preferentially targeted in shark (45) and Xenopus (46).
The identity of the molecules involved in somatic hypermutation will surely be revealed soon, but even after their names are known, it will remain to learn how they do what they do. For this task, careful analysis of the mutation patterns will be essential.
| Appendix 1 |
|---|
The model underlying the data analysis is that of two sets of mutabilities which are linearly correlated and which give rise to binomial (count) data. The task is to estimate the linear correlation coefficient. The difficulty is that the binomial sampling variability is independent (i.e., uncorrelated); it is only the indirectly observed mutabilities that are correlated. The estimation is as follows.
The adjusted counts for each motif $k$ are designated by $n_{ik}$, where $i = 1, 2$ is the group index (somatic or meiotic; triplet or complement) and $k$ designates the motif. Similarly, $m_{ik}$ denotes the number of mutated occurrences of motif $k$ in group $i$. For each of the four nucleotides, the number of triplets is denoted by $K$. A dot in place of an index denotes summation over that index.
The estimators for the correlation coefficients are computed as:
![]() | (3) |
![]() | (4) |
![]() | (5) |
![]() | (6) |
| Acknowledgments |
|---|
| Footnotes |
|---|
2 Address correspondence and reprint requests to Dr. Thomas B. Kepler, The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501.
3 Abbreviation used in this paper: SH, somatic hypermutation.
4 We use the term "mutability" rather than "mutation rate" to emphasize its role as a property of the DNA sequence itself.
Received for publication December 8, 1999.
| References |
|---|
RNA
DNA information loop. Mol. Immunol. 24:667.
and H alleles and is unevenly distributed in rearranged alleles. J. Immunol. 146:3218.
genes: unequal distribution of mutation in 5' and 3' flanking regions. Int. Immunol. 5:255.
and its 5' flanking sequences determines the location of somatic mutations in the J
locus. J. Immunol. 146:3652.