Biogen Idec Inc., Cambridge, MA 02142
| Abstract |
|---|
4% for core residues, 5% for surface residues, and 12% for residues in common Ag-contacting positions, excepting those coded by the D gene.
| Introduction |
|---|
Much of somatic hypermutation research to date has focused on defining the molecular mechanism. This aspect has been reviewed recently (2, 3, 4, 5) and, although much remains to be learned, certain sequence-related aspects are clear. The enzyme AID (activation-induced cytidine deaminase) deaminates cytidine residues, converting them to uracil and initiating the somatic hypermutation process (6). Further processing of the local DNA may involve excision of the uracil base and repair by certain error-prone polymerases (see reviews in Refs. 3, 4, 5). The overall process favors single-base transitions over transversions at an approximately 3:1 ratio (7). Insertions and deletions also occur but are considerably less common (8, 9). Certain four-base DNA sequence motifs, called hotspots, are correlated with the mutation locations. The two most commonly cited four-base motifs are RGYW (10) and its inverse repeat, WRCY (11), where R denotes a purine base, Y a pyrimidine base, and W an A or a T base.
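The hotspot motifs can be located in a DNA sequence with a simple pattern search. The following is a minimal sketch, not the procedure used in this work: the function names (`motif_to_regex`, `find_hotspots`, `reverse_complement`) are illustrative, and the mapping of R, Y, and W follows the definitions given above.

```python
import re

# Ambiguity codes used by the motifs: R = purine, Y = pyrimidine, W = A or T.
IUPAC = {"R": "[AG]", "Y": "[CT]", "W": "[AT]",
         "A": "A", "C": "C", "G": "G", "T": "T"}

def motif_to_regex(motif):
    """Turn a motif such as 'RGYW' into a regex; the lookahead finds overlapping hits."""
    body = "".join(IUPAC[base] for base in motif)
    return re.compile(f"(?=({body}))")

def find_hotspots(dna, motifs=("RGYW", "WRCY")):
    """Return sorted (start, motif, matched_bases) tuples, with 0-based start offsets."""
    dna = dna.upper()
    hits = []
    for motif in motifs:
        for m in motif_to_regex(motif).finditer(dna):
            hits.append((m.start(), motif, m.group(1)))
    return sorted(hits)

def reverse_complement(motif):
    """Reverse-complement a motif written with A, C, G, T, R, Y, W."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "R": "Y", "Y": "R", "W": "W"}
    return "".join(comp[base] for base in reversed(motif))
```

For example, `find_hotspots("AGCT")` reports both motifs at offset 0, because AGCT satisfies RGYW and WRCY simultaneously, and `reverse_complement("RGYW")` gives `"WRCY"`.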
It is not always clear how the mutations selected during the affinity maturation process contribute to improving binding affinity or other properties such as selectivity or stability. Only a few studies exist that look at the effect of single somatic hypermutations (12). One expects the residues in contact with the Ag to be the most critical, but often large affinity improvements come from nonsurface mutations. Daugherty et al. (13) used phage display on Escherichia coli to improve the affinity of a single-chain Fv molecule to cardiac glycoside digoxigenin and found that all of the affinity enhancements occurred at non-Ag-contacting residues. Similarly, Zahnd et al. (14) used ribosome display to improve single-chain Fv binding to a peptide Ag and found that the most effective mutation was not at the interface with the Ag. From our own work redesigning Ab-Ag interfaces (34), we find that even Ag-contacting mutation effects are not always rationalizable or predictable from three-dimensional structure-based energetic calculations. Thus, there is great potential to learn from the natural affinity maturation process.
The work presented here seeks to better define the results of the somatic hypermutation process and to investigate sequence-related aspects of protein-protein recognition. Given a mature sequence, a probable germline sequence resulting from the recombination of species-specific V, D, and J genes (15, 16) can be derived using sequence matching and consideration of the known mechanistic rules of V(D)J recombination (see Materials and Methods). Once the germline chain sequence is known, simple comparison with the aligned mature sequence yields the position and type of mutation that occurred during the affinity maturation process.
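The final comparison step can be sketched in a few lines. This is an illustrative example only, assuming the germline and mature protein sequences are already aligned to equal length with `-` as the gap character; `list_mutations` is a hypothetical helper name, not code from this work.

```python
def list_mutations(germline, mature, gap="-"):
    """Compare two aligned protein sequences position by position.

    Returns (position, germline_residue, mature_residue) tuples with 1-based
    positions. Positions where either sequence has a gap are skipped.
    """
    if len(germline) != len(mature):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    for pos, (g, m) in enumerate(zip(germline, mature), start=1):
        if g == gap or m == gap:
            continue
        if g != m:
            mutations.append((pos, g, m))
    return mutations
```

Calling `list_mutations("QVQLYS", "QVKLGS")` yields a Q to K change at position 3 and a Y to G change at position 5.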
Tomlinson et al. (17) have previously used a similar type of analysis to analyze the diversity of amino acids at specific positions in the germline and mature Ab sequences. They found that the frequency of somatic hypermutation and the diversity of the germline sequences are highest in the CDRs. Rather than focus on the mutation frequencies, we examine the type of mutation and its functional implications deduced from the location in the structure. The results indicate that residue type changes during the somatic hypermutation process are significant and have underlying functional rationales.
| Materials and Methods |
|---|
The V(D)J germline fitting algorithm is intended to provide reasonable solutions for the germline Ab sequence based on the mature protein sequence for the large majority of cases. It is recognized that no such fitting procedure can be completely free of ambiguities. In some cases, the length of the D region or the lack of residues in many publicly available sequences gives insufficient information for a certain fit. The flexible joining mechanisms of the natural process also limit the reliability of the method. For example, the natural N nucleotide addition process inserts random bases into the sequence, making the concept of a precursor gene at those positions inapplicable. In all situations, the default algorithm behavior is to assume that there was no mutation and that the germline sequence at ambiguous positions is identical with the mature sequence.
The gene fitting algorithms for the H and L chains share some basic features. In each case, the mature protein sequence is converted to a most probable DNA sequence using mouse or human codon probability tables. Comparisons with the germline V(D)J gene sequences must be done with the DNA sequence, because the germline genes may be read in multiple reading frames (only D and J genes) depending on their position in the final sequence and because some sequences, especially D genes, are too short to reliably match using the reduced information in the protein sequences. Readers interested in the most exhaustive germline fitting procedure for the D gene region should consult the work of Monod et al. (18). Once the most probable mature DNA sequence is determined, V, J, and, if appropriate, D genes from the known library of genes are sequentially fit to the sequence. V(D)J genes are taken from the ImMunoGeneTics (IMGT) database (15, 16). The fitting is done exhaustively for individual gene possibilities by sliding the library gene along the mature DNA sequence and recording the number of matching bases at each offset. Although simple and easy to code, this approach is more reliable than a BLAST search because of the short sequences involved. It is affordable because of the small number of possible germline V(D)J genes; analysis of an L/H chain pair takes a couple of minutes. Fitting for the whole data set can be done in 12 days on a computer cluster and could be done considerably faster if the method and code were to be optimized. It should be noted that a useful, fast, and accessible algorithm for blasting against the V genes (IgBLAST) is available on the National Center for Biotechnology Information website (www.ncbi.nlm.nih.gov/igblast).
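The back-translation and sliding-match steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in this work: `best_codon` is a hypothetical dictionary mapping each amino acid to its most probable codon, the gene library is a plain dictionary of name to DNA sequence, and partial overlaps at either end of the mature sequence are allowed.

```python
def back_translate(protein, best_codon):
    """Convert a protein sequence to DNA using one most-probable codon per residue."""
    return "".join(best_codon[aa] for aa in protein)

def slide_match(mature_dna, gene):
    """Slide `gene` along `mature_dna`; return {offset: number of matching bases}.

    The offset is the index in mature_dna where gene[0] is placed. Negative
    offsets let the gene overhang the start of the mature sequence.
    """
    scores = {}
    for offset in range(-(len(gene) - 1), len(mature_dna)):
        matches = 0
        for i, base in enumerate(gene):
            j = offset + i
            if 0 <= j < len(mature_dna) and mature_dna[j] == base:
                matches += 1
        scores[offset] = matches
    return scores

def best_gene(mature_dna, library):
    """Return (gene_name, offset, matches) for the best-scoring library gene."""
    if not library:
        raise ValueError("gene library is empty")
    best = None
    for name, gene in library.items():
        scores = slide_match(mature_dna, gene)
        offset = max(scores, key=scores.get)  # first offset with the top score
        if best is None or scores[offset] > best[2]:
            best = (name, offset, scores[offset])
    return best
```

The exhaustive scan costs on the order of (gene length) x (sequence length) comparisons per library gene, which is affordable for the small germline libraries involved.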
The L and H chain matching is handled using a similar exhaustive procedure. In each case the V genes are matched first near the N terminus of the sequence, and the best match is retained. The unmatched portion of the sequence is then extracted, and the best J gene match is found using a similar procedure. For the L chains, this process is repeated for each possible V gene to account for the possibility that an overly long V gene was chosen because it simply adds extra base matches near the C-terminal end rather than improving the final fit. There are cases where a V gene with a lower match score leads to a better overall fit because it allows room for a better-fitting J gene. The H chain is handled similarly; for each possible V gene match the entire set of possible J genes is placed at the end of the sequence. For each V-J combination, the remaining unmatched sequence is fit to the D gene library. Most but not all H chains have a D gene segment inserted between the V gene and J gene portions.
Variability in the natural joining mechanisms is accounted for by allowing two amino acids worth of flexibility during fitting. For example, after the V gene segment is fit, the unmatched portion to which the J gene will be fit is extended two residues (six bases) of mature DNA sequence into the V gene region. This is done to allow for the possibility that part of the germline V gene was deleted during splicing. For a similar reason the unfit stretch of DNA in the D gene region is extended six bases into the bordering V and J gene regions. Even with this flexibility, many D genes are often excluded because the gap between the V and J genes is too small. Conversely, if there is a gap between segments, then bases unmatched by the best gene are always assumed to be identical with those in the mature sequence, and no mutations are recorded.
The algorithm was tested on cases in which the V(D)J fitting had been performed manually. Additionally, D gene fitting was tested on cases where the DNA sequence was known. In the small number of cases examined, translating from the protein sequence to the most probable DNA sequence was sufficient to distinguish between available D gene sequences. Typically, one or two related D genes had match scores clearly separated from all other lower-scoring D gene matches.
In some cases the algorithm clearly returns nonsensical matches, as indicated by long stretches of poor base matches. The failures can be traced back to humanized Abs, to Abs from species outside the human and mouse libraries used, or to insertions/deletions. Germline sequence determinations requiring >20-amino acid mutations relative to the mature sequence are discarded on the assumption that the problem is inappropriate to the sequence or that the algorithm has failed. Sequences with more than five consecutive mutated residues are discarded for similar reasons. Some sequences have so little information in the D and J gene regions that fitting is impractical. For this reason, solutions having mutations at >40% of the D and J gene region positions are also discarded.
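The three discard criteria can be expressed as a single filter. The sketch below assumes aligned germline and mature protein sequences of equal length and a list of 0-based indices marking the D and J gene region; the function names and argument layout are hypothetical, while the thresholds are those stated above.

```python
def longest_mutated_run(germline, mature):
    """Length of the longest stretch of consecutive mismatched positions."""
    longest = run = 0
    for g, m in zip(germline, mature):
        run = run + 1 if g != m else 0
        longest = max(longest, run)
    return longest

def is_plausible_fit(germline, mature, dj_positions,
                     max_mutations=20, max_run=5, max_dj_fraction=0.40):
    """Apply the discard rules: >20 mutations, >5 consecutive, >40% of D/J positions."""
    n_mut = sum(g != m for g, m in zip(germline, mature))
    if n_mut > max_mutations:
        return False
    if longest_mutated_run(germline, mature) > max_run:
        return False
    if dj_positions:
        dj_mut = sum(germline[i] != mature[i] for i in dj_positions)
        if dj_mut / len(dj_positions) > max_dj_fraction:
            return False
    return True
```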
Calculation of mutation probabilities
A simple estimate for the probability of making a transition from residue type i to residue type j (Pij) with a single base change can be calculated using the codon definitions and their known usage frequencies. A given number of codons, M (N), code for each residue type i (j). For a given i to j transition, the probability is the sum of individual possible codon m to n transitions (Equation 1).
[Equation 1]
mn is unity. This simple transition probability (Pij) can be renormalized using the residue type usage in the germline genes to form the background of Fig. 1A.
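One way to compute such a single-base transition probability is sketched below. This is an assumption-laden illustration, not a reconstruction of Equation 1: it weights each codon m of residue i by its relative usage within that residue (uniform if no usage table is supplied) and assumes each of the nine possible single-base changes of a codon is equally likely. The names `CODON_TABLE`, `single_base_neighbors`, and `transition_probability` are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def single_base_neighbors(codon):
    """Yield the nine codons that differ from `codon` at exactly one base."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]

def transition_probability(res_i, res_j, codon_usage=None):
    """Estimate P(i -> j) for one random single-base change.

    codon_usage: optional {codon: frequency}; only relative values within
    residue i matter. Assumption: all nine single-base changes are equally likely.
    """
    codons_i = [c for c, aa in CODON_TABLE.items() if aa == res_i]
    if codon_usage is None:
        weights = {c: 1.0 / len(codons_i) for c in codons_i}
    else:
        total = sum(codon_usage[c] for c in codons_i)
        weights = {c: codon_usage[c] / total for c in codons_i}
    p = 0.0
    for m in codons_i:
        hits = sum(1 for n in single_base_neighbors(m) if CODON_TABLE[n] == res_j)
        p += weights[m] * hits / 9.0
    return p
```

With uniform codon weights this sketch gives 1/9 for K to R and 1/27 for R to K, a threefold difference that arises purely from arginine having six codons and lysine two.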
[Equation 2]
0.04) of residue pairs have C-C separation distances of <6 Å. This leads to an estimate of 68 disulfide-producing events (Pds ≈ 0.003) in the 22,320 sequences examined.
Regional assignments from sequence position
All Ab sequences were first aligned to the set (AAAAA, AHo's Amazing Atlas of Antibody Anatomy) provided by Honegger et al. (19). We use this alignment and numbering system because it leaves gaps for almost all loop lengths, gives a structural correspondence between light and heavy positions, and provides numerical values for all positions. The Kabat system uses alphanumeric numbering for some loop positions and can be cumbersome in some cases.
Position definitions were derived from Protein Data Bank structures of Abs in contact with peptide or protein Ags. Ab-Ag interface positions are occupied by residues that face the Ag and are within 12 Å of the nearest Ag atom for at least one Protein Data Bank structure. Because the positions were assigned based on a small set of structures, residues occupying Ab-Ag positions do not always contact the Ag for every sequence and may be solvent exposed. Residue positions on the VL-VH interface were defined similarly using an 8-Å cutoff but were required to meet this requirement in at least half of the structures. Core positions are those that point inward to the core of the variable domain and typically expose <10 Å². The surface positions are taken to be the remainder of the occupied positions. Some positions may have more than one assignment, particularly those that may both contact the Ag and be present at the VL-VH interface.
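The assignment rules can be summarized in code. The sketch below is illustrative only and rests on several assumptions: per-position distances have already been measured across a set of structures, the Ag distance lists contain only residues that face the Ag, and exposed surface area per position is available. The function name and input dictionaries are hypothetical.

```python
def assign_regions(positions, ag_dist, vlvh_dist, exposed_area,
                   ag_cutoff=12.0, vlvh_cutoff=8.0, core_area=10.0):
    """Assign each position a set of region labels.

    ag_dist / vlvh_dist: {position: [distance in each structure, in angstroms]}
    exposed_area: {position: typical exposed area in square angstroms}
    A position can carry more than one label; unlabeled positions are 'surface'.
    """
    regions = {}
    for pos in positions:
        labels = set()
        # Ab-Ag interface: within the cutoff in at least one structure.
        if any(d <= ag_cutoff for d in ag_dist.get(pos, [])):
            labels.add("Ab-Ag")
        # VL-VH interface: within the cutoff in at least half of the structures.
        d_vv = vlvh_dist.get(pos, [])
        if d_vv and sum(d <= vlvh_cutoff for d in d_vv) >= len(d_vv) / 2:
            labels.add("VL-VH")
        # Core: little exposed area.
        if exposed_area.get(pos, float("inf")) < core_area:
            labels.add("core")
        if not labels:
            labels.add("surface")
        regions[pos] = labels
    return regions
```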
| Results |
|---|
or variable domain sequences were fit to known germline V, D, and J genes using the methodology described in the Materials and Methods section. Seventy-two percent of the light and 62% of the heavy sequences yielded plausible (see Materials and Methods section for definition) germline solutions and contributed to the results presented in this section. Approximately half of the sequences that contribute to the final dataset are human. All results are for the combined mouse and human datasets unless noted. Due to the higher uncertainty of fitting the short D gene segments to the mature sequence, all mutation information from the D gene region is omitted unless specifically noted.
Full dataset: mutation type and location
A comparison of the residue mutation type frequency with that observed in regular evolutionary relationships shows some basic similarity. Fig. 1A shows the prevalence of residue type change mutations in the full dataset (circles) using a color scale. For comparison, the expected mutation frequency has been calculated assuming a single DNA base change per codon and human codon usage. These expected frequencies are visible as the background color in each box. There is a clear qualitative relationship between expected and observed frequencies. An additional comparison with expected mutation frequencies comes from sequence alignment scoring matrices. Mutation frequencies can be derived from a multiple sequence alignment by observing residue type variability at a given position. If this information is compiled in a residue type-specific fashion and averaged over all positions, then a global view of how easily a given residue type can substitute for another can be compiled. The BLOSUM62 data provides such mutation frequencies derived from protein sequences with >62% sequence homology (23).
There is a similarly loose relationship between the observed frequency and the mutation frequency expected from the BLOSUM62 data as shown in Fig. 1B. From these tests we conclude that the residue type changes during somatic hypermutation are similar overall to those seen during evolution but may show deviation in the finer details.
In this overall dataset, there is some conservation of hydrophobicity (clustering around the diagonal in Fig. 1A), a relatively small number of mutations involving proline, cysteine, and tryptophan amino acids, and a number of scattered higher-probability mutation types. Comparison of observed Ab (Fig. 1A, overlaid circles) with the expected (Fig. 1A, background squares) mutations indicates that at least some of the higher-probability mutation types, such as T→S and A→V, have similar frequencies. Others, such as F→L and Y→G transitions, are less frequently and more frequently seen, respectively, in Abs. Further trends will become more pronounced in region-specific subsets of the data.
There is often a clear imbalance in the frequency of mutations from a given residue type to another (X→Y) compared with the reverse (Y→X). This imbalance is apparent in the lack of symmetry in the matrix and is presented in Fig. 2, which shows the same data processed to show deviations from the average of X→Y and Y→X entries. Part of the imbalance can be accounted for by differences in codon usage. For example, K→R mutations are 3.3 times more common than R→K mutations, and this difference could be explained by the 6:2 ratio of R to K codons and slightly lower relative usage of arginine in the germline. Other imbalances, such as A→V, have no codon number or germline usage bias. For comparison, the BLOSUM matrices and other residue substitution scoring matrices are symmetric by construction.
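The deviation from the average of the forward and reverse entries can be computed as below. This is one plausible formulation, offered as a sketch: the exact normalization used for Fig. 2 is not stated, so the division by the mean is an assumption, and the counts in the usage note are invented toy values chosen only to reproduce a 3.3-fold ratio.

```python
def asymmetry(counts, x, y):
    """Fractional deviation of the x->y count from the mean of x->y and y->x.

    counts: {(from_residue, to_residue): observed count}
    Returns a value in [-1, 1]; positive means x->y is favored over y->x.
    """
    fwd = counts.get((x, y), 0)
    rev = counts.get((y, x), 0)
    mean = (fwd + rev) / 2
    if mean == 0:
        return 0.0
    return (fwd - mean) / mean
```

With toy counts `{("K", "R"): 33, ("R", "K"): 10}`, the K to R entry is about 53% above the pair average and the R to K entry is the same amount below it, so the two values always sum to zero.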
Region-specific mutation type and location
The region of the variable domain where a mutation occurs strongly determines its effect, if any, on the function of the Ab. To a first approximation, the residues on the Ab-Ag interface determine specificity and affinity, whereas those in the core partially determine stability. Mutations at the interface between the two variable domains will also contribute to stability (24) and binding affinity (25) and may contribute to the relative orientation of the domains. Surface mutations outside the interfaces may affect solvation properties and aggregation propensity but are expected to have little effect as long as polar surface area is maintained.
Fig. 5 gives the distribution of the number of mutations within each region. The residue positions were distributed into the Ab-Ag interface, VL-VH interface, core, and surface regions using analysis of Protein Data Bank Ab-Ag complex structures as described in the Materials and Methods section. It can be seen that the mutations on the Ab-Ag interface span the broadest range, with a mean of 3.1 mutations per chain. Surface and VL-VH mutations are less frequent at 2.3 and 1.5 mutations per chain on average. Core mutations are least frequent at 1.3 mutations per chain. The distributions for the L and H chains are qualitatively similar (not shown). For the H chain, region-specific averages are 30% higher than those in the L chain. On average, L and H chains have 6.8 and 8.6 mutations per chain, respectively.
80% (31% in mouse and 104% in human) from the germline composition. Proline usage also increases during maturation by 42% (32% in mouse and 50% in human), as does histidine (36% in mouse and 82% in human). The residues showing these large increases in mature Ab use are almost absent in the germline but occur significantly in mature sequences (see Fig. 6).
Proline usage changes in L and H chains are seen mainly in turn regions and are qualitatively very similar between chains. Residue positions 15, 48, and 98 (Kabat light, 14, 40, and 80; Kabat heavy, 14, 41, and 84) are in the β-hairpin turns nearest the constant domain and show pronounced mutation activity. The turn in the L3 and H3 loops tends to gain prolines at position 135 (Kabat light, 95e; Kabat heavy, 100h). Similarly, prolines tend to appear at position 52 (Kabat light, 44; Kabat heavy, 45), which is a kinked position in the strand that forms part of the VL-VH interface.
Other interesting composition trends occur during maturation. The most striking is an increase in the relative number of cysteine residues on the surface. The low abundance of cysteines on the surface means that 181 net mutation events in the human dataset translate to a 178% increase (see Fig. 8B). One expects that additional cysteines could lead to creation of nonnative disulfide bonds and inhibit expression. Presumably, the new cysteines observed in this work do not have a destabilizing effect large enough to be selected against. In a small number (<50) of cases, two cysteines are apparently created, and in 12 cases they have sequence positions that could lead to the formation of an additional stabilizing disulfide bond. A probability estimate of this event occurring randomly indicates that six times more potential disulfide bonds should have been observed (see Materials and Methods). The discrepancy could be partially attributed to the observed order of magnitude lower probability of mutating to a cysteine than the random mutation probability used in the calculation. Histidine residues are gained consistently in each of the regions for both species except on the surface of the mouse Abs. There is an 80% increase in histidine usage on both the surface and the Ab-Ag interface of human Abs during maturation. For the mouse, the increase on the Ab-Ag interface is less pronounced at 36%. This increase in the number of histidines is the largest of the core residue position changes (Figs. 7C and 8C), where comparatively little composition change is seen.
It is reasonable to ask whether the residue types in which the hotspots occur (RGYW or WRCY) bias the residue composition changes. According to Fig. 4, the highest probability location for a mutation is inside a hotspot. Consequently, one might expect a tendency for residue types that are consistently mutated to be more frequently present in hotspots. Fig. 9 shows the fraction of each germline residue type in each of the RGYW or WRCY motifs. More than 50% of the tyrosine residues in the germline are located in WRCY motifs. Generally, the residue types that are lost on the Ab-Ag interface (glutamine, serine, tyrosine, and tryptophan) all have some of the highest propensities to be coded in hotspots.
D and K→R, which are charge conserving. Others, such as N→D and Q→E, conserve size but change charge or change predominantly the hydrophobicity, such as S→T and F→L.
| Discussion |
|---|
One of the most striking results shown is the overall tendency for germline tyrosine, serine, and tryptophan residues to mutate to other residue types. Germline genes are biased to place these versatile residue types on the Ab-Ag interface. We speculate that the tyrosines are there to promote low-affinity binding of the naive germline Abs to new Ags. Tyrosine can provide hydrogen bonding and substantial hydrophobic interactions. Tryptophan is the most hydrophobic of residues and arguably the best for nonspecific interactions. Serine can hydrogen bond while minimizing potential steric conflicts. Both tyrosine and serine have a polar character and are likely to help maintain solution stability. In support of these ideas, Fellouse et al. (28, 29) have used phage display techniques to show that tyrosine and a combination of small flexible residues are sufficient to recognize a number of Ags, including human and mouse vascular endothelial growth factor molecules.
Not only does nature bias the interface with promiscuous binding residue types, but it provides a maturation strategy capable of making efficient and productive changes to them. Tyrosine, serine, and glycine residues are well represented in germline mutation hotspots (Fig. 9), increasing their tendency to mutate. The presence of serines in hotspots has been previously noted and suggested to be a strategy for improving the targeting of CDRs (30). If one examines into which residue types the tyrosines mutate, one sees that they most often become the small residues glycine and serine. Aspartate and asparagine, two short-chain charged and uncharged residues, are also favored, perhaps to provide ion-pair and H-bond interactions. Phenylalanine may be favored because of its similarity to tyrosine and greater hydrophobicity. Tryptophan often becomes arginine, glycine, serine, and leucine. It may be converted to eliminate steric conflicts and minimize nonspecific interactions that could lead to aggregation.
The low usage of histidine and methionine at the Ab-Ag interface (Fig. 6) is puzzling. Next to proline and cysteine, which often play specific structural and functional roles, histidine and methionine are least used on the interface. The work of Lo Conte et al. (26) shows that these residues are used less on Ab-Ag interfaces than on other protein-protein interfaces. One possible explanation for the low use of histidine and methionine is the hypothesis that they have only recently been introduced into the amino acid repertoire on an evolutionary time scale (31). These amino acids may be underused relative to their potential. If under-utilization in the germline can explain their low usage, then one would expect their usage to increase in the mature Abs. Figs. 7 and 8 show that this is consistently the case for histidine in most regions, but only at the Ab-Ag interface for methionine. A more structurally oriented hypothesis can also be used to explain low methionine usage on the interface. Methionine has a long side chain and is likely to lose comparatively more entropy on binding relative to its gain in enthalpy, resulting in net lower binding affinity. Relative to methionine, histidine usage may be favored at interfaces, because it can provide both hydrogen bonding and hydrophobic interactions.
One may also speculate on the usage of certain amino acids to modify loop characteristics with benefits for Ab stability and affinity. Proline residues are frequently created on the Ab-Ag interface (Figs. 7A and 8A), particularly in the H3 loop where they could stabilize beneficial loop conformations. Loop preorganization should reduce the entropy cost of binding and increase affinity (see Ref. 32 for a possible example). Similarly, Ab maturation sometimes leads to the introduction of disulfide bonds, which may stabilize the Ig fold or result in subtle conformational changes that lead to higher affinity. An example may be the anti-gp120 peptide complex structure (1ACY), where an additional disulfide is formed between CDR H1 and H2, which actually contacts the Ag (33). It is also notable that many of the excess tyrosines on the Ab-Ag interface of the germline chains are converted to glycines (see Fig. 10). In this case, the creation of glycines may allow for the backbone flexibility necessary for the remaining tyrosines or other nearby Ag-directed side chains to better contact the Ag.
Ab-Ag interaction systems are ideal for examining factors and strategies for improving binding affinity at protein-protein interfaces. The trends uncovered and discussed in this work illuminate the strategies that nature uses to bias immature Ab properties and subsequently refine them during the affinity maturation process. Residue type changes are clearly biased (Fig. 1A) by which codons are accessible given a single base change. Serine, tyrosine, and tryptophan are over-represented in the germline at common Ab-Ag positions and are some of the most frequently coded in the known mutation hotspot motifs. Remaining influences on the residue type changes are presumed to result from functional selection during the affinity maturation process. These functional selection aspects make the somatic hypermutation process a good model for evolution on an accelerated time scale.
| Footnotes |
|---|
1 Address correspondence and reprint requests to Dr. Louis A. Clark and Dr. Herman W. T. van Vlijmen, Biogen Idec Inc., 14 Cambridge Center, Cambridge, MA 02142. E-mail addresses: louie{at}alumni.northwestern.edu and hvvlijme{at}tibbe.jnj.com
Received for publication November 4, 2005. Accepted for publication April 19, 2006.