

* Repertoire Analysis Group, Autoimmunity Branch, National Institute of Arthritis and Musculoskeletal and Skin Diseases,
Division of Computational Bioscience, Center for Information Technology, and
Biodata Mining and Discovery Section, Office of Science and Technology, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, MD 20892
| Abstract |
|---|
|
|
|---|
| Introduction |
|---|
|
|
|---|
Despite its crucial role in determining the nature and specificity of the Ag binding capability of the Ab molecule, the human CDR3H has not yet been fully characterized. One reason for this lack of information is the difficulty of analyzing the sequences of this highly diverse region. Identifying the D segment used in the CDR3H is complicated by the overall similarity of the germline D segment sequences and by the extensive exonuclease and TdT modification of the D segment. This has made precise identification of the components of the CDR3H difficult. The underlying assumptions of the analytical instruments used have added to the difficulty. Available software tools for Ig gene analysis, such as DNAPLOT (Centre for Protein Engineering, http://www.mrc-cpe.cam.ac.uk) and the Immunogenetic database (IMGT; 6), which assess the germline genes that most closely match the given gene sequence, use an alignment scoring system (7). This method is usually straightforward in the VH and JH regions, where there are large regions of sequence similarity. However, in the shorter D region, where mutations and terminal processing are common, this method is less successful. To assess the D segment more accurately, we have used a more intuitive scoring system to match D segments based upon consecutive nucleotide matches. This consecutive match scoring approach assigns a higher score for longer matches, and searches for a D segment core around which mutation or terminal processing occurs. This approach, along with an automated analytical instrument, JOINSOLVER, has made it possible to analyze a large number of human CDR3H sequences and to begin to understand the influences that shape this important Ag-binding region of Ab molecules.
It is notable that previous attempts to describe the CDR3H have often yielded conflicting results. For example, some reports claim the existence of D segment fusion in both human (8, 9, 10, 11) and murine (12, 13, 14) VHDJH sequences, the usage of DIR segments (8, 10, 15, 16, 17), and inverted D segments (10, 18, 19), whereas other analytic approaches have come to the conclusion that these are rare events in human sequences (20) and that D-D fusion is rare in mouse sequences (21). This is particularly important because it has been claimed in the mouse that the use of inverted D segments and DD fusions predispose to autoantibody formation (9, 13). Indeed, even the definition of the D segment is controversial, with some analyses using shorter consecutive nucleotide matches (8, 22, 23, 24, 25, 26), or allowing one mismatch (8, 25, 26), and others defining the D segment only when a match of 10 consecutive nucleotides is present (20).
In an effort to resolve some of these issues, we developed a new computer algorithm, JOINSOLVER, and used it to analyze a set of 650 VHDJH sequences amplified from normal adults. Monte Carlo simulations were used to establish the required length of a D segment match to establish identity. The use of these approaches has permitted a better understanding of the variability inherent in the human CDR3H.
| Materials and Methods |
|---|
|
|
|---|
The following Ig H chain sequences were analyzed: 1) a set of 400 sequences from genomic DNA obtained by single cell sorting of normal human adult peripheral CD5+/IgM+ and CD5−/IgM+ B cells (GenBank accession numbers Z80363-770); 2) a set of 116 sequences from cDNA obtained by single cell sorting of human tonsillar IgM+ and IgA+ CD3/CD19/CD38+++ plasma cells (GenBank accession numbers AY003749-869); 3) a set of 135 sequences from cloned cDNA of human adult peripheral IgM+ and IgG+ B cells (GenBank accession numbers Z68345-487).
Algorithm used by JOINSOLVER for sequence analysis
A software tool, JOINSOLVER, was developed specifically to analyze the CDR3H region of the Ig genes expressed by human B cells and is available at http://joinsolver.niams.nih.gov. The strategy of JOINSOLVER is to search for D germline sequences flanking VH and JH germline genes. Additionally, it searches for P and N nucleotide additions in the VHD and DHJ junctions. The database of human D germline genes used includes all D segments from the IMGT databank (6) as well as the reverse and DIR germline genes.
JOINSOLVER initially interrogates the sequence to find the beginning of the CDR3H region, which is defined as codon 93. This codon was used to define the beginning of the CDR3 based on the results of structural analyses of VHDJH rearrangements (27, 28) as recommended (2, 3, 4). To identify this codon, JOINSOLVER searches for the sequence "TAT TAC TGT", which comprises codons 90–92 of the VH region (after Kabat et al. (1)) and is a conserved motif in most of the human VH germline genes. If a "TAT TAC TGT" motif is not found, the search is reinitiated with a 1-bp change allowed in the sequence. If a "TAT TAC TGT" with one nucleotide change is not found, then homologies with the germline genes are used to find the most likely start of the CDR3H region. If the start of the CDR3H region is still not identified, JOINSOLVER marks the CDR3H as not found and defers finding the CDR3H region until after V and J matching.
After the VH end of the CDR3H is defined, JOINSOLVER screens for the JH border of the CDR3H. A "C TGG GG" motif demarks the 3' end of the CDR3H region and is conserved in all JH sequences. A similar algorithm is used to find the "C TGG GG" at the 3' end of the CDR3H.
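The motif search described above can be sketched as follows. This is a minimal illustration, not JOINSOLVER's code: the function names `find_motif` and `cdr3h_bounds` are hypothetical, the germline-homology fallback is omitted, and taking the first "C TGG GG" hit downstream of the VH motif is an assumption made for simplicity.

```python
def find_motif(seq, motif, max_mismatches=1):
    """Return (position, mismatches) of the first best hit, or None.

    Exact hits are tried first; hits with one base change are only
    considered when no exact hit exists, as in the text above.
    """
    seq = seq.upper()
    motif = motif.upper().replace(" ", "")
    for allowed in range(max_mismatches + 1):
        for i in range(len(seq) - len(motif) + 1):
            window = seq[i:i + len(motif)]
            mismatches = sum(a != b for a, b in zip(window, motif))
            if mismatches == allowed:
                return i, mismatches
    return None


def cdr3h_bounds(seq):
    """Return (start, c_index): the first base of codon 93 and the index
    of the "C" of "C TGG GG", or None if either motif is missing."""
    v_hit = find_motif(seq, "TAT TAC TGT")
    if v_hit is None:
        return None
    start = v_hit[0] + 9  # codons 90-92 span nine bases
    j_hit = find_motif(seq[start:], "C TGG GG")
    if j_hit is None:
        return None
    return start, start + j_hit[0]


# Toy sequence (not a real rearrangement)
toy = "GCC" + "TATTACTGT" + "GCGAGAGATTAC" + "CTGGGGCCAA"
print(cdr3h_bounds(toy))
```

Because the JH motif is only six bases long, allowing one mismatch can produce spurious hits in a real junction; a fuller implementation would presumably confirm the hit against the JH germline alignment.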
Once the CDR3H region is identified, VH, JH, and D assignment is conducted. The V region is matched to a database of germline genes from the "TAT TAC TGT" back 3'→5' toward the beginning of the sequence, and forward in the 5'→3' direction to the end of the germline gene. The JH region is matched from the "C TGG GG" back to the beginning of the germline gene and forward until the end of the sequence or the end of the germline gene is identified. The VH and JH regions are scored with an alignment score that assigns +5 to a nucleotide match and −4 to a mismatch between the unknown sequence and the germline (7).
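The +5/−4 scoring can be illustrated with a short ungapped comparison. This is a sketch under the assumption that the two sequences are already positioned base-for-base; the germline names in the example are placeholders, and `alignment_score` and `best_germline` are hypothetical helper names.

```python
MATCH_SCORE = 5
MISMATCH_SCORE = -4


def alignment_score(query, germline):
    """Sum +5 per matching position and -4 per mismatching position."""
    return sum(
        MATCH_SCORE if q == g else MISMATCH_SCORE
        for q, g in zip(query.upper(), germline.upper())
    )


def best_germline(query, germlines):
    """Return (name, score) of the highest scoring germline sequence."""
    scored = ((name, alignment_score(query, s)) for name, s in germlines.items())
    return max(scored, key=lambda item: item[1])


# Placeholder germline sequences, for illustration only
toy_db = {"toy_J_a": "ACTACTGGGG", "toy_J_b": "ACTTCTGGCG"}
print(best_germline("ACTACTGGGG", toy_db))
```

Under this scheme, 15 matches and 3 mismatches score 15 × 5 − 3 × 4 = 63, which agrees with the score reported for the DNAPLOT D6-25 alignment in Results.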
The end of the VH region is identified when the given unknown sequence matches the complete VH germline gene or has a mismatch after the "TAT TAC TGT" with the highest scoring VH germline. The beginning of the JH region is defined when the unknown sequence has one mismatch before the "C TGG GG" with the highest scoring JH region or the sequence matches the complete JH germline gene.
In the event that the CDR3H was initially not found, JOINSOLVER looks for matches between the V and J germline databases and the unknown sequence. The unknown sequence is aligned to the highest scoring germline genes. The CDR3H region is then defined as the region from codon 93 to the "C" of the "C TGG GG" motif. The VH end and JH start are defined in the same way as when the CDR3H region is found first.
After VH and JH assignment, D segment assignment is conducted using a consecutive match scoring system. All matches to the D germline genes are scored and sorted based on the VH-JH distance (the distance in nucleotides between the end of the VH segment and the beginning of the JH segment). The longest matches are aligned and returned to the user.
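Consecutive match scoring amounts to finding the longest run of identical nucleotides shared by the VH-JH region and each germline D segment. The sketch below uses a standard longest-common-substring table; the function names and the toy germline sequences are illustrative assumptions and do not reproduce JOINSOLVER's database or its tie-breaking rules.

```python
def longest_consecutive_match(region, germline):
    """Return (length, region_start, germline_start) of the longest
    stretch of consecutive identical nucleotides."""
    best = (0, 0, 0)
    prev = [0] * (len(germline) + 1)
    for i, r in enumerate(region, 1):
        cur = [0] * (len(germline) + 1)
        for j, g in enumerate(germline, 1):
            if r == g:
                # extend the run that ended at the previous base pair
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def rank_d_segments(region, d_germlines):
    """Score every germline D segment and sort by match length, longest first."""
    hits = [
        (name,) + longest_consecutive_match(region, seq)
        for name, seq in d_germlines.items()
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data, not real germline D segments
toy_d = {"toy_D_a": "AGGATATTGTAGT", "toy_D_b": "GTATTACGATTTTTGG"}
for hit in rank_d_segments("CCGGATATTGTAGTCC", toy_d):
    print(hit)
```

A single point mutation splits a run in two, which is why this score favors an intact D segment core over a longer but interrupted alignment.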
Monte Carlo simulation for D segment assignment
A Monte Carlo simulation was used to determine the probability of matching a randomly generated sequence of length ml to the database of known human D germline genes. To accomplish this, a randomly generated set of 1 × 10^5 sequences of a particular length, gl, was analyzed and searched for matches between the sequences and the D segment germline database. The value of gl is equivalent to the VH-JH distance. The error in this simulation is 1/√100,000, or 0.00316.
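A simulation of this kind can be sketched as follows. The assumptions are mine rather than the authors': bases are drawn uniformly from A, C, G, and T, a "match" means that the random sequence contains at least one germline substring of length ml, and the germline sequences shown are toy placeholders.

```python
import random
from math import sqrt


def germline_kmers(d_germlines, ml):
    """All substrings of length ml found in the germline sequences."""
    return {s[i:i + ml] for s in d_germlines for i in range(len(s) - ml + 1)}


def random_match_probability(d_germlines, gl, ml, n_trials=100_000, seed=0):
    """Estimate the chance that a random sequence of length gl contains
    ml consecutive nucleotides matching any germline sequence."""
    rng = random.Random(seed)
    kmers = germline_kmers(d_germlines, ml)
    hits = 0
    for _ in range(n_trials):
        seq = "".join(rng.choice("ACGT") for _ in range(gl))
        if any(seq[i:i + ml] in kmers for i in range(gl - ml + 1)):
            hits += 1
    return hits / n_trials


def simulation_error(n_trials):
    """Error estimate quoted in the text: 1 / sqrt(number of trials)."""
    return 1 / sqrt(n_trials)


toy_d = ["AGGATATTGTAGTGGTGGTAGCTGCTACTCC", "GTATTACGATTTTTGGAGTGGTTATTATACC"]
print(random_match_probability(toy_d, gl=30, ml=8, n_trials=2000))
print(round(simulation_error(100_000), 5))
```

Running the estimate over a grid of gl and ml values gives the kind of table from which a minimum match length per VH-JH distance can be read.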
Monte Carlo simulation for multiple D segment fusion assignment
A second Monte Carlo simulation was conducted using 1 × 10^6 randomly generated sequences for different values of gl, ml1, and ml2, where ml1 and ml2 are the lengths of the first and second D matches, respectively. Because it is less likely that a longer match is random, the longer match was assigned as the first match. One million random sequences were analyzed to reduce the error to a maximum of ±0.001.
Sequence analysis
Rearrangements were considered productive if the VHDJH junction maintained the reading frame (RF) into the JH segment and contained no stop codons in the germline D segment or CDR3H junctions. When the rearrangements failed to maintain the RF into the JH segment, or introduced stop codons during the rearrangement process, they were considered nonproductive. Junctional nucleotide additions between the VH and D or between D and JH segments were scored as: 1) P nucleotides, if they were inverted repeats at germline encoded ends; 2) N nucleotides, if they were nontemplated junctional additions. The junctions without N nucleotides which contained nucleotides that could not be unequivocally assigned to either coding end, were considered to be microhomologies. In cases where the nucleotide sequence between the VH and JH coding ends had the same number of matches with a DIR family member (17) or a D segment encoded on chromosome 15 and a conventional D segment, the latter was accepted as the D element used. Rearrangements using DD fusions, inverted, or DIR segments were excluded from the D segment RF analysis.
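The productive/nonproductive distinction can be approximated with a simple frame and stop-codon check. This is a simplified sketch: it assumes the input runs from the first base of codon 93 through the "C" that precedes "TGG GG" (so that an in-frame junction has a length divisible by three), and it scans the whole region for stop codons rather than only the D segment and junctions. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive(region):
    """True if the region keeps the reading frame and has no stop codon.

    `region` is assumed to start at codon 93 and end with the last base
    of the codon preceding the conserved TGG of the JH segment.
    """
    region = region.upper()
    if len(region) % 3 != 0:
        return False  # reading frame into JH is lost
    codons = (region[i:i + 3] for i in range(0, len(region), 3))
    return not any(codon in STOP_CODONS for codon in codons)


print(is_productive("GCGAGAGATTAC"))  # in frame, no stop codon
print(is_productive("GCGTAGGATTAC"))  # in frame, contains TAG
print(is_productive("GCGAGAGATTA"))   # out of frame
```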
Basic statistics
To determine significant differences in distributions in productive and nonproductive rearrangements, the χ2 test was used. Values of p ≤ 0.05 were assumed to be significant. The statistical significance between observed and expected frequencies in D genes and D RFs was calculated using the χ2 goodness-of-fit test. The Student t test was used to analyze CDR3H length; VH-JH distance; D segment match length; P and N nucleotides; and VH, D, and JH excision.
AR composition analysis of CDR3H and D-segment sequences
For this analysis, the first two amino acid residues (AR, codons 93–94 according to numbering by Kabat et al. (1)) and the last two (DX, codons 101–102) of the predicted CDR3H segment were not included to assess the nonrandom characteristics of the CDR3H. The total numbers of productive CDR3H, nonproductive CDR3H, productive D segments, and nonproductive D segments included in the analysis are 563, 75, 390, and 60, respectively. To compare these amino acid sequences to that expected from random chance, random sequences were generated using the Genetics Computer Group SAMPLE program (Wisconsin Package version 10.2; Accelrys, San Diego, CA). Using Swiss-Prot release 38.0 and a sampling rate of 18, three sets of 523 randomly sampled human sequences were obtained consisting of a productive CDR3H length of 12, nonproductive CDR3H length of 14, and a length of 5 for both productive and nonproductive D segments. Residue compositions were calculated by using the COMPOSITION program (Genetics Computer Group, Wisconsin Package version 10.2; Accelrys).
χ2 analysis was performed with the statistics program R. Distribution change of a residue was determined to be significant if its contribution was >5% of the total χ2 sum. Where necessary, residues K, R, H or N, Q, S were combined to perform a χ2 test properly.
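The 5% contribution rule can be made concrete with a few lines of arithmetic. The sketch below is written in Python for illustration (the analysis itself was performed in R), uses invented counts, and assumes that the expected counts have already been scaled to the same total as the observed counts.

```python
def chi_square_contributions(observed, expected):
    """Per-residue (O - E)^2 / E terms and their sum."""
    terms = {
        residue: (observed[residue] - expected[residue]) ** 2 / expected[residue]
        for residue in observed
    }
    return terms, sum(terms.values())


def significant_residues(observed, expected, share=0.05):
    """Residues whose term exceeds `share` of the total chi-square sum."""
    terms, total = chi_square_contributions(observed, expected)
    if total == 0:
        return []
    return sorted(r for r, term in terms.items() if term / total > share)


# Invented counts, for illustration only
observed = {"Y": 30, "G": 20, "K": 10}
expected = {"Y": 20, "G": 20, "K": 20}
print(significant_residues(observed, expected))
```

Pooling residues such as K, R, and H before the test, as described above, keeps the expected count in each category large enough for the χ2 approximation to hold.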
| Results |
|---|
|
|
|---|
Based on the Monte Carlo simulation, the minimal D segment match length required for identification was found to depend on the VH-JH distance (gl). Eight to 11 consecutively matching base pairs were necessary to identify a D segment with sufficiently high probability that it is unlikely to be from random chance (Table I).
|
JOINSOLVER used a consecutive matching algorithm rather than the typical alignment scoring system to identify D segments. To compare the results obtained from these approaches, JOINSOLVER and DNAPLOT (Centre for Protein Engineering, http://www.mrc-cpe.cam.ac.uk) were used to analyze a specific sequence, Z80389. Only the sequence flanking the CDR3H is shown (Fig. 1). The DNAPLOT method (Fig. 1B) selected the germline gene D6-25 on the basis of having a good overall match with a basic alignment search tool (BLAST) alignment score of 63. Nucleotides in the unknown sequence were identical to 15 of 18 nucleotides in the D6-25 germline sequence. However, consecutively matching nucleotides were interrupted twice by a single mismatch and the longest consecutive match consisted of only 7 nucleotides. JOINSOLVER identified a better match by applying the consecutively matching algorithm and limiting the search to the appropriate region by excluding any putative D segment alignment in regions previously identified as VH and JH segments. JOINSOLVER selected germline gene D2-2 as the best match, with 13 consecutively matching nucleotides within the 51 nucleotide VH-JH region.
|
To compare D segment alignment results from DNAPLOT and JOINSOLVER, we analyzed the D segments in 144 randomly selected unmutated and mutated VHDJH rearrangements (Z80363-Z80511). Within this subset, 74% of the sequences had zero to two mutations (98–100% VH germline homology) and 9% had more than two mutations (<97% VH germline homology). JOINSOLVER and DNAPLOT gave comparable D segment gene matches in 50% of the rearrangements (Table II). However, JOINSOLVER performed better than DNAPLOT for 22% of the rearrangements, either by finding a D segment with a longer consecutive nucleotide match than that found by DNAPLOT or because DNAPLOT failed to find any match for the D segment. Occasionally (5.5% of the rearrangements), the highest scoring D gene identified by JOINSOLVER was also found by DNAPLOT, but it appeared with the second highest overall matching score by DNAPLOT and, therefore, was not considered to be the best match. Overall, JOINSOLVER identified the D segment in 74% of the rearrangements in this subanalysis. Furthermore, JOINSOLVER generated some (albeit not significant) D segment matching for every sequence, whereas DNAPLOT failed to give any D segment alignment in 25% of the rearrangements. Importantly, in 47% of the rearrangements for which DNAPLOT failed to find a D segment match, JOINSOLVER was able to align the D segment.
|
In some sequences, more than one possible nonoverlapping D match was found in the CDR3H region. To determine whether these can be explained by random chance, a second Monte Carlo simulation was performed to examine the conditional probability of having a second match of length ml2 given that a first match of length ml1 exists. Identifying a second D match with high probability depends on both the VH-JH length (gl) and the length of the first match (ml1). When the VH-JH distance is greater than 26 nucleotides, a match of 9 nucleotides or more is necessary to identify a second D segment with confidence if the first match is 9 or more nucleotides. When the VH-JH distance is 26 nucleotides or less, matches of 8 nucleotides can be used to identify a second D match with confidence when the first match is 9 or more nucleotides. Finally, when the VH-JH distance is 18 nucleotides or less, matches of 7 nucleotides can be used to identify a second D match, when the first match is 9 or more nucleotides. When the VH-JH distance is 17 nucleotides or less, a first match of 8 nucleotides and a second match of 7 nucleotides is sufficient to identify two D segments.
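The thresholds in the preceding paragraph can be collected into a small lookup. The function name is hypothetical, and combinations that the text does not cover (for example, a first match of 8 nucleotides with a VH-JH distance above 17) return `None` rather than a guessed value.

```python
def min_second_match(gl, ml1):
    """Minimum length of a second D match accepted with confidence,
    given the VH-JH distance gl and the first match length ml1.

    Returns None where the text gives no threshold.
    """
    if ml1 >= 9:
        if gl <= 18:
            return 7
        if gl <= 26:
            return 8
        return 9
    if ml1 == 8 and gl <= 17:
        return 7
    return None


for gl, ml1 in [(30, 10), (26, 9), (18, 9), (17, 8), (20, 8)]:
    print(gl, ml1, min_second_match(gl, ml1))
```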
CDR3H length, VH-JH distance, and D segment match tend to be longer in nonproductive rearrangements
Initially, the JOINSOLVER program was used to analyze the lengths of the CDR3H, the VH-JH distance, and the length of the retained D segment in nonproductive and productive repertoires. As shown in Table III, the mean CDR3H length of the nonproductive rearrangements from all B cell subsets was 53.8 ± 1.9 bp, whereas the CDR3H of productive rearrangements was significantly shorter, a mean of 46.7 ± 0.5 bp, (p < 0.01). As with the CDR3H length, the nonproductive rearrangements have significantly (p < 0.01) longer VH-JH distances (36.2 ± 1.6 bp) than the productive ones (28.0 ± 0.4 bp). The same trend was observed when calculating the mean match length of the assigned D segments for all B cell groups (Table IV); the nonproductive repertoire had significantly (p < 0.01) longer consecutive matches (17.6 ± 0.7 bp) than the productive rearrangements (14.6 ± 0.2 bp).
|
|
We were able to identify 71.4% of the D segments in the 77 nonproductive rearrangements and 64.4% of the D segments in the 574 productive rearrangements (Table IV). For the remaining rearrangements, no D segments were identified because the consecutive D match length was either too short (10.5% of the nonproductive; 16.9% of the productive) or had frequent point mutations (8.3% of the nonproductive; 13.6% of the productive). In addition to the 27 D segments located in the H chain locus on chromosome 14, there are 10 D segments located on chromosome 15 (29, 30, 31, 32). When the rearrangements without a D segment match were analyzed, 3 were found to have a significant match with chromosome 15 D segments.
D segment usage
As shown in Fig. 2, the use of D segments is not random. In the nonproductive repertoire, 8 of 25 genes were used significantly more than expected from random chance. Moreover, a number of D segments were not detected in the nonproductive repertoire. Three of these (D1-14, D6-25, and D4-4) have mutations in the heptamer sequences that would be expected to limit recombination (21). In the productive repertoire, eight D segments were used more than expected from random chance. Notably, only two D segments were missing from the productive repertoire, presumably because they cannot undergo recombination effectively. Whether D4-4 is present in the productive repertoire cannot be determined because its sequence is identical to that of D4-11 (21). When the distribution of D segments in the nonproductive and productive repertoires was compared, evidence of both positive and negative selection was found. The use of two D segments was significantly greater in the productive repertoire, whereas the frequency of three D segments was significantly less, consistent with positive and negative selection of these gene segments, respectively.
|
DIR family members could be assigned in none of the nonproductive rearrangements and 1.1% of the productive rearrangements (Fig. 2). Notably, the frequency of usage of the DIR family members was significantly lower (p < 0.05) in both nonproductive and productive rearrangements than that expected from random chance. The use of inverted D segments is also absent in the nonproductive rearrangements (Fig. 2). However, inverted D segments were found significantly (p < 0.01) more often in the productive rearrangements (3.8%), suggesting they were positively selected.
Fusion of multiple D segments is negatively selected
Fig. 3 depicts the rearrangements with multiple D segments. This was an infrequent occurrence, but all the nonproductive rearrangements (n = 3) with multiple D segments were organized as VHD5'D3'JH, which reflects the normal sequence of recombination events. Of the productive rearrangements (n = 5) with putative multiple D segments, three were organized as VHD5'D3'JH. Of the other two, one (Z80631) had two identified D segments, which were organized as VHD3'D5'JH. The second had three identified D segments, of which the middle one originated from upstream of the 5' D segment but also 5' of the downstream D segment. Rearrangements with multiple D segment fusions were significantly less frequent in the productive repertoire (1% of total rearrangements) than in the nonproductive repertoire (3.9% of the total nonproductive rearrangements) (p < 0.01), suggesting that these rearrangements were negatively selected.
|
Different levels of stringency have been applied in the identification of D segments (8, 21, 25, 26). More strict criteria may increase specificity, but at the expense of sensitivity, whereas less strict criteria may do the reverse. We were interested in determining whether the criteria used in the current analysis inappropriately included or excluded D segments. To assess this, the distribution of D segment alignments that were found with different levels of stringency (0.05 vs 0.01) was analyzed in the smaller set of 144 random sequences mentioned above. As shown in Fig. 4, only a few (n = 8, 5.6%) identified D segments were lost when the stringency was increased from 0.05 to 0.01. Importantly, the overall distribution of identified D segments was very similar using either cut-off. Notably, using either criterion, D7-27, the shortest germline D segment, was absent from the rearrangements assessed in this subset. These data suggest that a stringency of 0.05 is sufficient to balance the needs for sensitivity and specificity in this biologic analysis.
|
Distribution of D segment RFs
D segment RFs have been categorized as those containing stop codons (largely RF1), those tending to encode hydrophilic amino acids (largely RF2), and those tending to encode hydrophobic amino acids (largely RF3). Indeed, 11 of the 25 D segments have a stop codon in RF1, whereas 7 of 25 have stop codons in RF2 and only 5 of 25 have stop codons in RF3. In the nonproductive repertoire, each of the D segment RFs was used at comparable frequencies (Table V). Notably, with few exceptions, D segment RFs with stop codons were used at the frequency expected in the nonproductive repertoire (Tables V and VI). In general, the use of RFs with stop codons was excluded from the productive repertoire except when the stop codon could be removed by exonucleolytic processing. When the use of RFs in the productive repertoire was analyzed, evidence for positive selection of a number of specific D gene segment RFs was noted. Thus, overall, RF2 was positively selected, whereas RF3 was not (Table VI). Specifically, RF2 was preferred in the productive repertoire by rearrangements using D2-2, D2-8, D2-15, D3-10, D3-16, and D3-22. In contrast, RF3 was preferred by rearrangements using D1-20, D1-26, D5-12 and D5-24, although the entire RF was not positively selected. Notably, no selection of rearrangements using RF1 was detected even though some D segments (D6-6, D6-13, D6-19) encoded hydrophilic amino acids in this RF. Despite this, analysis of the preferred RFs indicated that they were frequently more enriched in hydrophilic amino acids or glycine (D2-2, D2-15, D3-10, and D3-22), although this was not a uniform finding as noted above. Notably, RF1 was not positively selected even when the rearrangements lacking stop codons or enriched for hydrophilic amino acids were analyzed separately.
|
|
We next analyzed the amino acid composition of the CDR3H more completely to determine whether it differs from that expected from random chance. When the amino acid distribution in the D segments in the nonproductive rearrangements was analyzed, only Y, W, and V were used significantly (p < 0.05) more than random, and K, H, E, and R were used significantly (p < 0.05) less than random (Fig. 5). When comparing the amino acid composition of the D segments in the productive rearrangements with the random sequences, the only ARs used significantly (p < 0.05) more than expected from random chance were Y, W, G, and S, whereas P, L, K, Q, H, E, and R were used significantly (p < 0.05) less than random. Notably, S was the only residue that was both positively selected and used more than expected from random chance. Within the D segment, N was positively selected but not used more often than expected from random chance. All other residues that were used more frequently than expected from random chance were not positively selected. In general, the amino acid composition of the nonproductive rearrangements was closer to random than the productive ones (χ2 of 157 vs 885, respectively) (Fig. 5).
|
χ2 of 185 vs 1354, respectively, Fig. 5).
Preferential rearrangement of 5' D segments with 3' JH segments
The analysis of nonproductive rearrangements is a valuable mechanism to study the molecular events before Ag selection (34). Thus, we analyzed the combinatorial preferences of the nonproductive rearrangements to determine whether there was a bias for particular VHDJH rearrangements. From this analysis, there was no bias for a particular VHD pairing, because 5' or 3' D segments were indiscriminately paired with 5' or 3' VH segments (data not shown). Conversely, a significant bias (p < 0.01, χ2 test) could be observed in the pairing of D and JH segments, with 5' (JH distal) D segments coupled preferentially to 3' JH segments (Fig. 6). This bias was not found in productive rearrangements (data not shown).
|
The number of N nucleotides inserted between the VH and the D coding ends was similar to that inserted between the D and JH in both productive and nonproductive rearrangements (Table VII). Nevertheless, there were significantly more N nucleotide additions in the VHD and DJH junctions in the nonproductive than in the productive rearrangements. Although the presence of N nucleotides was the most common situation, there were some sequences that lacked N nucleotide additions at either the VHD (nonproductive 3.6%, productive 3.0%) or DJH junction (nonproductive 7.2%, productive 7.3%), even though there were no significant differences between nonproductive and productive rearrangements. Rearrangements lacked TdT activity significantly more often in the DJH junction than in the VHD junction (p < 0.05).
|
Exonuclease activity
The VH coding end had significantly (p < 0.01) less exonucleolytic excision when compared with the D and JH coding ends, both in the nonproductive and productive repertoires (Table VIII). D segment excision was similar in the nonproductive or productive repertoires, with more excision at the 3' end. The JH coding end was excised to a significantly (p < 0.01) greater degree in both nonproductive and productive rearrangements than the VH and D5' coding ends. P nucleotides were significantly (p < 0.01) more abundant in the VHD junctions than in the DJH junction in both nonproductive and productive rearrangements (Table IX). Palindromic (Pr) nucleotides in processed coding ends that could have developed from an overhanging hairpin intermediate structure (35) were not found at a greater frequency than expected from random chance (data not shown). As a result of the various modifications, the mean lengths of the VHD and DJH junctions were 10.2 ± 1.0 bp and 9.2 ± 0.9 bp in nonproductive rearrangements and 7.7 ± 0.3 bp and 7.1 ± 0.3 bp in productive rearrangements, respectively.
|
|
| Discussion |
|---|
|
|
|---|
The use of D segments was not random. Analysis of the nonproductive repertoire provided information concerning biased use of D segments during VHDJH recombination. Eight D segments (D2-2, D2-15, D3-3, D3-10, D3-22, D4-17, and D6-19) were significantly overrepresented in the nonproductive repertoire, suggesting that they were preferentially used during VHDJH recombination. The reasons for the preferential usage in the nonproductive repertoire are not clear as these segments are both long and short and scattered throughout the locus. Moreover, it is unlikely that the recombination signal sequences (RSS) play a major role as, for example D2-2 (overrepresented) and D2-8 (used at the expected frequency) have identical RSS and are the same length (20), but are used at markedly different frequencies.
Analysis of the productive repertoire indicated that a number of D segments were also overrepresented. Some of these, such as D3-3, D3-10, D3-22, D4-17, D6-13, and D6-19, were not positively selected but rather appeared frequently because of biased use during recombination with no subsequent evidence of negative selection. In contrast, D1-26 was overrepresented in the productive repertoire as a result of positive selection. Another D segment (D1-20) manifested evidence of positive selection, even though it did not appear more frequently than expected in the productive repertoire. Finally, a number of D segments (D2-2, D2-8, D2-21) were clearly negatively selected. The final distribution of D segments in the productive repertoire, therefore, results from biases introduced during VHDJH recombination and subsequently from positive and negative selection. The basis of these molecular and selective events is currently unresolved, but does not appear to relate solely to D segment length, RF bias, or position in the locus.
Germline D segments vary in length from 11 nucleotides (D7-27) to 37 nucleotides (D3-16). Because the length of the CDR3H appears to be restricted in the productive repertoire (46.7 ± 0.5 bp) and regulated by selection, the use of longer D segments may be limited, unless these are exposed to extensive exonuclease cleavage during recombination. Indeed, the finding that the length of the D segment after exonuclease cleavage is only 14.6 ± 0.2 nucleotides in the productive repertoire suggests that the length of the germline D segment plays little role in biasing the repertoire. In this regard, the longest D segment (D3-16) was positively selected along with a number of shorter segments, whereas some, but not all, long segments were negatively selected. Thus, it appears that the original length of the germline D segment does not play a crucial role in the selection of particular VHDJH rearrangements, which is likely related to subsequent exonucleolytic activity that reduces the size of the D segment.
It is notable that the apparent use of DIR family members and/or inverted D segments was identified in this set of rearrangements. Although both events were absent in the nonproductive repertoire implying that they were rarely used in rearrangements, they were more frequent in the productive repertoire suggesting that their use could contribute to diversity. There is controversy concerning the use of DIR segments and inverted D segments, with some studies reporting their use and others not (20, 23, 37). Notably, however, even in the stringent analysis of Corbett et al. (20) a low frequency (0.51% of rearrangements) used these elements. The bulk of the data support the conclusion that DIR family members and inverted D segments are used rarely in human VHDJH rearrangements as could be expected from the molecular constraints imposed on their use in recombination.
The presence of multiple D segments in a single rearrangement, i.e., VHDDJH, has also been a matter of controversy. The presence of such DD recombination violates the "12/23 rule", because it would disregard the strict sequential recognition by the recombination-activating gene (RAG) proteins of a 23-bp spacer associated RSS following a 12-bp spacer associated RSS (38). Whereas some studies provide evidence for the existence of DD fusions in both humans (8, 16, 18, 24, 39, 40, 41) and mice (9, 12, 13), other studies conclude that such multiple D recombinations do not occur or are infrequent events (20, 25, 42). A second Monte Carlo simulation was performed to assess the statistical probability of the existence of such multiple D segment recombinations. This analysis strongly implied that multiple D segment recombinations can occur in the human VHDJH repertoire, but their frequency does not appear to be as high as suggested by some previous reports (8, 10) nor as unlikely as concluded by others (20). Of the eight sequences shown in Fig. 3, four (Z80737, Z80727, Z80488, and Z80573) unequivocally contain two D segments even using the stringent criterion of Corbett (20). Moreover, in three other rearrangements (Z80372, Z80697, and Z80631), the likelihood that the second D match occurred by random chance ranged between 0.9 and 2.6%. Therefore, six of the eight sequences with the putative D-D fusions are likely to be authentic (two nonproductive, four productive). These results indicate that rearrangements using two D segments are uncommon but real. Because DD fusions appear more frequently in the nonproductive than in the productive repertoire, it is likely that such fusions producing longer CDR3H are negatively selected, possibly because they may distort the Ag binding site or encode autoantibodies (9, 13). As a result, the use of multiple D segments is unlikely to play a major role in contributing to diversity in the human VH repertoire.
It is notable that the frequent use of D-D fusions in the mouse has also been questioned (21). Moreover, in human B cells identified with an Ab to V-pre-B, the increased use of productive VHDJH rearrangements with D-D fusions that meet the current criteria (8/136 = 5.9%) indicates that this event is uncommon, even in this population (11).
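The logic of a Monte Carlo assessment of chance D matches can be illustrated with a minimal Python sketch. This is not the JOINSOLVER implementation; the function names, the uniform base composition, the number of trials, and the placeholder germline sequences are all assumptions made for illustration. The sketch estimates how often a random stretch of nucleotides of a given length contains a run of consecutive matches to any germline D segment at least as long as the observed match.

```python
import random

def longest_common_run(a: str, b: str) -> int:
    """Length of the longest stretch of consecutive identical nucleotides shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous position of both strings.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def chance_match_probability(germline, region_len, min_run, trials=10000, seed=0):
    """Fraction of random regions that match any germline segment over >= min_run consecutive bases.

    Assumes equal frequencies of A, C, G and T (a simplification).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        region = "".join(rng.choice("ACGT") for _ in range(region_len))
        if any(longest_common_run(region, d) >= min_run for d in germline):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    # Placeholder sequences for illustration only, not a verified germline D set.
    placeholder_d_segments = ["GGTATAACTGGAACGAC", "AGGATATTGTAGTGGTGGTAGCTGCTACTCC"]
    p = chance_match_probability(placeholder_d_segments, region_len=15, min_run=8, trials=2000)
    print(f"Estimated chance probability: {p:.4f}")
```

Under these assumptions, a second D match would be regarded as unlikely to be random when the estimated probability is small; the threshold to apply is a matter of judgment and is not specified by this sketch.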
D segment RFs are determined by the combined effect of exonuclease and TdT that remove or add nucleotides at the VH→DJH junction (43). Changes in RF alter the amino acid sequence, which in turn changes the hydropathic character of the CDR3H. Evidence for the preferential use of D segment RFs that encode hydrophilic amino acids has previously been presented (21). In the current study, we found that each of the RFs was used comparably in the nonproductive repertoire, implying that there was no combinatorial bias in their usage. However, the distribution of RFs in the productive repertoire was clearly not random, with RF2 overrepresented, RF1 underrepresented, and RF3 appearing at the expected frequency. The underrepresentation of RF1 relates to the more frequent presence of stop codons that preclude the appearance of these RFs in the productive repertoire unless the stop codon is removed by exonuclease activity. RF2 appears to be overrepresented in the productive repertoire because of the frequent presence of hydrophilic amino acids that are positively selected, such as can be found in D2-2, D2-8, D2-15, D3-9, D3-10, D3-16, and D3-22. Using either the method of Black and Mould (44), in which T, S, K, Q, N, H, E, D, and R are viewed as hydrophilic, or the Kyte and Doolittle (45) analysis, in which T, S, W, Y, K, Q, N, H, E, D, and R are viewed as hydrophilic, all of these D segments encode amino acids in RF2 that are more hydrophilic than those encoded by the other RFs without stop codons. The data are consistent with the conclusion that there is positive selection of D segment RFs that encode hydrophilic amino acids. It is notable that there was no evidence of positive selection of some D segment RFs (RF1 in D6-6, D6-13, D6-19) encoding hydrophilic amino acids. The explanation for this is not clear, but in two of them (D6-13 and D6-19) the Kyte and Doolittle (45) and Hopp/Woods (46) analyses identified different RFs as the most hydrophilic. Alternatively, the sparsity of glycine residues that may contribute to flexibility of the CDR3H (47) may prevent these D segment RFs from being positively selected.
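The comparison of reading frames by hydropathy can be sketched as follows. This is an illustrative Python example, not the analysis code used in the study: the function names are hypothetical, the frame numbering (frame n starts at nucleotide n of the supplied sequence) is an assumption that may differ from the RF convention used here, and the input sequence is an arbitrary example rather than a verified germline D segment. The scores are the published Kyte and Doolittle hydropathy values, in which more negative numbers indicate more hydrophilic residues.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Kyte-Doolittle hydropathy index (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def translate_frames(seq: str) -> dict:
    """Translate a nucleotide sequence in three forward frames; '*' marks a stop codon."""
    seq = seq.upper()
    frames = {}
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames[offset + 1] = "".join(CODON_TABLE[c] for c in codons)
    return frames

def mean_hydropathy(peptide: str):
    """Mean Kyte-Doolittle score of a peptide, ignoring stop codons; None if no residues."""
    scores = [KYTE_DOOLITTLE[aa] for aa in peptide if aa != "*"]
    return sum(scores) / len(scores) if scores else None

if __name__ == "__main__":
    example = "GGTATAACTGGAAC"  # arbitrary example sequence, assumed for illustration
    for frame, peptide in translate_frames(example).items():
        has_stop = "*" in peptide
        print(frame, peptide, mean_hydropathy(peptide), "stop" if has_stop else "open")
```

Applied to a set of germline D segments, a comparison of this kind would flag frames containing stop codons and rank the remaining frames by mean hydropathy, which is the grouping the paragraph above relies on.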
When the RFs are grouped into those with stop codons, those encoding hydrophilic amino acids, and those encoding hydrophobic amino acids according to the Kyte and Doolittle algorithm (45), no differences were noted in their usage by nonproductive rearrangements. However, a marked enrichment in productive rearrangements using the hydrophilic RF (60.3%) compared with the hydrophobic RF (32.9%) and the RF with stop codons (6.8%) was noted. Again, this result is consistent with the conclusion that there is positive selection of D segment RFs encoding hydrophilic amino acids.
Because of the evidence of positive selection of D segment RFs encoding hydrophilic amino acids, the overall hydropathicity of the CDR3H was analyzed. This indicated that of all the hydrophilic amino acids encoded by the D segment, only S and N were positively selected. The D segment contributed an overabundance of hydrophobic amino acids to the CDR3H as detected in the nonproductive repertoire, including W, some of which, such as I and L, were negatively selected. In addition, the D segment contributed an increased number of Y residues to the nonproductive repertoire, whose presence was not subsequently altered by selection. Therefore, despite the positive selection of D segment RFs encoding hydrophilic amino acids, the only hydrophilic amino acid encoded by this region that was both overrepresented and positively selected was S. Notably, this contribution was counteracted by the amino acids encoded by other portions of the CDR3H (VH, JH, and junctional diversity).
As previously reported (20, 42), the mean degree of TdT activity on the VHD and DJH junctions is similar in both productive and nonproductive repertoires. It is notable that there is a small, but significant, difference in the percentage of DJH junctions with no N additions compared with the percentage of VHD junctions with no N additions. The difference may be related to developmentally regulated levels of TdT expression, as revealed by less frequent junctional TdT activity in fetal and neonatal repertoires compared with the adult repertoire (25, 48, 49, 50). Thus, the increased number of N nucleotides in the VHD junction, which is formed after the DJH junction in rearranging B cells, could be related to a higher level of TdT activity. The exonucleolytic activity was greater on the DJH junction than on the VHD junction. This was especially notable when the degree of exonuclease processing of the VH and JH segments was compared and could relate to the primary sequence of these regions. The 5' coding ends of the JH segments are slightly more AT-rich, potentially making them preferential substrates for exonucleolytic processing (35, 51). The difference in the processing of VH and JH segments was reflected by the appearance of P nucleotides, which were more frequent in the VHD junction than in the DJH junction. Finally, it has been suggested that the presence of Pr nucleotides might reflect a second round of RAG-mediated cleavage and "hairpinning" (35). It is notable in the current study that the frequency of Pr nucleotides (35, 52) in any of the coding ends was not significantly different from the likelihood that specific consecutive nucleotides would be found by random chance. Because of this finding, we recalculated the frequency of Pr nucleotides in the original report (35) and found that the frequency of Pr overhangs was below the frequency of consecutive nucleotides occurring by random chance, even for the longer insertions.
Therefore, it is unlikely that Pr nucleotides play a role in the generation of junctional diversity. Moreover, it is unlikely that "rehairpinning" of the coding ends occurs during VHDJH rearrangement.
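The comparison with random chance can be made concrete with a short Python sketch. It rests on simplifying assumptions that are not taken from the study: the four bases are treated as equally likely, and a palindromic stretch is scored as the number of junctional nucleotides, counted outward from the coding end, that are complementary to the terminal nucleotides of that end. The exact definition of Pr nucleotides used in the cited report (35) is not reproduced here, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def palindromic_length(coding_end: str, insertion: str) -> int:
    """Number of inserted bases, read outward from the junction, that are
    complementary to the terminal bases of the coding end read inward."""
    n = 0
    for end_base, inserted_base in zip(reversed(coding_end.upper()), insertion.upper()):
        if COMPLEMENT[end_base] != inserted_base:
            break
        n += 1
    return n

def chance_probability(n: int) -> float:
    """Probability that n specific consecutive nucleotides occur by chance,
    assuming independent positions and equal base frequencies."""
    return 0.25 ** n

if __name__ == "__main__":
    # Arbitrary example sequences, assumed for illustration.
    length = palindromic_length("GCGAGAG", "CTCAAA")
    print(length, chance_probability(length))
```

Under these assumptions, an observed frequency of palindromic stretches of length n that does not exceed `chance_probability(n)` gives no support for a dedicated mechanism generating them.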
The formation of microhomology may influence the development of the CDR3H by constraining nucleolytic processing or preventing access of TdT to the coding ends (35, 53, 54). In the present analysis, a low frequency of microhomologies was observed. Notably, the frequency of microhomologies seems to be less than that found in fetal (49) or neonatal (25) rearrangements.
Analysis of the nonproductive rearrangements provides insight into the molecular mechanisms occurring before selection. By analyzing the nonproductive repertoire for VH, D, and JH segments, we were able to detect biases in the association of these genetic elements without the superimposed influence of selection. A significant bias was noted in the tendency for 5' D segments to rearrange with 3' JH segments without relation to the position of the VH gene. The most likely explanation for this finding is that there are multiple DJH rearrangements before VH→DJH rearrangement occurs and the rearrangement process ceases. The net result would be the tendency for 5' D segments to be found preferentially rearranged to 3' JH segments because the initial rearrangements would be deleted as the progressive rearrangement process proceeds. The finding that the distribution of VH genes is random suggests that this process occurs before VH→DJH rearrangement occurs. A similar process has been suggested to occur in the mouse (9, 55). Whether this process serves to increase diversity or rather merely reflects the persistent expression of RAG proteins and availability of the H chain locus during B cell development remains to be determined. The finding that the bias is lost in the productive repertoire, presumably as a result of selection, is more consistent with the latter interpretation.
The development of JOINSOLVER has permitted a detailed analysis of the human adult CDR3H and has facilitated the development of new insights into the molecular and selective mechanisms that underlie the generation of this Ag binding region of human Ig molecules. Importantly, this approach and the database generated should be of great value in determining abnormalities in individuals with immune disorders.
| Footnotes |
|---|
2 Address correspondence and reprint requests to Dr. Peter E. Lipsky, Intramural Research Program, National Institute of Arthritis and Musculoskeletal and Skin Diseases, 9000 Rockville Pike, Building 10, Room 9N228, Bethesda, MD 20892-1820. E-mail address: LipskyP{at}mail.nih.gov
3 Abbreviations used in this paper: CDR3H, H chain complementarity determining region 3; RF, reading frame; AR, amino acid residue; BLAST, basic local alignment search tool; Pr, palindromic nucleotide; RSS, recombination signal sequence; RAG, recombination-activating gene.
Received for publication October 16, 2003. Accepted for publication March 18, 2004.
| References |
|---|
chain transcripts expressed by individual tonsil plasma cells. Mol. Immunol. 38:19.
gene repertoire of human neonatal B cells. Mol. Immunol. 38:1113.
and C
repertoires. J. Immunol. 150:1348.