* Department of Biophysics, Kyoto University, Kyoto, Japan;
Theory NEC Laboratory, Real World Computing Partnership, Kawasaki, Japan; and
IBM Thomas J. Watson Research Center, Yorktown, NY 10598
| Abstract |
|---|
| Introduction |
|---|
Computational approaches have been introduced to extract not only position-specific information but also to capture some sequence patterns in peptides (16, 17). To attain satisfactory prediction accuracy, usually a large amount of experimental data are needed. To date, the source of MHC-binding peptides has been databases derived from the literature, in which peptide binding was determined using a variety of methods and criteria (2, 18). The lack of coherency of the reported binding values has limited the use of data to yes/no binary outputs. This, together with the limited number of data, has hampered the full exploitation of the existing databases.
A supervised learning algorithm based on hidden Markov models
(HMM; see footnote 4) was developed by Mamitsuka and has
been successfully applied to the analysis of the HLA-A*0201-binding
peptides from MHCPEP (18) with a binary output
(16). The effectiveness of this approach prompted us to
extend the method to predict the degree of peptide binding with real
number values. However, training such a learning algorithm in an
ordinary way would require a large number of binding data. For example,
for the 9-mer peptides that are most common among MHC class I ligands
in humans and mice, sequence variation reaches $20^9 \approx 5 \times 10^{11}$.
Given that only 1 or 2 peptides in every 200 random peptides bind with
high affinity to a given MHC class I molecule (7, 8), it
is essential to develop a disciplined way of
selecting a small number of informative peptides for experimentation,
from which the specificity of a given MHC molecule can be extracted
effectively. In the present study, we propose a novel method for
dynamic experimental design based on a query learning technique to
address this issue. The proposed method works by iterating a feedback
loop consisting of computational analysis and experimental
measurement.
| Materials and Methods |
|---|
The general scheme of the supervised learning algorithm for HMM has been described elsewhere (19). The algorithm is devised to make HMM, which is normally used to analyze sequence patterns, applicable for sequence data that are labeled with another variable. A peptide sequence with its log Kd value is a typical example for such an application. The purpose of the algorithm is to relate given sequences to the values of the corresponding label (log Kd).
Let H be an HMM and O be a training peptide. As shown in Fig. 1a, an HMM consists of a number of states, which are represented by circles. Each state bears a probability for every amino acid that varies from 0 to 1, and their sum over the 20 aa is 1 for each state. When a peptide sequence is given, say KLFGINMPL, the probabilities for K in the 1st state, L in the 2nd state, ... and L in the last state are multiplied. For the states in the six independent paths, this product is further multiplied by the probability that each path is taken, and the sum total over all six paths is obtained. The logarithm of this final probability value is defined as $L_{O|H}$ for a peptide O in a given H.
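The calculation of $L_{O|H}$ described above can be sketched in Python. This is a minimal illustration, not the implementation used in this study: it assumes six fully independent paths with one state per residue and ignores the shared variable state and the state transition probabilities of the actual model. The function names, the natural logarithm, and the uniform example model are assumptions made for the sketch.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def path_log_likelihood(peptide, emissions):
    """Sum of log emission probabilities along one path (one state per residue)."""
    return sum(math.log(state[aa]) for aa, state in zip(peptide, emissions))


def peptide_log_likelihood(peptide, paths):
    """L_{O|H}: log of the path-weighted sum of per-path emission products.

    `paths` is a list of (path_probability, emissions) pairs; `emissions`
    is a list of dicts mapping amino acid -> probability (hypothetical layout).
    """
    total = 0.0
    for path_prob, emissions in paths:
        total += path_prob * math.exp(path_log_likelihood(peptide, emissions))
    return math.log(total)


# Hypothetical model: six equally likely paths, uniform emission probabilities.
uniform_state = {aa: 1.0 / 20 for aa in AMINO_ACIDS}
uniform_paths = [(1.0 / 6, [uniform_state] * 9) for _ in range(6)]
score = peptide_log_likelihood("KLFGINMPL", uniform_paths)
```

With uniform probabilities every 9-mer receives the same score, 9 × log(1/20), which is why a uniform model is a natural reference point for scaling.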
The $L_{O|H}$ value depends on the size of a given H and O. Generally, the longer O is, the smaller the $L_{O|H}$ value. The size of O is fixed in the present problem as 9-mer. Therefore, to relate $L_{O|H}$ to the log Kd value, a constant, C, is defined as follows: $C = L_{O|H'} - L_{K_a}$, where $L_{K_a}$ is the average log Ka of all the input peptides and H' is the reference HMM having a uniform probability distribution. The limit of the binding measurement was set at -3 in log Kd. This was mainly due to the solubility of most peptides. For training purposes, peptides that exhibited no binding at this concentration were all labeled as -3.
... in a given H. The $L^{*}_{O|H}$ is related to log Kd as follows: $L^{*}_{O|H} = C + L^{O}_{K_a}$, where $L^{O}_{K_a}$ is the log Ka value of the peptide O.
A function $g_{O|H}$ is defined as follows:
![]() |
Function $g_{O|H}$ is introduced to find an optimal set of probability parameters for the states that gives the least discrepancy between $L_{O|H}$ and $L^{*}_{O|H}$. Note that when $D_{O|H}$ is close to zero, $g_{O|H}$ approaches 1. Thus, the goal of the algorithm is to seek the parameter set that gives the maximum $g_{O|H}$.
Finally, the probability parameters of H are trained by a gradient descent
algorithm so as to minimize the energy function, E, over all the input
peptides.
![]() |
2. Prediction of log Ka. When the HMM is optimally trained and settled with a set of probability parameters, training can be stopped by fixing the parameters, and the HMM can then be used to predict the log Kd for unknown peptides. The log Ka of a peptide O is estimated by $L_{O|H} - C$.
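The role of the constant C in training and prediction can be illustrated with a short Python sketch. It assumes that the reference HMM H' assigns probability (1/20)^9 to every 9-mer, that natural logarithms are used for the likelihood, and that the log Ka labels are supplied as plain numbers. The function names and the example binding values are hypothetical.

```python
import math

PEPTIDE_LENGTH = 9
N_AMINO_ACIDS = 20


def reference_log_likelihood(length=PEPTIDE_LENGTH):
    """L_{O|H'} for a uniform reference HMM: log of (1/20)^length."""
    return length * math.log(1.0 / N_AMINO_ACIDS)


def offset_constant(training_log_ka):
    """C = L_{O|H'} - (average log Ka of the input peptides)."""
    mean_log_ka = sum(training_log_ka) / len(training_log_ka)
    return reference_log_likelihood() - mean_log_ka


def target_log_likelihood(log_ka, c):
    """Training target: L*_{O|H} = C + log Ka of the peptide."""
    return c + log_ka


def predicted_log_ka(log_likelihood, c):
    """Prediction: log Ka is estimated by L_{O|H} - C."""
    return log_likelihood - c


# Made-up log Ka labels, for illustration only.
training_log_ka = [3.0, 5.5, 7.0]
c = offset_constant(training_log_ka)
target = target_log_likelihood(7.0, c)
roundtrip = predicted_log_ka(target, c)
```

Under these assumptions, a peptide scored exactly at the reference likelihood is predicted to have the average log Ka of the training set, and a perfectly trained model returns each training label unchanged.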
Peptides
Peptides were synthesized by fluorenylmethoxycarbonyl chemistry and purified by HPLC to a purity of >95%. Peptides were analyzed using a MALDI-TOF mass spectrometer (Voyager DE-RP; Applied Biosystems, Foster City, CA). The concentration of peptide was determined by the MicroBCA assay (Pierce, Rockford, IL) using BSA as standard. For rare peptides that react poorly with the bicinchoninic acid reagent, the quantitative ninhydrin reaction was used instead.
Peptide binding
Peptide binding was measured by a stabilization assay using TAP-deficient RMA-S cells (8). Briefly, $1 \times 10^5$ RMA-S cells incubated overnight at 26°C were mixed with graded concentrations of peptide in DMEM containing 0.25% BSA, incubated for 30 min at room temperature, and then exposed to 37°C for 70 min. The remaining Db molecules on the cell surface were stained with FITC-labeled anti-Db mAb (B22.249) and analyzed by flow cytometry using a FACScan (BD Biosciences, Tokyo, Japan). Differences between experiments were normalized by including the Db binding reference peptides in every assay (14).
| Results |
|---|
We use a query learning algorithm called query by bagging (Qbag),
the performance of which has been demonstrated to be superior in
several applications (20, 21). Qbag combines the advantage
of the query by committee method, which helps select the most
informative samples, and the averaging by bagging (bootstrap
aggregating) method, which helps reduce the part of the prediction
error that is attributable to variability in the sample. Qbag can be
thought of as a general scheme for obtaining a query learning method,
using an arbitrary supervised learning algorithm as a component. Here,
a supervised learning algorithm is an algorithm that estimates a
function F: X → Y, given labeled examples, or
pairs of the form (x, F(x)). A query learning
algorithm is one that estimates a function F, based on
labeled examples (x, F(x)) for x
values of its choice. Qbag enjoys the following advantages: 1) the
number of data (queries) required to attain a given predictive
performance has been shown to be minimal; 2) it can deal with data
having real number values (labels); 3) it places no restriction on the
type of learning algorithm as its component.
We chose as the component algorithm a supervised learning algorithm for HMMs. The HMM has been applied to multiple alignment, because it can capture a sequence pattern common to a given set of multiple sequences even if the pattern appears at different places in the sequences (22, 23). Other applications of HMMs include the prediction of protein structures (24) and open reading frame finding in the genome (25). HMM is chosen for the following reasons. 1) It can help visualize amino acid preferences and sequence patterns hidden in a given set of training peptides. This is not normally possible with an artificial neural network (ANN), for example. 2) The learning and predictive abilities of HMM are at least comparable with those of other learning models such as ANN, if not better. Because general unsupervised learning algorithms for HMMs can deal only with unlabeled data, we used the supervised learning algorithm developed by Mamitsuka (16) for the analysis of peptide sequences labeled with their binding values.
The HMM used in this study was designed as shown in Fig. 1a. The HMM consists of
states, represented by open circles that correspond roughly to amino
acid positions in this case. Each state bears symbol generation
probabilities for all the amino acids, specifying the probability that
each amino acid assumes that position. A parallel circular model was
designed that shares the variable state at the top. This design was
aimed at accommodating lateral shifts of subsequences by 1 aa. The
variable state can be occupied by the first or last amino acid. The
core states of 8 aa are connected by one-directional transitions. The
first and eighth states have alternative state transitions to and from
the variable state at the top. Six independent paths were designed in
parallel without interconnection between them. This simple model
exploits the fact that, for MHC class I-binding peptides, orientation
is fixed, and size variation is limited. When an HMM is trained with a
set of peptide-binding data, symbol generation probabilities and state
transition probabilities are optimized so that the log likelihood value
(predictive score) of a given peptide approximates its log
Ka as described in Materials and
Methods. We chose a mouse MHC class I molecule,
Db, as a model MHC molecule in this study.
Db predominantly binds 9-mer peptides.
Learning process
The Qbag-designed experiments proceed through the following six
steps (Fig. 2).
Step 3: generation of test peptides. To search the space of size $20^9$ for Db-binding peptides, 100,000 nonamer peptides were randomly generated for each cycle of training.
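The size of the search space and the random generation of nonamers in step 3 can be sketched as follows. This is only an illustration: it assumes that each of the 20 amino acids is drawn with equal probability at every position, and the seed, batch size, and function name are arbitrary choices for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PEPTIDE_LENGTH = 9

# Number of distinct 9-mers: 20^9 = 512,000,000,000 (about 5 x 10^11).
search_space = len(AMINO_ACIDS) ** PEPTIDE_LENGTH


def random_nonamers(count, rng):
    """Draw `count` random 9-mers, each residue uniform over the 20 amino acids."""
    return [
        "".join(rng.choices(AMINO_ACIDS, k=PEPTIDE_LENGTH)) for _ in range(count)
    ]


rng = random.Random(2002)  # arbitrary seed, for reproducibility of the sketch
batch = random_nonamers(1000, rng)  # the study generated 100,000 per cycle

# Fraction of the search space covered by one batch of 100,000 peptides.
fraction_per_cycle = 100_000 / search_space
```

One batch of 100,000 peptides covers only about 2 in 10 million of the possible nonamers, which is why the choice of which peptides to measure matters so much.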
Steps 4 and 5: judgment by the committee of HMMs.
The 50 independently trained HMMs from step 2 were each given the task
of predicting the binding ability of 100,000 random peptides. The peptide
for which the predictions made by the committee HMMs were most spread,
i.e., bore the largest variance, was selected. By seeking out the peptides
whose binding is hardest to predict, the most needed information can be
obtained efficiently. In effect, this method preferentially explores the
unexamined parts of the search space. By repeating steps 3–5 50 times,
50 such peptides were selected. Through these steps, ~5 million
peptides (~1 in 100,000 of the whole search space) were, in effect,
screened.
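The committee-based selection in steps 4 and 5 can be sketched in Python. This is a scaled-down illustration under explicit assumptions, not the procedure as implemented in this study: the component learner here is a toy position-specific averaging model standing in for the supervised HMM, the data pool is made up, and the committee and candidate set are much smaller than the 50 HMMs and 100,000 peptides described above. All function names are hypothetical.

```python
import random
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def bootstrap(data, rng):
    """Resample the data pool with replacement, keeping its size."""
    return [rng.choice(data) for _ in data]


def train_toy_model(sample):
    """Stand-in learner (NOT an HMM): mean label per (position, residue)."""
    sums, counts = {}, {}
    for peptide, label in sample:
        for key in enumerate(peptide):
            sums[key] = sums.get(key, 0.0) + label
            counts[key] = counts.get(key, 0) + 1
    overall = sum(label for _, label in sample) / len(sample)
    table = {key: sums[key] / counts[key] for key in sums}

    def predict(peptide):
        return sum(table.get(key, overall) for key in enumerate(peptide)) / len(peptide)

    return predict


def select_query(data, candidates, committee_size, rng):
    """Return the candidate on which the committee's predictions vary most."""
    committee = [train_toy_model(bootstrap(data, rng)) for _ in range(committee_size)]

    def disagreement(peptide):
        return statistics.pvariance([model(peptide) for model in committee])

    return max(candidates, key=disagreement)


def random_peptide(rng, length=9):
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


rng = random.Random(0)
pool = [(random_peptide(rng), rng.uniform(3.0, 8.0)) for _ in range(30)]  # made-up labels
candidates = [random_peptide(rng) for _ in range(200)]
query = select_query(pool, candidates, committee_size=10, rng=rng)
```

Repeating the selection with a fresh set of random candidates each time would yield one query peptide per repetition, mirroring the 50 repetitions described in the text.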
Step 6: measurement of Db binding and feedback to the database. Of the 50 selected peptides, 24–36 were arbitrarily chosen, synthesized in dozens, and subjected to measurement of Db binding. Peptide binding data are provided as supplemental material. The number of peptides examined in each training cycle varied for practical reasons, such as the inability to synthesize some peptides or the time allowed between experiments. A subset of the 50 peptides was chosen to omit peptides that would be hard to synthesize or unlikely to be soluble during the cell binding assay, e.g., FWLLLLLLL. The proportion of such peptides was 10% or less. Thus, selecting from the 40 candidate peptides should have been sufficient. On average, one or two of the 24–36 peptides could not be synthesized. Finally, the results of peptide binding were added to the data pool used in step 1.
Our past research demonstrated that the Qbag procedure, basically as used in the present study, substantially improved the prediction accuracy in various application domains (20, 21). This performance improvement is a combined effect of the query by committee aspect, which helps select the most informative samples, and the averaging by bagging aspect, which helps reduce the component of prediction error attributable to variance of a particular data set.
Also, the choice of the various parameters in the Qbag procedure
(M, T, T', R, I, D) is mostly a matter of
pragmatic decision, usually determined by the amount of computational
and human resources available. For example, the parameters T
and T' in Fig. 2 determine the number of subsamples of the
binding data; thus, the number of HMMs to obtain in each iteration was
determined based on past experimental evidence, which suggests that
performance tends to saturate on the order of 100 (26), and on the fact
that the computational time required to obtain each hypothesis is ~10 min.
The MHC stabilization assay used for measurement of peptide binding was
conducted as described before (14). Typical binding curves
are shown in Fig. 3a. To
obtain the log Kd values that
correspond to the peptide concentrations at half-maximal binding, the
binding curves were linearized as described in Materials and
Methods. X intercepts corresponding to the log
Kd values in Fig. 3b were
deduced by the least squares method. The relative, but not absolute,
binding abilities among peptides were consistent between experiments.
Therefore, three reference peptides of high, medium, and low binding
affinity were always included to normalize the raw binding data for
experimental variation.
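Deducing an x intercept by the least squares method can be sketched as follows. The specific transform used to linearize the binding curves is not reproduced in this section, so the sketch assumes the data points have already been linearized; the example values are made up and the function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def x_intercept(xs, ys):
    """x value at which the fitted line crosses y = 0."""
    slope, intercept = least_squares_line(xs, ys)
    return -intercept / slope


# Made-up, already linearized points lying on y = 2x + 10.
log_concentration = [-7.0, -6.0, -5.0, -4.0]
linearized_signal = [2 * x + 10 for x in log_concentration]
log_kd = x_intercept(log_concentration, linearized_signal)
```

In this toy example the fitted line crosses zero at a log concentration of -5, which would be read as the log Kd of the peptide.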
Sequence patterns seen in the HMMs after training
Unlike ANN, HMM has the advantage that its state transitions
generating sequence patterns can be examined after training. The
patterns in the two examples shown in Fig. 1, b and c, are in fact
quite interesting. Db has
the major anchor amino acids of N at position 5 (P5) and M or I at P9
(1). Preferences for these amino acids can more or less be
seen in the major paths (Fig. 1, b and c). We
tentatively designed the HMM with six parallel paths. In the HMMs
obtained as a result of training, there were typically two or three
major paths that peptides would take frequently (Fig. 1, b and c).
Thus, restricting the number of paths to six in the
model seemed to achieve computational efficiency with no significant
loss in representational power. The probability parameters in the HMM
itself represent the dynamic specificity of the MHC molecule. It would
be interesting to refer to this information to aid computer modeling in
the future. For instance, better binding variants may be designed by
replacing the MHC-binding residues with a pattern of sequence found in
the major paths. Interestingly, some of the patterns exhibit sequential
properties; e.g., the combination of A or S at P2 and N at P5 is
favored (Fig. 1, b and c), but Q is rather
favored for P2 only if P3 and P4 are MN (Fig. 1c). This type
of specificity can be better captured/visualized in HMMs than
positional scanning or ANN. It will be interesting to examine the
structural basis for such sequence-specific patterns by structural
analysis.
We designed the HMM as a cyclic model. This is effective for
accommodating some lateral slide of subsequences. However, considering
the critical role of the C-terminal amino acid (P9) (2)
and the substantial preference of the amino acid at P1
(7), a noncyclic model with entirely independent paths may
have been more appropriate. In retrospect, however, this did not appear
to have greatly affected the outcome, because prominent patterns at the
C terminus often ended at the end of the core parallel paths and had a
limited influence on P1 selection in the variable state at the top
(Fig. 1, b and c).
Comparison of the previous prediction methods
The 181 peptide binding data obtained during Qbag learning are of
random source, and they exhibit a wide range of binding abilities.
Thus, this data set provides an excellent test bed for comparing the
performance of existing methods of automated prediction. Although it is
not the focus of this study, such a direct comparison will be useful.
Fig. 4 shows the correlation between
predicted and actual binding for the library method (Lib) (Ref. 14 and
Fig. 4a), SYFPEITHI (Ref. 12 and Fig. 4b), and BIMAS (Ref. 4 and
Fig. 4c). All these methods are accessible online
(http://www.ddbj.nig.ac.jp/analysesp-e.html,
http://www.uni-tuebingen.de/uni/kxi/,
http://bimas.dcrt.nih.gov/molbio/hla_bind/index.html). These
three methods differ with respect to the source of peptides used to
probe the MHC molecules (synthetic peptide libraries for Lib, natural
peptide libraries for SYFPEITHI, and the amino acid-substituted
variants of binder peptides for BIMAS). Linear correlation can be seen
for all three methods. Linear correlation has been demonstrated
previously for Lib (6, 14). This is not surprising because
these methods are based on position-specific information and they all
assume the independence of amino acid positions. The expected log
Kd scales are deduced by linear
regression, and they are added as the second x-axes at the
bottom of Fig. 4. There are, however, substantial differences in the
accuracy of the prediction. The x-axes use different values
in the respective figures; therefore, standard error along the
y-axis (SEy) is used as a measure of assessment.
SEy becomes smaller in the order BIMAS, SYFPEITHI, and Lib.
The fitness of amino acids used for BIMAS is measured using a series of
variant peptides. Thus, the fitness values are likely to be influenced
by the sequence context of peptides, as reported previously
(8). These sequence-dependent properties are averaged out
by the use of the libraries in the cases of SYFPEITHI and Lib. Now
SYFPEITHI scores are given in integers, whereas Lib uses real number
values from the binding measurement. Unlike the natural library of
SYFPEITHI, equimolar representations of amino acids in the synthetic
libraries used in Lib, and its quantitatively more accurate
measurement, appear to be accountable for improved prediction over
SYFPEITHI. Next, we compared the predictive performance of Qbag with
Lib, the best performing method among the three.
To assess the effectiveness of the Qbag learning, a program for
prediction was constructed using 367 binding data described above (186
before learning, 181 during learning). First, 100 HMMs were loaded with
different initial parameters, and each of them was trained with a
subset of 367 binding data. Subsets of the binding data were obtained
by sampling the same number of peptides (367) as the size of the entire
data pool with replacement as described in steps 1 and 2. Increasing
the number of HMMs from 50 to 100 this time was due to upgrading of the
computational capacity. It took ~10 min to train 1 HMM with 367
peptides. Test peptides were scored by 100 HMMs, and the output score
was given by averaging the 100 predictions. The average value was used
as the final prediction, according to the theoretical and empirical
bases given by Breiman (26). We chose the number 100,
because past work has indicated that the prediction accuracy of
learning saturates on the order of 100 (26).
Predictive power of Qbag for random peptides
To make a fair judgment on an unbiased population, we compared the
predictive power of Qbag with that of Lib for random peptides. In the
past publications describing different computational methods for
predicting MHC-binding peptides, four measures, true/false positives
and true/false negatives, were often used as criteria for performance
evaluation. However, the same technique is not feasible for our
problem. We set out to explore the search space of
$20^9$ peptides where the chances to encounter
binders were rare (<1%). To select a feasible number of test peptides
with a variety of binding abilities, we must use some predictive
method. This selection process inevitably introduces a bias that
depends on the selection method used. Therefore, instead of the above
mentioned four-measure approach, we used the so-called precision-recall
curve analysis (for ease of understanding, in this paper we refer to it
as the precision-coverage curve [P-C curve]) for assessment (Fig. 5g).
The P-C curve is a
standard measure of assessment in the fields of information retrieval,
machine learning, and statistics, where a relatively small number of
relevant items (binder peptides, here) must be predicted from a huge
population. It essentially examines whether the ranking of the
predictions is in the correct order. Here, precision is the fraction of
actual binders among peptides that are predicted to bind. Coverage
(recall) is the fraction of peptides that are predicted to bind, among
all the binders. A low precision indicates many false positives,
whereas a low coverage indicates many false negatives. The two measures
are in general in conflict, given that higher precision can be attained
by sacrificing coverage and vice versa. (For example, by predicting
just the very few top ranked peptides as binders, high precision can be
achieved but coverage will be low.) Generally, a prediction method with
a P-C curve that lies above another has a better predictive
performance.
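The construction of a P-C curve from ranked predictions can be sketched as follows. This is an illustrative sketch only: the binary binder labels, the example scores, and the function name are assumptions, and in practice a binding threshold would have to be chosen to decide which measured peptides count as binders.

```python
def precision_coverage_curve(scores, is_binder):
    """Return (coverage, precision) pairs for every cutoff down the ranking.

    Peptides are ranked by predicted score, highest first. At each cutoff,
    the top-ranked peptides are treated as "predicted to bind".
    """
    total_binders = sum(is_binder)
    if total_binders == 0:
        raise ValueError("at least one actual binder is required")
    ranked = sorted(zip(scores, is_binder), key=lambda pair: pair[0], reverse=True)
    curve = []
    hits = 0
    for n_predicted, (_, binder) in enumerate(ranked, start=1):
        hits += binder
        coverage = hits / total_binders  # fraction of all binders recovered
        precision = hits / n_predicted   # fraction of predictions that bind
        curve.append((coverage, precision))
    return curve


# Made-up predictions for five peptides, two of which actually bind.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
is_binder = [True, False, True, False, False]
curve = precision_coverage_curve(scores, is_binder)
```

In this toy example, predicting only the top-ranked peptide gives a precision of 1.0 at a coverage of 0.5, whereas covering both binders requires accepting the top three peptides and lowers the precision to 2/3.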
... ~2.5% ranked peptides. This is a higher percentage than the ~1% estimated within the random peptide library (7, 8, 14), suggesting the presence of false positives. As shown in Fig. 5 ...
The relationship between the Lib and Qbag (Que) scores for the
Libhigh and Quehigh peptides is shown in Fig. 5, c and d. The actual
binding of these peptides indicates that the two methods are
complementary. The Libhigh peptides include many
peptides the Que scores of which predict lower binding (Fig. 5c).
These peptides tend to exhibit lower binding (Fig. 5, c and e). The
Quehigh peptides, in contrast, include peptides
that are predicted to be higher and lower binders by Lib. Peptides with
Lib scores that predict higher binding tend to bind better (Fig. 5,
d and f), whereas those with lower Lib scores
bind less (Fig. 5, d and f). Taken together,
Lib tends to overestimate the binding for a number of peptides. This is
consistent with the lower precision of Lib demonstrated by the P-C
analysis, which indicates a high rate of false positives.
This time, active training of HMMs was halted at 181 peptides but the
prediction should improve further if the training were continued. This
is an encouraging performance for a sequence analysis that explores the
entire random space with real number representation. In a previous
report of passive learning using published binding data, at least 1000
data can be estimated to be necessary for a prediction with
binary output to reach an error rate of 15% and a confidence level of
95% (9, 27). In another report, a simulation using
binding data that contained many binder peptides (59 of 329 for A24,
221 of 404 for B27, 72 of 285 for B35, with all A24 and B35 peptides
bearing major anchors) suggested that 350–500 binding data would be
necessary (Fig. 2 in Ref. 9) to derive reasonable matrix
models with binary output. Also, 300–400 data containing many binders
(61 of 317 peptides with IC50 < 500 nM; 273 of
463 with both major anchors) were used for ANN analysis by Gulukota et
al. (10). Because the above studies used part of the
binding data for training and the rest for assessment of the predictive
power, it is not clear how well these methods would predict random
peptides. Although direct comparison is difficult, it is encouraging
that Qbag can achieve the reported level of predictive performance in
real number prediction with a comparable number of training data. If we
had started from random peptides without the guidance of Qbag (such as
in Fig. 3d), it would have taken much longer to obtain a
sufficient number of informative data. It is rare to find binding data
containing >100 peptides for an MHC molecule in the current databases,
and passive analysis of published data alone would not meet future
demands.
Prediction of Db-binding peptides from proteins in the database
To test the predictive performance on peptides existing in the real world, we next chose protein sequences from GenBank that had been reported to be up-regulated in tumor cells. From 21 proteins, 14,071 overlapping 9-mer peptides were obtained and subjected to prediction by Qbag or Lib. No peptide was overlapping with the peptides used for training of the Qbag algorithm.
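Extracting overlapping 9-mers from a protein sequence amounts to sliding a window of length 9 along the sequence one residue at a time, so a protein of n residues yields n - 8 peptides. A minimal sketch follows; the function name, the example sequence, and the optional exclusion set for training peptides are assumptions made for illustration.

```python
def overlapping_peptides(protein, length=9, exclude=frozenset()):
    """Slide a window of `length` residues along `protein`, one residue at a time.

    Peptides listed in `exclude` (e.g., those already used for training)
    are skipped. A protein shorter than `length` yields an empty list.
    """
    windows = (protein[i:i + length] for i in range(len(protein) - length + 1))
    return [peptide for peptide in windows if peptide not in exclude]


# Made-up 12-residue sequence: 12 - 9 + 1 = 4 overlapping 9-mers.
peptides = overlapping_peptides("MKLFGINMPLAV")
filtered = overlapping_peptides("MKLFGINMPLAV", exclude={"KLFGINMPL"})
```

Each resulting peptide can then be scored by the prediction method under test.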
A positive overall correlation can be seen between the Lib and Qbag
scores (Fig. 6a). This
suggests that a large part of the binding energy is, in fact, supplied
additively by the independent binding of amino acids, which is captured
well by Lib. Note, however, that the points form an upward-pointing
triangular shape, indicating that Lib may tend to overestimate the
binding ability, as has been suggested for random peptides. In
contrast, Qbag can capture both position-specific preferences of amino
acids and sequence patterns and is therefore able to differentiate
false positives that are hard to distinguish by positional scanning
type methods.
... by the Lib and/or Qbag method
because binders, if identified, may have a clinical importance. For
this reason, selection of peptides was arbitrary, and it so happened
that the selected peptides were among the top 0.6% of ranked peptides
according to the Lib score, but 1.3% according to the Que score. Thus,
sampling was biased and not equal for the two methods. Therefore,
rather than conducting a comparative analysis of the type performed for
the random peptides, we examined the variance of predictions by the
respective methods and whether the two methods were complementary.
We first examined how prediction relates to actual binding. As seen in
Fig. 6, b and c, the standard error along the
y-axis is smaller for Qbag (SEy 0.75) than for
Lib (SEy 0.93). There are two outliers in Fig. 6b,
but SEy is still greater without them.
Sampling of the Quehigh peptides was more biased
than that of the Libhigh peptides.
Quehigh peptides with intermediate Lib scores
were underrepresented (Fig. 6a). If these points were
included, however, they would most likely settle in the middle of
Fig. 6c and would not greatly affect the above
interpretation.
We next examined how the prediction could be refined by combining the
two methods. The Lib scores and the Que scores are plotted for all the
peptides shown in Fig. 6, b and c (Fig. 6d). Closed circles represent
peptides for which predictions
by the Lib and Qbag methods were consistent. Although peptides that
scored high by only one method (open symbols) tend to bind more weakly than
peptides represented by the closed circles, they are still better
binders among peptides that are scored in the same range by the
alternative method. Thus, the open symbols (in e and f) tend to
settle below the line of correlation (i.e.,
better than expected binders). This suggests that the two methods can
be used complementarily, especially when the predictions by the two
methods differ considerably.
| Discussion |
|---|
Prediction of peptide binding in real number Kd values has a critical importance in strategic planning for antitumor immunotherapy (28, 29) as well as in understanding the pathogenesis of cryptic epitopes in autoimmune disorders (30, 31, 32). In the case of tumor-specific Ags, it often happens that peptides that have marginal affinities to the host's MHC molecules serve as tumor Ags, especially when the Ags are also expressed in the normal tissues (33, 34). Focusing the peptide search on the marginal affinity range is easily attainable with real number prediction.
The present study is the first attempt of its kind, and there is ample room for improvement regarding the number of training peptides and the choice and specific configurations of the component algorithm. Although training of the algorithm still requires costly materials and labor-intensive experiments, once it is done for an MHC molecule, the information obtained will be a shared resource that can be widely exploited forever. The large number of MHC alleles is another hurdle in compiling the comprehensive information that would cover most people. To achieve this goal, organizing a global collaboration would be necessary to exploit precious peptide resources and standardize the measurements.
Thus far, a major obstacle to computational analysis has been the insufficient number of binding data. Reaching the level of performance achieved here with active learning of 181 peptides (in addition to 186 initial peptides) by Qbag seems promising. One might argue that if these informative 181 peptide binding data were available, other algorithms might perform as well. This may be so, but the essence of this study lies not in the performance of HMM per se but in the fact that Qbag has enabled the selection of a set of most informative peptides for empirical measurements. We happened to use as many as 186 peptides for initial training in this study, but now that the proposed approach has been proved effective, in subsequent analyses it may be possible to start from a smaller number of peptides. In practice, the number of peptides for initial training is not necessarily a serious problem because peptides used for analysis of other MHC molecules can also serve the purpose.
We used HMM because of the advantages mentioned above, but other algorithms can well replace HMM. Qbag is a general method that can take any algorithm as its component. Which algorithm would serve best as the component is an open question not specifically addressed in the present study. We deposited all the peptide binding data as supplemental material. They should offer an excellent test bed for researchers to test their arsenal of algorithms. Comparing different approaches for computational prediction published to date is not easy at present. The previously developed methods mostly use binary output. Although there are a few algorithms dealing with real number predictions, in their evaluation the data pool was split; part of it was used for training and the rest for assessment (9, 10). Because the peptides used for assessment were mostly a biased population of peptides intended for the authors' own aims, the training peptides and those used for performance evaluation were more or less of a similar type. Comparing the performance on a common, unbiased population of peptides would be necessary for a fair judgment of competing prediction methods.
This time, we did not test peptides in the low affinity range. Instead,
we used the P-C curve and the measure of linear correlation between the
predicted and actual binding. The former indicated a better ranking
ability of Qbag, compared with Lib. Smaller variance of prediction
values with respect to the actual binding was also confirmed for Qbag
by linear regression analysis in Fig. 6, b and c.
We found that the Qbag and Lib methods are complementary to each other and that combining them enhances predictive performance beyond that of either method in isolation. It appears that with a limited number of binding data, although HMMs capture some sequence patterns of binding peptides, they do not capture as much position-specific information as Lib. This seems to indicate that the technique of randomizing the amino acid positions except for the position of interest, used in Lib, is in fact quite effective for extracting position-specific information (6, 7). Library scanning also requires a set of 172 sublibraries ($19 \times 9 + 1$, the extra one being a random reference library). However, the advantage of Lib lies in the fact that the same set can be used for any 9-mer binding MHC class I molecule. Therefore, for analysis of a new MHC molecule, combining the two methods and using a couple of hundred new peptides for the Qbag learning would be a good option.
| Acknowledgments |
|---|
| Footnotes |
|---|
2 Address correspondence and reprint requests to Dr. Keiko Udaka, Department of Biophysics, Kyoto University, Kyoto, 606-8502, Japan. E-mail address: keiko{at}molbio.biophys.kyoto-u.ac.jp
3 Current address: Institute for Chemical Research, Kyoto University, Uji, Kyoto, 611-0011, Japan.
4 Abbreviations used in this paper: HMM, hidden Markov model; Qbag, query by bagging method; ANN, artificial neural network; Lib, library method; P-x, position x; P-C curve, precision-coverage curve.
5 The on-line version of this article contains supplemental material.
Received for publication January 3, 2002. Accepted for publication September 24, 2002.
| References |
|---|