Abstract
Ag presentation on HLA molecules plays a central role in infectious diseases and tumor immunology. To date, large-scale identification of (neo-)Ags from DNA sequencing data has mainly relied on predictions. In parallel, mass spectrometry analysis of HLA peptidome is increasingly performed to directly detect peptides presented on HLA molecules. In this study, we use a novel unsupervised approach to assign mass spectrometry–based HLA peptidomics data to their cognate HLA molecules. We show that incorporation of deconvoluted HLA peptidomics data in ligand prediction algorithms can improve their accuracy for HLA alleles with few ligands in existing databases. The results of our computational analysis of large datasets of naturally processed HLA peptides, together with experimental validation and protein structure analysis, further reveal how HLA-binding motifs change with peptide length and predict new cooperative effects between distant residues in HLA-B07:02 ligands.
Introduction
Presentation of nonself-peptides on HLA molecules is key for T cell recognition of infected or malignant cells. Over the last 30 years, many studies have been devoted to the identification of viral peptides displayed on HLA molecules in diseases such as EBV or HIV (1, 2). In cancer, epitopes derived from cancer testis Ags have been shown to elicit T cell reactivity and have been used in different attempts to develop cancer vaccines (3). More recently, clinical studies in cancer immunotherapy have shown the importance of altered self-peptides resulting from missense mutations for T cell cancer immunity and response to checkpoint blockade (4–7).
With today’s technology it is still challenging to experimentally test for T cell reactivity against all possible antigenic peptides from either pathogens or cancer cells. Thus, two main strategies have been developed to prioritize the most likely candidates. In the first approach, predictions of HLA–peptide binding affinity are used to narrow down the list of potential epitopes. Such predictions use machine-learning algorithms that are trained on a set of known ligands (8–11). For frequent HLA class I alleles with many ligands in their training sets, high accuracy is reached in binding affinity predictions. Unfortunately, one of the main limitations of these predictions for (neo-)Ag identification comes from the very complex process of Ag presentation and T cell recognition (12). For instance, in tumors, only a small fraction of predicted nonself-peptides is typically validated in T cell immunogenicity assays (13, 14).
The second high-throughput strategy is based on direct identification of peptides presented on HLA molecules by mass spectrometry. Using either HLA class I– or class II–specific Abs, HLA–peptide complexes are first isolated from either transfected cell lines expressing a given HLA molecule of interest (15, 16), also as a soluble form, or cell lines or tissues expressing the endogenous HLA molecules. The peptides are then eluted and analyzed by liquid chromatography coupled to mass spectrometry (MS) (17). Recent studies using high-sensitivity MS have shown that, in addition to tens of thousands of self-peptides, several mutated peptides can be directly identified from tumor samples (13, 18).
A major challenge when analyzing naturally processed peptides detected by MS from unmodified cell lines or tissues comes from the presence of up to six different HLA class I alleles in a cell. One approach that was successfully applied to samples displaying frequent HLA alleles consists of using predictive models such as NetMHC (8, 10) to assign each peptide to its corresponding allele based on predicted binding affinity (19). Although this approach is efficient for samples with common HLA alleles, it is unclear whether it could perform equally well for other alleles for which predictions are less reliable. In parallel, recent works have suggested that HLA-binding motifs can be detected by simply clustering the peptides using a simulated annealing algorithm (i.e., Gibbs Clustering) (20) and without relying on HLA-binding prediction tools (21–23).
Recently, increasing amount of in-depth and highly accurate MS-based HLA peptidomics data is becoming available from cell lines and tissues. Therefore, there is a growing interest in using these data toward a better understanding of (neo-)Ag presentation and peptide–HLA interactions. In this study, we use a novel unsupervised computational approach to identify HLA-binding motifs based on the machine-learning framework of mixture models (24–26). Our work shows that HLA peptidome data can be incorporated to train predictors with improved accuracy compared with training only with publicly available binding data. Applying our approach to recent data of naturally processed HLA ligands (21, 27), we further show how HLA-binding motifs can change with peptide length for many alleles and confirm experimentally these predictions. Finally, our work predicts cooperative effects between amino acids at positions 3 and 6 in HLA-B07:02 ligands, which can be structurally rationalized.
Materials and Methods
Publicly available data
HLA peptidome data were retrieved from two recent high-quality MS studies with 1% false discovery rate (21, 27). HLA class I–binding peptides were first split according to their length. Peptides of length 9–12 were considered in this work.
In parallel, publicly available data about HLA–peptide interactions were retrieved from the Immune Epitope Database (IEDB, as of October 2015) (28) and interactions annotated as nonnegatives (i.e., positive-high, positive-intermediate, positive, or positive-low) with a specific HLA allele were used to build IEDB motifs and train predictors (see below). Both HLA–peptide binding data and naturally processed ligands (mainly from MS studies) were considered, unless otherwise specified. Notably, the IEDB version used does not include peptides from HLA peptidome studies analyzed in this work that would bias the study. Position weight matrices (PWMs) mathematically describing each allele’s binding motif were built by computing the frequency of each amino acid at each position, adding a random count of 1 at each position and each residue to account for low sampling. Sequence logos describing the motif of each allele (see Fig. 1A) were created with the LoLa software (http://baderlab.org/Software/LOLA).
Computational method for HLA peptidome deconvolution
PWMs have been widely used to represent peptide binding to HLA molecules and derive the corresponding motifs (29–31). As such, we anticipated that naturally processed HLA ligands that come from up to six different HLA molecules could be ideally modeled with a mixture of PWMs (24–26). In this probabilistic framework, a total log-likelihood function is defined as follows:where X represents the set of peptide ligands, L the length of the peptides, N the number of ligands, and K the number of motifs. Mk is the PWM representing the kth motif (with
the PWM entry corresponding to the lth residue in peptide Xn), and wk is the contribution of each motif in the total likelihood. Priors were included, as described previously (24, 25).
For a given K, the aim is to infer the matrices Mk and weights wk that optimally describe the data. However, it is important to stress that we do not make any assumption on these matrices (such as using a priori knowledge of HLA-binding motifs) and let the algorithm find them by itself. As such, our approach has the potential to overcome the limitations of existing computational models of HLA-binding specificity.
In the probabilistic framework described above, each peptide Xn is assigned a responsibility that quantifies how much each motif explains the observation Xn. For most peptides, responsibility values are close to 1 for one of the motifs and close to 0 for all others so that each peptide was assigned to a motif (i.e., cluster) based on the largest responsibility value.
The final number of motifs K was derived in two steps. First, the Kullback–Leibler divergence score introduced previously (20) was computed for each value of K between 1 and 8, and the largest value was selected. However, in practice we observed that many relevant motifs were missed. We therefore explored increasingly larger values of K and accepted them as long as exactly one new motif was detected at each step. Given a set of K motifs, the new motifs with K−>K + 1 were accepted if 1) there is no redundancy (minimal Euclidean distance larger than 1/(2L) between all K + 1 new motifs); 2) one of the K + 1 new motifs has a distance larger than 1/(2L) with all those identified at the previous step; 3) all the other K motifs have high similarity with at least one motif identified at the previous step [distance lower than 1/(4L)]; and 4) all motifs are supported by at least 100 peptides and >3% of the data for a given sample. Euclidean distance between two PWMs M and M’ was computed as .
The Gibbs Clustering approach of (20) was used as provided on the Web site (http://www.cbs.dtu.dk/services/GibbsCluster/) without performing alignment, with sequence weighting type set to none and default values for all other parameters.
Each motif predicted by the mixture model or the Gibbs Clustering algorithm was then compared with the motifs derived from IEDB data for the up to six known alleles of each sample. Motifs were assigned to the allele with the closest IEDB-derived motif using the Euclidean distance between the corresponding PWMs (see above). Ambiguous cases were retrieved as having a Euclidean distance to a second allele <1.2 times the distance with the most similar PWM (shown as logos between two columns in Supplemental Fig. 1). These cases were not used when training predictors. All results were manually verified. For a few alleles, no positive data are available in IEDB (HLA-B15:08, HLA-B52:01, HLA-C07:04, HLA-C02:02). In these cases, we randomly generated 100,000 peptides and used the top 2% of NetMHCpan (9) predictions to generate the motifs (see Supplemental Fig. 1).
Training and testing predictors
To investigate the effect of including data from naturally processed peptides in HLA ligand predictors and benchmark their performances, we retrieved all 9 mers from MS data that had been unambiguously assigned to a single allele. We then built our own predictors for each allele so that we could control exactly which data are included in training sets and testing sets. Two types of predictors were considered and trained either on IEDB data (9 mers) only or on both IEDB and HLA peptidome data. First, PWMs were built for HLA alleles based on the corresponding set of ligands. By construction, these predictors include only positive data, which makes them easier to interpret and compare. Second, neural networks were built using both positive and negative data. Negative data were first retrieved from IEDB. When including HLA peptidome data, additional negatives were randomly chosen from the human proteome so as to have 5 times more negatives than positives. Neural networks were implemented using the R package nnet with 10 hidden units. Each ligand position was encoded using a binary 20-dimensional vector with 1 at the position of the corresponding amino acids and 0 elsewhere. As such, 9-mer peptides are encoded as a 180-dimensional vector. For each allele, 10 different neural networks were trained with different seeds, and area under the receiver operating curve (AUC) values shown in Supplemental Fig. 2C, 2D correspond to the average AUC.
Data from SYFPEITHI (30) were used to evaluate the performances. In practice, all 9-mer ligands binding to alleles for which HLA peptidome data were available were used. Negative testing data were randomly generated from human proteins so as to have 5 times more negatives than positives. For all alleles with data both in the deconvoluted HLA peptidome samples and in SYFPEITHI, performance was evaluated with receiver operating curves, and area under the AUC values were computed. IEDB data that were also present in SYFPEITHI were systematically excluded from training to prevent circularity. Among HLA-C7 alleles, experimental ligands in SYFPEITHI are only available for HLA-C07:02. Therefore, all HLA-C7 alleles were grouped together because they are extremely similar in their binding motifs (29).
As a second validation, we performed a cross-validation on IEDB data. Peptide ligands from IEDB for each allele were split into five groups. Recursively, four groups were used to build PWMs, and the model was tested on peptides from the fifth group. Negative interactions were retrieved from IEDB, as before, excluding cases that are also annotated as positives. Average AUC values over the 5-fold cross-validation were compared for each allele when training only on IEDB data (standard 5-fold cross-validation) and when including the ligands predicted by our analysis of HLA peptidomics data to the training data used in the cross-validation (Fig. 2C, 2D). In this case, all alleles with HLA peptidomics data unambiguously mapped to them could be used, except for HLA-B15:18, which does not have reported ligands in IEDB.
Experimental testing of peptides binding to HLA-B07:02 and HLA-B08:01
Twelve peptides of length 9, 10, 11, and 12 were synthesized with arginine, glutamine, or alanine at position 3 at the Peptide Facility (University of Lausanne) with free N and C termini (2 mg each peptide, purity >80%). The peptide sequences were for 9 mers: SPXDPVLTL; for 10 mers: SPXDPVALTL; for 11 mers: SPXDPAVALTL; for 12 mers: SPXDPAVSALTL, with X = R, Q, or A. These sequences were selected to match closely the binding motif of HLA-B07:02, and therefore all peptides with R at position 3 were expected to bind strongly to HLA-B07:02. Binding was tested using a standard refolding assay (32). Synthetic peptides were incubated with denaturated HLA-B07:02 and biotinylated β2-microglobulin proteins at temperature = 4°C for 48 h. The solution was then incubated at 37°C, and samples were retrieved at time t = 0 h and t = 24 h. Stable complexes indicating interactions between HLA-B07:02 and the peptides were detected by ELISA. Negative controls consist of absence of peptides, whereas CMV pp65 417–426 (TPRVTGGGAM) was used for positive controls. Three replicates were performed for each peptide.
For HLA-B08:01, we tested one 9-mer (ELKEKYISL) and three 10-mer peptides (AELKEKYISL, ELKEAKYISL, ELKEKYIASL) that correspond to possible extensions of the 9-mer motif to 10-mer peptides. The negative control consists of absence of peptides, and FILDKSGSVL was used as a positive control.
Multiple specificity analysis of HLA-B07:02
Analysis of dependencies between residue positions among HLA-B07:02 peptide ligands predicted by HLA peptidome deconvolution and from IEDB (annotated as positive-high and excluding ligands identified by MS) was carried out with mutual information (MI). MI is an information theory measure of how much preferences of amino acids at one position A depend on the choice of amino acids at another position B: . Here, pA(i) stands for the probability of amino acid i at position A and pAB(i,j) for the joint probability of amino acid i at position A and amino acid j at position B. Ranking of MI values shows that position A = 3 and B = 6 display the highest MI value in all three datasets where this allele was present. Mutual exclusion of charged amino acids at these two positions can be measured by comparing positional frequencies, f3(+) and f6(+), with actual joint frequency, f36(+,+), of positively charged amino acids (i.e., arginine and lysine). Expected values for the joint frequency in Table I were calculated as f3(+) × f6(+)/N, where N is the total number of ligands detected by our approach. The p values were computed with Fisher’s test.
Peptides predicted as ligands of HLA-B07:02 in all three samples containing this allele were used as inputs for our mixture model with K = 3. Although these peptides are expected to bind to one allele, different motifs were observed (see Fig. 4B, Supplemental Fig. 4B). These results suggest that this allele is better described using multiple motifs, as it is the case in other peptide-binding proteins such as Src homology 3 domains (24).
Results
Unbiased deconvolution of HLA peptidome
Current MS analysis of naturally processed HLA peptides identifies pools of thousands of peptides that are presented on the different HLA molecule expressed in a given sample (21, 27, 33). However, to exploit the wealth of data provided by this high-throughput technology toward a better understanding of HLA-binding mechanisms, one should ideally deconvolute the contribution of each HLA allele. Peptides from each allele are often modeled with motifs, as shown in the logos of Fig. 1A. As such, we postulated that HLA peptidome data consist of a superposition of different motifs that represent the up-to-six class I alleles expressed in any given sample. From a computational point of view, this kind of superposition can be deconvoluted using the machine-learning framework of mixture models (25, 26). In this work, we implemented such an algorithm for the special case of unsupervised deconvolution of HLA class I peptidome (see Materials and Methods).
(A) Comparison of 9-mer motifs obtained on one hepatocellular carcinoma cell line (HCC1143) with the mixture model (first line), the Gibbs Clustering approach (second line), and corresponding motifs derived from IEDB data (binding data in the third line and elution data in the fourth line). Numbers below the logos indicate the number of peptides used to build each motif. Parentheses in the second line indicate the size of the intersections between the peptide clusters found by the two unsupervised deconvolution approaches. (B) Number of distinct alleles predicted by the mixture model and the Gibbs Clustering in each sample. Parentheses show the actual number of HLA alleles. (C) Total number of alleles that could be unambiguously detected. As expected, HLA-C alleles are less often seen because their expression is much lower and they tend to display overlapping binding motifs.
Fig. 1A shows the result of our approach on a hepatocellular carcinoma cell line with five HLA-I alleles (HCC1143) from (21). The mixture model naturally identifies five motifs. In this case, only four motifs were obtained with the Gibbs Clustering approach (20), but each motif displays a strong overlap with the results of the mixture model. This shows that unbiased deconvolution approaches tend to converge to similar solutions, which confirms the robustness of our approach. More importantly, comparison with known HLA motifs derived from publicly available data (Fig. 1A, third and fourth line) shows a striking similarity with those identified by the mixture model, although no information about HLA alleles was used to find the motifs in the HLA peptidomics data. Results for other cell lines’ HLA peptidomes are shown in Supplemental Fig. 1. The total number of predicted motifs in each sample analyzed in this study is shown in Fig. 1B for both the mixture model and the Gibbs Clustering, highlighting that often the same number of motifs is identified, with one additional motif detected by our approach in a few cases (see example in Fig. 1A). Interestingly, all motifs corresponding to A and B alleles are in general correctly identified, whereas motifs corresponding to C alleles often fail to be detected (Fig. 1C). The higher degeneracy and lower expression of HLA-C alleles (29) most likely prevent in some cases the correct identification of the corresponding motifs.
Training HLA–peptide interaction predictors
Large datasets of HLA class I ligands have been key to develop prediction algorithms, from sequence patterns (30) to PWMs (11) to more recent machine-learning approaches such as neural networks (8–10). For alleles with thousands of known ligands in databases such as IEDB (28), the prediction accuracy for binding is very good. Yet, many other alleles are still poorly described with only a handful of ligands, as shown for instance in Fig. 1A for HLA-B35:08 and HLA-B37:01. We therefore reasoned that large-scale HLA peptidomics data analyzed with our unsupervised approach may provide a unique training set to improve prediction accuracy for those alleles. To quantitatively show this possibility, we trained both linear predictors (PWMs) and neural networks on all ligands from IEDB with annotated target HLA molecules, including or not the new ligands derived from the deconvoluted HLA peptidomics data analyzed in this work (see 4Materials and Methods). Validation was carried out using the HLA ligands from SYFPEITHI, as is often done for HLA-binding prediction algorithms (8, 34) (see Materials and Methods). Fig. 2A shows the AUC values obtained with and without deconvoluted HLA peptidomics data for training. Although results are similar for many alleles, some alleles show higher AUC values when including the naturally processed HLA–peptide data. Importantly, the improvement in prediction accuracy was clearly linked to the size of the IEDB training sets (Fig. 2B). The trend is even stronger if MS data from IEDB are not included in training (see Supplemental Fig. 2A, 2B) or when training neural networks (see Supplemental Fig. 2C, 2D). These results confirm that alleles with less known ligands in IEDB are likely to benefit from recent HLA peptidomics data processed by computational approaches like the one used in this work. Similar results were obtained when performing a cross-validation in IEDB data (see Materials and Methods), including or not the deconvoluted HLA peptidome data analyzed in this work for training the models (Fig. 2C, 2D).
Training HLA interaction predictors with and without deconvoluted HLA peptidomics data for all alleles detected in this work. (A) AUC resulting from PWMs trained with only IEDB data (light gray) or both IEDB and HLA peptidome data (dark gray). SYFPEITHI data were used as a testing set. (B) Differences in AUC values from (A) as a function of the number of training data in IEDB. (C) AUC resulting from PWMs trained with only IEDB data (light gray) or both IEDB and HLA peptidome data (dark gray). AUC represent the average of the 5-fold cross-validation on IEDB data. (D) Differences in AUC values from (C) as a function of the number of training data in IEDB.
Influence of peptide length on binding motifs
Most HLA class I ligands in public databases are 9 or 10 mers. For this reason, our understanding of HLA-binding motifs comes mainly from peptides of these lengths. For this reason, existing HLA class I ligand prediction tools tend to focus on 9 and 10 mers, and only recently have peptides of different lengths been incorporated in training sets (8, 35). This is in stark contrast with HLA peptidomics data in which many longer peptides are routinely identified (15, 16, 21, 36). In this study, we take advantage of this unique property to explore how binding motifs change as a function of peptide length. For many alleles, changes in binding motifs simply correspond to longer regions between anchor residues (see some examples in Supplemental Fig. 3A). These events can be accurately modeled with gaps and insertion, as recently shown in (8, 37).
A number of alleles display more interesting changes. We first observed that the two alleles in our data with a clear anchor residue in the middle of the motif for 9 mers (HLA-B08:01 and HLA-B37:01) tend to lose this motif with longer peptides (see Fig. 3A, Supplemental Fig. 3A). Publicly available crystal structures of HLA molecules in complex with longer than 9-mer peptides and anchor residues at the second and last positions show that these peptides form a bulge outside of the HLA binding site that can accommodate the additional amino acids, as shown in Fig. 3B. For alleles without anchor residues at middle positions, the presence of the bulge may not dramatically affect the interactions mediated by the anchor residues and has mainly an entropic destabilizing effect. Reversely, in the presence of anchor residues in the middle of the peptide, longer peptides are much more likely to alter their conformation and thus strongly destabilize the interactions mediated by these anchor residues. This observation may explain the loss of binding motifs for longer peptides observed for alleles with anchor residue at position 5.
Influence of peptide length on HLA motifs. (A) Motifs showing anchor residues at middle positions for 9 mers in the HLA peptidomics data. For both alleles the motif is not found for 10 mers in the HLA peptidome data (see Supplemental Fig. 3). (B) Structural view of HLA binding site with peptides of different length: 9 mer (green) from Protein Data Bank: 4U1H (40) and 11 mer (orange) from Protein Data Bank: 3VCL (41). The bulge in middle positions for the longer peptide (orange line) is clearly visible, showing that the position of middle residues is less conserved from a structural point of view. The α2-helix has been truncated for visualization. (C) Length dependence of the preference for arginine at position 3 among the predicted ligands of HLA-B07:02. (D) Absorbance ratios from ELISA between peptides with Q (black) or A (dark gray) and those with R at position 3 at times 0 and 24 h. Error bars represent the SD of three independent experiments. At 24 h, 10-, 11-, and 12-mer peptides with A or Q at position 3 are almost fully dissociated, whereas those with R at position 3 show stable binding. (−) indicates negative control, and (+) indicates positive control.
To explore the generality of the loss of motifs in the presence of anchor residues at middle position in 9-mer peptides, we analyzed all IEDB data. Seven alleles with enough data to carry out motif analysis (i.e., at least eight ligands) show presence of anchor residues at middle positions (HLA-B08:01/02/03, HLA-B14:02/03, HLA-B37:01, HLA-B73:01) (Supplemental Fig. 3B). Interestingly, for 10-mer peptides, only one allele has at least eight ligands (HLA-B08:01) and the 10-mer motif is clearly distinct from the 9-mer motif (see Supplemental Fig. 3B). For 11 and 12 mers, most of these alleles have no ligands (Supplemental Fig. 3B). Although we cannot exclude that some of these alleles do not have longer ligands simply because none has been tested, these observations are consistent with a model of loss of binding motifs for longer peptides. To validate these predictions, we tested 10-mer peptides with alanine inserted between anchor residues of the HLA-B08:01 motif shown in Fig. 3A (AELKEKYISL, ELKEAKYISL, ELKEKYIASL). As expected, although the 9 mer (ELKEKYISL) shows stable binding, the 10 mers show significantly lower binding affinity (Supplemental Fig. 3C), confirming the loss of the 9-mer binding motif of HLA-B08:01 for longer peptides.
A second example of changes in binding motifs with peptide length is shown in Fig. 3C for HLA-B07:02 motifs derived from HLA peptidomics data from a B cell line (27). In this case, arginine (R) at position 3 appears to play an increasingly important role for longer peptides. To validate this observation, we synthesized and tested peptides of different lengths with arginine (R), glutamine (Q), or alanine (A) at position 3 (see Supplemental Table I). Results of a refolding assay (see Materials and Methods) show that all three 9-mer peptides display similar refolding efficiency (Fig. 3D). However, for longer peptides, the presence of arginine at position 3 resulted in much higher refolding efficiency, especially at time t = 24 h, confirming the importance of charged residues at position 3 for stabilizing longer peptides. Crystal structures of 9-mer ligands in complex with this allele show that the residue at position 3 can make direct contact with the negatively charged aspartate (D) at position 139 in the HLA binding site (see Fig. 3B) (Protein Data Bank: 1U4H). Although glutamine (Q) at position 3 is found in this crystal structure, charged complementarity suggests that arginine or lysine could also optimally interact with D139 with minor rearrangements of the peptide to accommodate these longer amino acids. Such rearrangements may be even more accessible with longer peptides that show flexibility between anchor residues. Our structural interpretation therefore suggests that positively charged amino acids at position 3 may be optimal to stabilize longer peptides (thereby counterbalancing the penalty of the bulge protruding outside of the binding site) by making favorable interactions with residue D139 in the binding site. As noted for other alleles (15, 16), we also observed higher frequency of small amino acids (A, C, G, P, S) and a lower frequency for large and bulky ones (F, H, K, R, Y, W, E, Q, M) at position P4 to PΩ-1 for longer peptides predicted to bind to HLA-B07:02 in the B cell line, although the trend was not observed for 9 mers in other samples containing HLA-B07:02 (Supplemental Fig. 4A).
Cooperative effects in HLA binding
Our unsupervised analysis of HLA peptidomics data suggests that polar or charged residues at position 3 play an important role in peptide interactions with HLA-B07:02. However, analysis of another crystal structure (Protein Data Bank: 5EO0) (38) of HLA-B07:02 in complex with a 9-mer peptide shows that hydrophobic amino acids (methionine in this case) can also be found at this position. This is consistent with the relatively low specificity at position 3 of the 9-mer logo shown in Fig. 3C. Interestingly, in this second crystal structure, a lysine (K) is found at position 6 of the ligand (see Fig. 4A, blue ligand). This lysine makes direct contact with D139 on the HLA molecule.
(A) Structural alignment of HLA-B0702 structures with two different ligands (TPQDLNTML–Protein Data Bank: 4U1H and RPMTFKGAL–Protein Data Bank: 5EO0) highlighting how residues either at position 3 or at position 6 can interact with D139. The α2-helix has been truncated for visualization. (B) Multiple specificity analysis of ligands of HLA-B07:02 found by HLA peptidome deconvolution of the B cell line (27). The preference for a charged residue at either position 3 or 6 is clearly visible (see also Table I).
To explore whether these two types of apparently mutually exclusive charged interactions with D139 correspond to a general principle, we took all ligands of HLA-B07:02 predicted in each of the three samples with HLA-B07:02 allele–B cell line from (27), JY and HCC1937 cell lines from (21). MI, which is a measure of dependencies of amino acids at one position with respect to those found at another position (24), was computed between all pairs of peptide positions (see Materials and Methods). Our results show that in all three samples, the highest MI value is found for the pair of positions 3 and 6. We then compared the frequency of charged amino acids (R or K) at positions 3, f3(+), and 6, f6(+), with the joint frequency of having charged residues at both positions, f36(+,+). In all cases, the actual joint frequency was much smaller than the expected one (see Table I). These results clearly suggest a cooperative effect in which positively charged amino acids are favored at either position 3 or 6, but not at both positions at the same time. We then applied our mixture model on these peptides asking for K = 3 motifs. In all cases, we can clearly see one motif with charged amino acids at position 3, one with charged amino acids at position 6, and a third motif with less specificity at positions 3 and 6 (see Fig. 4B, Supplemental Fig. 4B). These findings are fully in line with the two protein structures shown in Fig. 4A because amino acids at either position 3 or 6 can interact with D139. Altogether, our integrative analysis of HLA peptidomics data and available HLA-B07:02 structures predicts cooperative effects between distant residues along peptides binding to this HLA molecule.
Discussion
Detailed understanding of HLA–peptide interactions plays a central role in our ability to decipher fundamental properties of Ag presentation and predict HLA peptide ligands in infectious diseases and cancer. Much of our current knowledge comes from biochemical assays that have identified many HLA ligands and crystal structures that have revealed their general binding modes. In this study, we show that new insights can be gained by careful bioinformatics analysis of recent high-quality HLA peptidomics data. In particular, the very large number of ligands identified in these studies enables us to robustly predict new properties of HLA–peptide interactions using the power of statistics and machine-learning approaches, combined with analysis of publicly available crystal structures and experimental validation. Although several previous attempts used transfected cell lines or cell lines that naturally express only one HLA allele (15, 16, 39), our work shows that many insights can be gained even in the presence of multiple alleles, thereby bypassing the need to genetically manipulate cell lines and providing a proof of concept for the analysis of HLA peptidomics data from patient samples.
Importantly, akin to the Gibbs Clustering algorithm (20), our approach does not rely on predictions of peptide–HLA interactions. As such, it is not restricted to a subset of alleles (19). Moreover, it has the potential to improve existing HLA–peptide interaction prediction algorithms, as shown in Fig. 2. As our aim was not to provide yet another HLA ligand prediction tool, but rather to explore the use of deconvoluted HLA peptidomics data on prediction accuracy, we did not use existing predictors, but rather built our own using standard approaches (PWMs and neural networks). In this way, we could precisely investigate the impact of including additional ligands from HLA peptidomics data to train the model and ensure that testing data were not included in the training set. This would not have been possible with previously published algorithms.
As expected, any deconvolution approach works particularly well when HLA alleles expressed in a sample clearly differ in their binding motifs (e.g., when they belong to different supertypes; see Fig. 1A). In other cases, the deconvolution may not identify separate motifs for each allele, suggesting that analysis of naturally processed HLA peptides will not perform equally well on any sample (see for instance SupB15WT cell line in Supplemental Fig. 1). This is especially true for HLA-C alleles that are less specific and more degenerated (29). In addition, HLA-C alleles are often poorly expressed, which may explain why their motifs often fail to be detected. As such, it is expected that different samples will not perform equally well in the deconvolution, and in some cases the approach proposed in this work may not allow to distinguish all motifs. Despite these limitations, our results indicate that HLA peptidomics data, when they can be unambiguously assigned to their corresponding alleles, can improve HLA–peptide prediction algorithms (Fig. 2, Supplemental Fig. 2).
MS results can be biased toward charged peptides. This is visible for instance when comparing motifs of HLA-C04:01 derived from HLA peptidome data and IEDB binding data in Fig. 1A (D at P3), and the same trend is observed for the logo derived from binding data and the one derived from elution data in IEDB (Fig. 1A). Overall, visual inspection of the motifs predicted by our method (see Fig. 1A, Supplemental Fig. 1) shows that charged residues are often given a higher frequency, suggesting that one should consider these potential biases before interpreting differences in the motifs derived from in vitro HLA–peptide interaction data and naturally processed ligands to infer properties of Ag processing and presentation. Along these lines, it is likely that the motifs for HLA-B07:02 in Figs. 3 and 4 are enriched in charged peptides. As such, our results suggest that there is a preference for one positively charged residue at position either 3 or 6 especially among charged 9-mer peptides, which results in the multiple binding motifs shown in Fig. 4B. Importantly, a similar trend can be observed among HLA-B07:02 ligands annotated as positive-high that do not come from MS experiments in IEDB (p = 0.016; see Table I). Multiple motifs, such as those predicted for HLA-B07:02, were previously shown to reflect multiple binding modes in PDZ and Src homology 3 domains (24), which is consistent with the structural analysis performed in this work. Combining results shown in Figs. 3 and 4 also suggest that the important role of charged residues at position 3 for longer peptides may reflect the inability of charged residues at other positions (e.g., position 6) to interact with D139 due to the different conformation of these longer peptides. This example indicates that cooperative effects may have been underestimated in previous descriptions of HLA–peptide interactions, in which peptides are often described as linear epitopes. This may also help understanding why machine-learning approaches such as neural networks, which can model such cooperative effects, perform better than linear models (10, 11).
Other types of influences of peptide length on binding motifs have been previously reported, such as N- or C-terminal extensions (16, 36). Most alleles analyzed in this work (except those shown in Fig. 3A) seem to have extended motifs for longer peptides with conserved anchor residues at P2 and PΩ (see Supplemental Fig. 3). Yet, peptides were analyzed without aligning them, which may prevent observing cases on N- or C-terminal extensions in the motifs. We also anticipate that peptides of different lengths obtained from our analysis may be integrated in global models of HLA binding and presentation that would implicitly include information about both binding affinity and length biases due to proteasomal cleavage and Ag processing, as suggested previously (35).
MS analysis of HLA peptidomics data is increasingly performed in infectious diseases and cancer to detect (neo)-Ags. In this work, we show how high-throughput high-quality HLA peptidomics data can be exploited to refine our understanding of HLA-binding events and improve HLA ligand predictions. We anticipate that integrating HLA peptidomics data with machine-learning approaches like the one presented in this work will play an increasingly important role in future studies of Ag presentation on HLA molecules.
Disclosures
The authors have no financial conflicts of interest.
Acknowledgments
We acknowledge the work of Philippe Guillaume and Julien Schmidt at TCMetrix for experimental testing of HLA–peptide interactions, and Catherine Servis at the Peptide Facility of University of Lausanne for peptide synthesis. The computations were performed at the Vital-IT (http://www.vital-it.ch), a center for high-performance computing of the Swiss Institute of Bioinformatics.
Footnotes
↵1 D.G. should be considered as both first and last author.
This work was supported by the Ludwig Institute for Cancer Research (to D.G.), the Center for Advanced Modelling Science (to D.G.), and the Ludwig Institute for Cancer Research (to M.B.-S.).
The online version of this article contains supplemental material.
Abbreviations used in this article:
- AUC
- area under the receiver operating curve
- IEDB
- Immune Epitope Database
- MI
- mutual information
- MS
- mass spectrometry
- PWM
- position weight matrix.
- Received May 9, 2016.
- Accepted July 11, 2016.
- Copyright © 2016 by The American Association of Immunologists, Inc.