Abstract
Adaptive immunity is driven by the expansion, somatic hypermutation, and selection of B cell clones. Each clone is the progeny of a single B cell responding to Ag, with diversified Ig receptors. These receptors can now be profiled on a large scale by next-generation sequencing. Such data provide a window into the microevolutionary dynamics that drive successful immune responses and the dysregulation that occurs with aging or disease. Clonal relationships are not directly measured, but they must be computationally inferred from these sequencing data. Although several hierarchical clustering-based methods have been proposed, they vary in distance and linkage methods and have not yet been rigorously compared. In this study, we use a combination of human experimental and simulated data to characterize the performance of hierarchical clustering-based methods for partitioning sequences into clones. We find that single linkage clustering has high performance, with specificity, sensitivity, and positive predictive value all >99%, whereas other linkages result in a significant loss of sensitivity. Surprisingly, distance metrics that incorporate the biases of somatic hypermutation do not outperform simple Hamming distance. Although errors were more likely in sequences with short junctions, using the entire dataset to choose a single distance threshold for clustering is near optimal. Our results suggest that hierarchical clustering using single linkage with Hamming distance identifies clones with high confidence and provides a fully automated method for clonal grouping. The performance estimates we develop provide important context to interpret clonal analysis of repertoire sequencing data and allow for rigorous testing of other clonal grouping algorithms.
Introduction
The capacity of B cells to modify their Abs or Ig receptors to adapt in response to pathogenic challenges is a key mechanism that protects us from infection. An initial diversity of ∼107 unique Ig molecules (1) stems from somatic recombination of gene segments in the B cell Ig gene locus compounded by stochastic nucleotide insertions and deletions at the junctions of these segments. Upon activation, these naive B cells diversify further by undergoing clonal expansion with somatic hypermutation (SHM) in the Ig gene (approximately one point mutation per 1000 bp per cell division) (2, 3) followed by selection for higher affinity B cells. This microevolutionary process known as affinity maturation results in B cells with diversified Ig receptors that are clonal relatives of the original activated B cell. In healthy human adults, class-switched memory B cells express Ig receptors that are ∼7% mutated (4). Our ability to profile this adaptive immune response has dramatically improved through the application of next-generation sequencing, which allows for measurement of tens to hundreds of millions of B cell receptors (5). However, the identification of sequences that belong to the same B cell clone in these data remains a significant challenge (6).
Adaptive immune receptor repertoire sequencing (AIRR-Seq) is being widely used for both basic science and clinical studies (7–9). Statistical properties of the repertoire, such as diversity or mutational load, are being used to gain insights into the dysregulation that occurs with aging or disease (4, 8, 10–21). Properly identifying clones is central to the calculation of many of these properties. For example, clone size distributions are the basis for several diversity measures, such as species richness, Shannon entropy, and the Gini–Simpson index (6) that parallel diversity measures in ecology (22). Diseases such as chronic lymphocytic leukemia are characterized by low diversity that is driven by the dominance of a small number of clones (23), and repertoire sequencing has been used to improve minimal residual disease detection for lymphoid cancers (8, 10). Responses to drugs such as rituximab have also been measured by changes in repertoire diversity in autoimmune disease (11, 12), characterizing treatment regimens that lead to successful remission or result in persistent clonal expansions. Decreases in repertoire diversity have been associated with aging (4, 13, 14). In subjects with seasonal allergies, the IgE repertoire is the least diverse compared with other isotypes in blood and nasal biopsies, indicating a focused immune response (16).
Analysis of diversity within a clonal lineage also has several applications. Reconstruction of B cell clonal lineages using methods such as maximum parsimony or likelihood (24) allows tracing somatic mutations through the Ig sequences and helps in understanding the evolution of neutralizing Abs (17, 18). Lineage relationships have also been used to gain insight into the mechanisms underlying isotype switching (25, 26) and to show that B cell clones in the CNS in subjects with multiple sclerosis are first activated in the periphery (15). Identifying clones that include sequences with known Ag specificities has also been used to reveal novel Ag-specific sequences (19, 20). Thus, clonal partitioning of AIRR-Seq data is central to a wide range of applications.
Despite its importance, there is no consensus on the best method for grouping Ig sequences into B cell clones. Most current approaches leverage the high diversity of the junction region (i.e., where the V, D, and J gene segments join) as a “fingerprint” to identify each B cell clone (27). Because it is unlikely that two separate recombination events would lead to identical junctions, sequences with junction regions that are similar enough are determined to share a common B cell ancestor (i.e., be clonally related) rather than to have arisen independently. Probabilistic models have been developed to calculate likelihood of sharing a B cell ancestor and subsequently infer clonal grouping (28, 29). However, these algorithms have run times that scale exponentially, which is computationally intractable for large sequencing datasets (29). In practice, most studies cluster sequences based on junction sequence similarity (13, 15, 18, 21, 30–32).
Although many clustering approaches exist, hierarchical clustering is the most widely used framework for grouping clonally related sequences. Hierarchical clustering requires a measure of distance between pairs of sequences and a choice of linkage to define the distance between groups of sequences. Because hierarchical clustering produces a tree defining the relationships between all sequences, it is also necessary to specify a method to cut the hierarchy to identify discrete clonal groups. In practice, most studies first split the sequences using some similarity requirement on the germline gene segments (e.g., identical V and J gene segments, junction length) and then apply hierarchical clustering on the junction sequence of these smaller groups (8, 15, 18, 21, 30–34). Several distance metrics have been proposed, including Hamming distance, which is simply the absolute count of differences between two amino acid (31, 33) or nucleotide (21, 32) sequences, normalized edit distance (30), and a metric that incorporates hot-/cold-spot biases in SHM targeting (15, 18). In addition to metrics defining distance between two sequences, linkage methods define how distance is calculated between groups of sequences. Different clonal grouping algorithms use single (15, 18, 21, 32), average (30), or complete (13) linkage. The threshold at which the hierarchy is cut to define clusters of clonally related sequences has also been determined in several ways. Chen et al. (30) propose a fixed threshold that is manually identified based on when the rate of cluster merging events changes for a gold standard dataset (30). Glanville et al. (31) introduced a method based on the observed bimodal distribution of distances from each sequence to its nearest neighbor. In this case, the first mode is assumed to represent sequences with clonal relatives in the data (near neighbors), whereas the second mode is taken to represent sequences without clonal relatives in the data (distant neighbors). The threshold is then selected to be the value that separates the two modes of this distribution (21, 31). As of yet, there has not been an in-depth evaluation of performance of hierarchical clustering–based clonal grouping algorithms including a comparison of the different distance and linkage methods on AIRR-Seq data.
In this study, we carry out a comparative analysis of distance metrics and linkage methods for hierarchical clustering–based clonal grouping of IgH sequences. A combination of experimental and simulation-based criteria is used to evaluate the performance of these algorithms, including estimates of specificity, sensitivity, and positive predictive value (PPV). Overall, we find that single-linkage hierarchical clustering with nucleotide Hamming distance has excellent performance, with specificity, sensitivity, and PPV all >99%. Implementations of all clonal grouping methods, along with extensive documentation, are available through the Change-O and SHazaM packages (35) as part of the Immcantation tool suite (http://immcantation.readthedocs.io) for AIRR-Seq analysis.
Materials and Methods
Human BCR repertoire sequencing data
Three BCR repertoire sequencing datasets (Healthy, Dengue and West Nile virus [WNV]) were used to measure the performance of clonal grouping methods. The Healthy dataset was composed of sequences from PBMCs isolated from healthy adult subjects (n = 27) as previously described (14). Sequences were filtered as described in Wang et al. (14), including removal of non-IGH artifacts, sequences with V-gene insertion or deletions, and chimeric sequences. The Dengue dataset was composed of sequences from PBMCs isolated from subjects with acute dengue infection (n = 42) as described previously (36). Sequences were filtered as described in Parameswaran et al. (36), including removal of reads containing insertions and deletions as determined by alignment against the V and J germline repertoires. The WNV dataset was composed of sequences from PBMCs and sorted plasma, memory, and naive B cells isolated from subjects recently infected with WNV (n = 7) as previously described (18). Sequences were filtered as described in Tsioris et al. (18), including filtering based on sequence quality using pRESTO (37) version 0.4. In each case, processed sequencing data were obtained from the authors. Germline gene segments were inferred for each sequence by using IMGT/HighV-QUEST (38). The Healthy dataset was run through IMGT/HighV-QUEST on December 21, 2014, the Dengue dataset was run through IMGT/HighV-QUEST on March 12, 2015, and the WNV dataset was run through IMGT/HighV-QUEST on March 21, 2014. Sequences identified as nonfunctional by IMGT/HighV-QUEST were removed using the Change-O toolkit version 0.2.0 (35).
Two additional BCR repertoire sequencing datasets from healthy adult subjects were used as a source of naive BCR sequences for the lineage simulations. The first was composed of sequences from PBMCs and sorted naive B cells isolated from healthy control subjects (n = 4) as part of a study of myasthenia gravis described in Vander Heiden et al. (39). The second was composed of sequences from total RNA isolated from blood samples of healthy adult subjects (n = 3) as part of an influenza vaccination study described in Laserson et al. (40). In this case the samples, which were originally sequenced using Roche 454, were resequenced using Illumina MiSeq and published for the first time in this article (see details in the next two sections of the Materials and Methods). Identical sequences from the same sample were counted once, but identical sequences from different samples were counted independently. Both datasets are available on the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) under BioProject accession numbers PRJNA349143 (influenza vaccination study) and PRJNA338795 (myasthenia gravis study).
Library preparation and BCR sequencing of healthy subject sequences
The blood samples collected in the influenza vaccination study by Laserson et al. (40) were resequenced using the Illumina MiSeq platform as previously described (18, 41). Briefly, RNA was reverse transcribed into cDNA using a biotinylated oligo(dT) primer. An adaptor sequence was added to the 3′ end of all cDNA, which contains the Illumina P7 universal priming site and a 17-nt unique molecular identifier (UMI). Products were purified using streptavidin-coated magnetic beads followed by a primary PCR reaction using a pool of primers targeting the IGHA, IGHD, IGHE, IGHG, IGHM, IGKC, and IGLC regions, as well as a sample-indexed Illumina P7C7 primer. The Ig-specific primers contained tails corresponding to the Illumina P5 sequence. PCR products were then purified using AMPure XP beads. A secondary PCR was then performed to add the Illumina C5 clustering sequence to the end of the molecule containing the C region. The number of secondary PCR cycles was tailored to each sample to avoid entering plateau phase, as judged by a prior quantitative PCR analysis. Final products were purified, quantified with Agilent TapeStation, and pooled in equimolar proportions, followed by high-throughput paired-end sequencing on the Illumina MiSeq platform. For sequencing, the Illumina 600 cycle kit was used with the modifications that 325 cycles were used for read 1, six cycles for the index reads, 300 cycles for read 2, and a 10% PhiX spike-in to increase sequence diversity.
Read processing of healthy subject sequences
MiSeq reads from the influenza vaccination study by Laserson et al. (40) were demultiplexed using Illumina software. Positions with less than Phred quality 5 were masked with Ns. Isotype-specific primers and unique molecular barcodes (UMI) were identified in the amplicon and cut using pRESTO (37) MaskPrimers-score. Read 1 and read 2 consensus sequences were generated separately for each mRNA from reads grouped by UMI, which represent PCR replicates arising from a single initiating mRNA molecule. UMI read groups were aligned with MUSCLE (42), and pRESTO was used to construct a consensus sequence using BuildConsensus, requiring that ≥60% of called PCR primer sequences agree for the read group, maximum nucleotide diversity of 0.1, using majority rule on indel positions, and masking alignment columns with low posterior (consensus) quality. Paired-end consensus sequences were then stitched in two rounds. First, ungapped alignment of each read pair’s consensus sequence termini was optimized using a Z-score approximation and scored with a binomial p value as implemented in pRESTO AssemblePairs-align. For read pairs failing to stitch this way, stitching was attempted using the human BCR germline V exons to scaffold each read prior to stitching or gapped read-joining, using pRESTO AssemblePairs-reference. Positions with posterior consensus quality less than Phred 5 were masked again with Ns. All pRESTO tools used were version 0.5.1 in conjunction with Python 3.4. Germline gene segments were inferred using IgBLAST version 1.4.0 (43) with the IMGT/GENE-DB (44) reference sequences from June 7, 2014 and output was parsed with Change-O (35) MakeDb version 0.3.0. Duplicate sequences were collapsed and only those H chain sequences with at least two reads supporting the sequence were retained for further analysis.
Simulation of B cell clonal lineages
Each simulated clone was generated by introducing mutations into an experimentally observed naive BCR sequence according to an observed lineage tree topology (i.e., branching pattern). Lineage tree topologies were previously derived based on sequencing data from lymph node samples collected as part of a published study of multiple sclerosis (15). The set of 7,103, 4,066, 8,244, and 14,782 lineage topologies from four subjects (referred to in this article as R1, R2, R3, and R4, respectively) were each used as the basis for 10 simulations resulting in 40 total simulated datasets. To generate a simulated dataset, the root of each lineage was randomly chosen (without replacement) from a large pool of unmutated sequences from healthy subjects obtained from Vander Heiden et al. (39) and Laserson et al. (40) (described in Human BCR repertoire sequencing data). Mutations were then added to the sequence to match the experimentally observed mutation counts of each branch in the lineage tree according to the human S5F (hS5F) targeting model (45). The simulated sequences then had germline gene segments inferred using IgBLAST version 1.4.0 (43) with the IMGT/GENE-DB (44) reference sequences from May 2, 2016 and output was parsed with Change-O (35) MakeDb version 0.3.3.
Distance metrics
Hamming distance was defined as the absolute count of letter changes between nucleotide junction sequences or amino acid junction sequences. The 5-mer distance metrics were all based on the hS5F targeting and substitution models described in Yaari et al. (45), which estimate 1): the relative probability of a nucleotide position being targeted for somatic mutation, and 2) the probability of mutating to each of the three other possible nucleotides, based on the two nucleotides up- and downstream. This probability (p) was transformed into a distance (d) using the formula: d = −log10(p). The distance between two junction sequences was defined to be the sum of distances between each nucleotide position. For a given mutation between two junction sequences, the hS5F-min model took its distance to be the minimum of mutating from nucleotide n1 to n2 and from n2 to n1 at that position. The hS5F-avg model took the distance of the mutated position to be the average of mutating from n1 to n2 and from n2 to n1.
The human S1F (hS1F) model is equivalent to hS5F-min, but used the human symmetric substitution matrix based on the single mutated nucleotide described in Yaari et al. (45). The m1n model is equivalent to hS5F-min, but used the mouse symmetric substitution matrix based on the single mutated nucleotide described in Shapiro et al. (46). Both the human and mouse substitution matrices are four-by-four matrices whose rows correspond to the conditional probabilities of each of the four nucleotides to mutate to any of the other nucleotides. Hence, the diagonal of each matrix is 0, and each row sums to 1. To estimate distances between sequences based on each single nucleotide substitution matrix, we first approximated each matrix by a symmetric matrix both with respect to time (i.e., equal likelihood of mutating from nucleotide n1 to n2 and from n2 to n1), and with respect to strand (e.g., equal likelihood of mutating from C to T and from G to A). The symmetric matrix has three free parameters that were estimated by minimizing the sum of squares between this matrix and the original substitution matrix. The fitted matrix was normalized to ensure that each row sums to 1. Using the symmetric substitution matrix, a distance between two sequences was computed by going over all substitution events and summing over the logarithms of the corresponding conditional probabilities from the symmetric substitution matrix.
When normalizing by length for any distance metric, these distances were divided by the length of the junction region.
Implementation of clonal grouping algorithms
Clonal grouping algorithms were implemented and are made available in the Change-O toolkit (35) (version 0.3.1 or newer). Sequences were first grouped by shared V gene, J gene, and junction length. Within these groups of sequences, hierarchical clustering was performed using the bygroup subcommand of DefineClones.py with the specified distance metric and linkage type. The resulting hierarchy was then trimmed into flat clusters at a fixed threshold determined using an automated method based on analyzing the “distance-to-nearest” profile. For each sequence, the distance to its nearest unidentical junction was calculated using the SHazaM R package (35) (version 0.1.3 or newer). The ideal bandwidth for the fourth derivative kernel density estimate of these distances was then estimated using the unbiased cross-validation method (47) of the fourth derivative of the kernel density estimate (48, 49) from the kedd R package (50) (version 1.0.3). This bandwidth was used to calculate a binned kernel density estimate of the distances with a Gaussian kernel using the KernSmooth R package (version 2.23-15). The minimum between the two modes of the resulting bimodal distribution of distances was then calculated by finding the first value at which the first derivative was 0 while the second derivative was positive, indicating a local minimum following a local maximum. If such a minimum were not found, an error would have been returned, but this was not the case for any of the analyses in this study.
Specificity, PPV, and sensitivity
Performance was characterized by considering the binary classification task of defining the relationship between all pairs of sequences (s1 and s2) with the same junction length. These classifications were then pooled together for the entire dataset. When s1 and s2 were known to be unrelated (termed condition negative) but were grouped into the same cluster (termed test positive), this was counted as a false positive. When they were grouped into different clusters (termed test negative), this was a true negative. When s1 and s2 were known to be related (termed condition positive) but were grouped into different clusters, this was counted as a false negative. When they were grouped into the same cluster, this was counted as a true positive. These relationships are outlined in Fig. 1.
In the case of experimental data, two sequences were known to be unrelated when they were derived from two separate individuals. Therefore, false positives were defined as sequences from different individuals being grouped together in a clone, whereas true negatives were defined as sequences from different individuals that were grouped into separate clones. Specificity was then calculated by dividing the number of true negative classifications by the number of condition negative classifications. In other words, specificity was defined as the fraction of pairs of unrelated sequences that were successfully inferred by the algorithm to be unrelated.
For the simulated datasets, the precise clonal membership of each sequence was known, yielding the intuitive definition of false positive and false negative classification. PPV was calculated by dividing true positive classifications by test positive classifications. In other words, PPV was the fraction of predicted clonal relationships that were actually true. Sensitivity was calculated by dividing true positive classifications by condition positive classifications. In other words, sensitivity was defined to be the fraction of actual clonal relationships that were successfully inferred by the algorithm.
Shannon entropy calculation
The Shannon entropy of clonally related sequences (within clones) was calculated for true clones having at least two members. Entropy was calculated for each of the first 24 nucleotide positions of the junction within each clone and averaged across clones having junction length <30 nt and 51 nt. To calculate Shannon entropy of clonally unrelated sequences (between clones), the most mutated sequence was selected from each true clone. These mutated sequences were then placed into groups sharing the same V gene, J gene, and junction length. Groups with only one sequence were discarded. Entropy was calculated for each of the first 24 nucleotide positions of the junction within each group and averaged across groups having junction length <30 nt and 51 nt. Error bars represent SEM. The calculations were made on all of the simulated datasets pooled together.
Results
The problem of clonal grouping takes a set of BCR sequences as input and returns a partition of that set into subsets (clonal groups) that each represent an independent clonal lineage. In this study, we investigated hierarchical clustering-based algorithms that infer a dendrogram based on pairwise sequence distances and then cut the dendrogram at a fixed distance (or threshold) to predict groups of clonally related sequences. To evaluate the performance of these clonal grouping algorithms, we consider three metrics: specificity, sensitivity, and PPV (Fig. 1).
Overview of method for calculating the performance measures: sensitivity, PPV, and specificity. Each node represents a sequence that belongs to either of two clones (gray or white clone fill color). In the top left panel, sequences that are truly from the same clone are connected with solid lines. In the top middle panel, sequences that are inferred to be from the same clone are connected with dashed lines. Both true and inferred relationships are shown in the top right panel. In the bottom left panel, sequences that are clonally unrelated are connected by solid lines. In the bottom middle panel, sequences that are inferred to be clonally unrelated are connected by dashed lines. The overlap of true and inferred unrelated sequences is shown in the bottom right panel. The true and inferred edges in the right panels representing clonally related (top) or unrelated (bottom) sequences are compared in order to assess performance. Sensitivity is defined by the number of true positive edges (dashed solid double lines in the top right panel) divided by the number of true edges (solid lines in the top left panel). PPV is the number of true positive edges (dashed solid double lines in the top right panel) divided by the number of inferred edges (dashed lines in the top middle panel). Specificity is number of true negative edges (dashed solid double lines in the bottom right panel) divided by the number of true non-edges (solid lines in the bottom left panel).
Specificity quantifies how frequently unrelated sequences are correctly separated into different clonal groups. In most experimental datasets, the exact clonal relationships between sequences are unknown. However, to estimate specificity we can take advantage of the fact that B cell clones cannot span multiple individuals (by definition). Using this knowledge, specificity is defined based on how frequently sequences from separate individuals are incorrectly inferred to be clonal relatives. This measure is used to quantify performance on three human Ig AIRR-Seq datasets (referred to as Healthy, Dengue, and WNV as detailed in Materials and Methods), each of which contains samples from multiple individuals.
Sensitivity represents inclusivity of an algorithm by measuring how often clonally related sequences are grouped together. PPV is a complementary metric that quantifies the precision of an algorithm by measuring how often inferred clonal relatives are truly clonally related. The calculations for sensitivity and PPV require knowledge of true clonal relationships and thus cannot be estimated from current human experimental datasets. For these measures, performance was evaluated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (referred to as R1–R4 as detailed in Materials and Methods). In the following sections, we use these performance metrics on the human experimental and simulation data to evaluate the choice of distance metrics, linkage methods, and threshold parameters in clustering-based clonal grouping algorithms.
Automated determination of clonal distance thresholds
A key step in hierarchical clustering-based clonal grouping involved choosing a threshold at which to cut the dendrogram, thus forming discrete groups of clonally related sequences. In previous work (21, 31), this threshold has often been fixed at a single value determined by manual inspection of a histogram of nearest neighbor distances (the so-called distance-to-nearest plot) (6). These histograms are typically bimodal and the threshold is selected to separate these two modes (Fig. 2A). This choice is motivated by the intuition that the smaller peak represents the distance between sequences within a clone (intraclonal distance), whereas the larger peak represents the distance between sequences in different clones (interclonal distance). Inspection of the nearest neighbor distance distributions for the Healthy, Dengue, and WNV experimental datasets used in this study showed that they are each clearly bimodal. However, they differed in the values that best separated the two modes (Fig. 2B). This result indicates that the distance threshold for clonal grouping is dataset-specific and must be recomputed for each study.
Analysis of the distance-to-nearest neighbor plot to define the distance threshold for partitioning clones. For each sequence, the length-normalized nucleotide Hamming distance to every other sequence was calculated, and the nearest (non-zero) neighbor was identified. (A) The histogram of nearest neighbor distances for a simulated dataset was fit using a density estimation of the distribution (solid line), and this fitting was then used to automatically infer a threshold that separated the two modes of the distribution (dashed vertical line). (B) Nearest neighbor distributions were calculated for the Dengue (dashed line), Healthy (dotted line), and WNV (dash-dot line) experimental datasets. Inferred thresholds for each of these human datasets are indicated by the vertical lines.
Manual determination of the clustering threshold is problematic because inspecting a distribution by eye is time-consuming and imprecise. We therefore sought to develop an automated analytic procedure for inferring the clustering threshold that mimics the widely used manual approach. Because the histograms generated from real data are rarely smooth, we first smooth the empirical distributions using a binned Gaussian kernel density estimator using a procedure that is well suited for bimodal distributions (49) (see Materials and Methods for details). Next, we computationally determine the minimum between the two peaks of the smoothened distribution and define this value to be the clustering threshold (Fig. 2A). This method placed the threshold at intuitive locations in the Healthy, Dengue, and WNV experimental datasets (Fig. 2B). This method for automated determination of the clustering threshold enables efficient application of clonal grouping algorithms under many parameter settings and on many different datasets.
We next applied the automated threshold to assess the performance of clonal grouping methods on experimental and simulated datasets. Hierarchical clustering using the nucleotide-based Hamming distance metric with single linkage was an effective approach. The mean specificity of the algorithm was >99% on experimental data (Fig. 3A), the mean sensitivity was ∼99% on simulated datasets (Fig. 3B), and PPV was >99% on simulated datasets (Fig. 3C). In contrast, using amino acid–based Hamming distance, which has been used in some previous studies (31, 33), had significantly worse sensitivity (Supplemental Fig. 1). One potential shortcoming of using the Hamming distance metric is that mutations in short junctions are penalized more heavily than mutations in longer junctions. Because junction regions vary widely in length (33–81 nt, 95% range from experimental datasets) and the clustering algorithm uses a fixed threshold, this bias could lead to suboptimal performance. In an attempt to address this issue, we and others have used a length-normalized Hamming distance metric, in which Hamming distance is divided by the length of the junction. This length normalization had minimal effect on specificity in the experimental data (Fig. 3A), but it significantly improved sensitivity (Fig. 3B) and PPV (Fig. 3C) in the simulated data (p < 10−4, paired t test). Thus, length normalization of the distance metric is an important step in clonal grouping algorithms.
Length normalization of the distance measure increases performance. Single linkage hierarchical clustering was used to identify clonally related sequences using a distance metric based on the absolute Hamming distance of the junction sequences (None), or the Hamming distance normalized by the length of the junction (Length). (A) Specificity was calculated using three human experimental datasets (Healthy, Dengue, and WNV). Sensitivity (B) and PPV (C) were calculated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Bars indicate mean performance. *p < 0.0001 by paired t test.
Single linkage has highest sensitivity with minimal compromise of PPV
Hierarchical and other agglomerative clustering algorithms require a method for determining the distance between two sets of points (in this case, sequences). The most common linkage methods include single, average, and complete linkage (51). Single linkage defines the interset distance as the minimum distance between all pairs of points from the given sets. This generally results in larger and more heterogeneous clusters (51). Complete linkage defines the interset distance as the maximum distance between all pairs of points from the given sets, and generally results in smaller and more homogeneous clusters (51). Average linkage defines the interset distance as the average distance between all pairs of points from the given sets, thus providing a compromise between single and complete linkage.
As expected, single linkage had the lowest specificity followed by average and then complete linkage (Fig. 4A). However, these differences were small, and specificity was >99% in all cases. A similar ranking was found for PPV, with complete and average linkage significantly improving performance relative to single linkage (p < 10−4, paired t test; Fig. 4C). Once again, however, the absolute performance differences were small, with all three approaches exhibiting a mean PPV of >99%. As specificity and PPV both reflect the accuracy of clonal grouping, we conclude that all of the linkage methods are accurate. In contrast, single linkage exhibited significantly higher sensitivity for clonal grouping relative to both average and complete linkage (p < 10−4, paired t test; Fig. 4B). In this case, the sensitivity differences were large, with single linkage having a mean sensitivity of 99% compared with 88% for average linkage and 60% for complete linkage. Overall, these results show that single linkage is significantly better at capturing the breadth of true clonal relationships, with only a modest reduction in accuracy.
Single linkage clustering provides the highest sensitivity, with minimal loss of specificity or PPV. Hierarchical clustering was used to identify clonally related sequences using length-normalized Hamming distance and single (Single), average (Avg), or complete (Compl) linkage. (A) Specificity was calculated using three human experimental datasets (Healthy, Dengue, and WNV). Sensitivity (B) and PPV (C) were calculated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Bars indicate mean performance. *p < 0.0001 by paired t test.
Incorporating SHM biases does not significantly improve clonal grouping
Although the Hamming distance between two sequences is quick and easy to compute, it does not account for the intrinsic targeting and substitution biases in SHM (45). It is well established that activation-induced cytidine deaminase and the error-prone DNA repair pathways that drive B cell diversification frequently target specific DNA motifs (termed hot spots), whereas others are rarely mutated (termed cold spots). There is also a substitution bias such that transition mutations are significantly more frequent than transversions. Weighing all mutations equally undervalues the less probable mutations because two sequences are less likely to be part of a clone when they differ by mutations that occur less frequently (i.e., transversion mutations at cold-spot positions).
To test whether accounting for the intrinsic biases of SHM could improve the performance of clonal grouping algorithms, we implemented four previously proposed SHM models that account for these biases in different ways (see Materials and Methods for details). The first two models (hS5F-min and hS5F-avg) incorporate the hS5F targeting (mutability and substitution) models that incorporate the effects of the two nucleotides up- and downstream of a mutation (45). For each pair of nucleotides (n1 and n2) that differ between two junctions being compared, the hS5F-avg metric assumes that each one has an equal probability of having been present in the most recent common ancestor. Thus, the distance is taken as the average of mutating from n1 to n2 and from n2 to n1. The hS5F-min metric assumes that the ancestral base is the one that leads to the most likely mutation, and thus uses the minimum distance at each nucleotide position. The second two models (hS1F and m1n) ignore mutability, but they account for substitution bias using a model that depends only on the targeted base (i.e., ignoring surrounding nucleotides). As these models are symmetric, there is no assumption of which nucleotide was ancestral. Surprisingly, we found no significant performance differences for any of the distance metrics in experimental (Fig. 5A) or simulated datasets (Fig. 5B, 5C). These results support the use of the more efficient nucleotide Hamming distance metric. Overall, we find that hierarchical clustering using length-normalized nucleotide Hamming distance with single linkage performs well with mean sensitivity, specificity, and PPV all >99%.
Including hot- and cold-spot biases in the distance measure does not significantly impact the performance of clonal grouping. Single linkage hierarchical clustering was used to identify clonally related sequences using length-normalized nucleotide Hamming distance (ham), as well as other distance metrics that incorporated varying SHM biases as described in Materials and Methods: hS5F-min, m1n, hS1F, and hS5F-avg. (A) Specificity was calculated using three human experimental datasets (Healthy, Dengue, and WNV). Sensitivity (B) and PPV (C) were calculated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Bars indicate mean performance.
Sequences with short junctions have high false-positive rates
We next investigated the dependence of performance on junction length to better understand the source of errors in clonal grouping. Junction length was minimally correlated with specificity in the experimental datasets (r = 0.1, Pearson correlation; Fig. 6A). Similarly, there was no correlation of junction length with sensitivity in the simulated datasets (r = 0.02, Pearson correlation; Fig. 6B). In contrast, there was a strong positive correlation of junction length with PPV in the simulated datasets (r = 0.4, Pearson correlation), with a mean PPV of 99.1% for sequences with shorter junctions (junctions <30 nt represented by at least 0.001% of the sequences in the repertoire) compared with a mean PPV of 99.8% for sequences with longer junctions (Fig. 6C). For sequences with shorter junction lengths, the right peak in the nearest neighbor distance distributions (interpreted as distances between unrelated sequences) begins to overlap the left peak (interpreted as distances between clonally related sequences). This pattern of decreasing interclonal distances as junction lengths decrease was also apparent considering nearest neighbors across individuals (Fig. 7). Thus, it appears that the distance threshold that effectively separates clonal members with longer junctions begins to group together unrelated sequences with shorter junctions. These results raise the possibility that using a single distance threshold to separate clonal groups across the entire dataset may not be optimal for sequences with shorter junctions.
PPV is decreased among sequences with smaller junction lengths. Single linkage hierarchical clustering was used to identify clonally related sequences using length-normalized nucleotide Hamming distance. (A) Specificity was calculated using three human experimental datasets (Healthy, Dengue, and WNV). Sensitivity (B) and PPV (C) were calculated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Horizontal dashed lines are shown at an arbitrary value in each panel to highlight trends.
Peaks in the distance-to-nearest neighbor distribution begin to converge at small junction lengths. Nucleotide Hamming distance to nearest neighbor distributions were calculated for sequences with junctions of length 24, 39, 51, 69, and 81 nt from the Healthy experimental dataset. Nearest neighbors were defined using sequences within the same subject (dark gray bars), or by using sequences from all other subjects (light gray bars). The single distance threshold inferred using normalized Hamming distance on all junction lengths in the Healthy dataset is shown by the dashed line in each distribution.
To determine whether using multiple distance thresholds could improve performance, we assessed precision (PPV) and recall (sensitivity) across a range of distance thresholds using sequences of varying junction lengths. We selected the shortest junction length represented by at least 0.001% of total sequences (24 nt), the overall average junction length (51 nt), and the longest junction length with a distinguishable spread in precision across distance thresholds (81 nt) as example junction lengths with which to assess performance. When considering all junction lengths as one group, the automated threshold appears close to optimal in trading off between PPV and sensitivity, with both >99% (Fig. 8A). The same holds true when considering the average (51 nt; Fig. 8C) and longer junction lengths (81 nt; Fig. 8D). Interestingly, the single threshold chosen on the entire dataset still provided a near optimal tradeoff in performance for sequences with shorter (24 nt) junctions, although peak sensitivity was lower for some of the simulated repertoires (Fig. 8B). Thus, using a junction length–specific threshold is unlikely to improve performance.
A single distance threshold is near optimal for all junction lengths. Single linkage hierarchical clustering with length-normalized nucleotide Hamming distance was used to identify clonally related sequences in 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Precision-recall curves were generated by varying the distance threshold from 0.10 to 0.20 at intervals of 0.01. The precision (PPV) and recall (sensitivity) of each run were averaged across the 10 simulations of each repertoire in (A) sequences of all lengths and in sequences with junctions of length (B) 24, (C) 51, and (D) 81 nt. The performance of the algorithm run with the inferred threshold is shown by filled points in all panels. Insets show the same data zoomed in on the upper right of the plot.
The inability to separate unrelated sequences with shorter junction lengths implies a lack of diversity between clones. Indeed, sequences with short junctions had lower nucleotide diversity than did sequences with longer junctions (Fig. 9). In other words, unrelated sequences with short junctions were more similar on a per nucleotide basis than unrelated sequences with longer junctions. This difference in diversity was not spread evenly across the junction region, but only became apparent after the first ∼7 nt of the junction region, which are generally derived directly from the V gene segment (44). As expected, this was in contrast to nucleotide diversity within clones, which was low across all junction lengths (Fig. 9). Overall, these results show that sequences with shorter junctions have a lower diversity than expected (given their length), making it difficult to separate clonally related and unrelated sequences.
Unrelated sequences with shorter junctions have lower entropy per nucleotide. Shannon entropy was calculated for each of the first 24 junction nucleotides of clonally related (▴) and unrelated (▪) sequences with junctions <30 nt in length (solid lines) and 51 nt in length (dashed lines). Error bars are SEM entropy across all sequences in the given group.
Although sequences with longer junctions can be grouped into clones with high sensitivity and PPV, false-positive assignments are still present. One reason underlying these errors is the use of the IGHJ6 gene, which is overrepresented in false positives with junctions at least 30 nt in length (p < 10−3, χ2 test). The IGHJ6 gene extends an extra 10 nucleotides into the junction region relative to all other IGHJ genes (44), and clones that use this J gene would thus be more similar to each other than clones using other J genes.
Discussion
AIRR-Seq enables large-scale characterization of the Ig repertoire with the potential for significant basic science and clinical insights. Effective population-level analysis of these data often relies on first identifying groups of clonally related sequences. Although hierarchal clustering-based approaches are widely applied in current studies, estimates of their performance and the tradeoffs inherent in the choice of distance or linkage method are lacking. In this study, we carry out an in-depth comparison of hierarchical clustering-based clonal grouping algorithms using an automated analysis pipeline, along with experimental and simulated validation datasets. The analysis pipeline has three stages. First, sequences are separated by V gene, J gene, and junction length. Second, sequences in these groups are assembled into a hierarchy as defined by the distance metric and linkage method. Finally, the hierarchy is partitioned into discrete clones at a fixed distance threshold. Whereas previous applications of this framework relied on a manual process to choose the distance threshold, we minimized human imprecision by developing an automated method to select a customized threshold for any dataset based on analysis of the distance-to-nearest distribution.
Quantitative evaluation of the clonal grouping methods was based on a combination of human experimental and simulated IgH data using the common performance measures of specificity, sensitivity, and PPV (29, 30, 34). Experimental data were used to estimate specificity based on the fact that, by definition, B cell clones cannot span different individuals. Sensitivity and related measures such as PPV cannot be estimated from human experimental data, because current approaches do not allow unequivocal identification of members of a clone. However, some murine model systems now allow identification of individual clones through Brainbow color labeling of individual B cells prior to affinity maturation (52). In this study, we used simulated data (where all clonal relationships are known explicitly) to estimate sensitivity and PPV along with specificity. Simulations have previously been used to validate performance of a probabilistic clonal grouping algorithm (29) and to benchmark other repertoire analysis tools (53, 54). Unlike previous approaches, our validation framework does not rely on an underlying model of clonal expansion and affinity maturation. Rather, the lineage tree topologies are taken from experimental datasets and are overlaid with new root sequences and somatic mutation patterns to more closely mimic observed repertoire structures. Furthermore, sensitivity and PPV were calculated on the dataset as a whole, which is less biased by clone size than the per-read averages calculated in previous studies (29).
Hierarchical clustering methods relate sequences to each other, but they do not split sequences into discrete clusters. Thus, a critical step in clonal grouping based on hierarchical clustering is determining the threshold used to partition the sequences into clusters (each representing a single clone). Previous approaches used a fixed threshold such as 80% nucleotide similarity (32), or relied on manual inspection of the distance-to-nearest plot to generate a study-specific threshold (31). Distance-to-nearest plots are generally bimodal, with the two peaks interpreted as clonally related (small distance peak) and unrelated (larger distance peak) sequences. Previously the clustering threshold has been determined manually by looking for the distance that best separates these two peaks. However, this process is time-consuming, can be subjective, and there are often multiple possible thresholds that provide equivalent separation between the peaks. To minimize human bias and to enable rapid evaluation of a range of parameter choices for this study, we developed an automated method to mimic the manual approach. Other general methods that have been used for determining the number of clusters in other types of data include the silhouette (55) or V-fold cross validation (56), but these require many rounds of clustering for optimization and are computationally intractable for the large size of AIRR-Seq datasets. The gap statistic (57) is also not applicable because it requires a null distribution of expected within-cluster dispersion, which is unknown and would require several assumptions to simulate for Ig sequences. Thus, the automated threshold inference based on the distance-to-nearest plot proved most feasible for the data type and is supported by biological intuition.
Hierarchical clustering is an agglomerative (or bottom up) method. Each sequence starts as its own cluster, and the closest pair of clusters is merged together until all sequences are connected. Closeness is defined by a distance metric. Many previous studies used Hamming distance (21, 32), which simply counts the number of differences between two junction sequences. Others attempted to incorporate the intrinsic biases of SHM to account for the presence of hot and cold spots (15, 18). In the present study, we found that incorporating the targeting and substitution biases of SHM into the distance metric did not significantly improve performance compared with nucleotide Hamming distance. It is possible that more sophisticated distance measures could play a more important role under conditions different from those investigated in this study. For example, when the mutation frequency is low, a different metric may better capture the importance of each individual mutation in determining the separation between clones. However, the current results suggest that the additional assumptions and computational cost of more complex distance metrics are unlikely to provide substantial performance improvements.
Although distance is measured between pairs of individual sequences, the linkage method defines how to calculate the closeness between clusters that contain multiple sequences. We evaluated the tradeoffs in the most common linkage methods: single, average, and complete. Single linkage is generally considered to be the most inclusive, and we found that it provides the best overall performance with specificity, sensitivity, and PPV all >99%. However, the appropriate choice of linkage may depend on the biological question being addressed. Complete linkage offers a higher PPV, but at the cost of a significant loss of sensitivity. This may be appropriate for research questions that are highly dependent on the accuracy of calling sequences as part of the same clone. For example, studies that attempt to link small numbers of Ag-specific sequences with clonal relatives or establish migration patterns between compartments with infrequent overlaps may benefit from the high confidence in each clonal connection provided by complete linkage. Nevertheless, the high absolute performance of single linkage should be acceptable for most studies.
The specificity, sensitivity, and PPV of single linkage clustering with Hamming distance are all >99%. However, the errors that are made by this algorithm are not random. We found that Ig sequences incorrectly grouped together as clonally related had disproportionately short junction regions (defined in this article as <30 nt). Because the V gene extends into the junction region by approximately seven nucleotides (44), a higher fraction of the nucleotides in short junctions would be expected to have limited diversity compared with longer junctions, potentially limiting the ability to distinguish between clones. This could be particularly problematic when using a length-normalized distance metric, which we showed was critical to achieve acceptable specificity. However, our analysis showed that the problem went beyond the V segment constituting a higher fraction of junction nucleotides. Clonally unrelated sequences with short junctions have less entropy on a per nucleotide basis compared with similar sequences with longer junctions. This lack of interclonal diversity could be due to a lower mutation frequency, the use of a restricted set of D genes, or fewer untemplated nucleotide additions between the germline gene segments. However, the entropy of clonally related sequences was comparable between short and longer junctions, suggesting that a uniformly lower mutation frequency is not responsible for the lower diversity in short junctions. Current algorithms for inferring germline gene segments still struggle with inference of the D gene (58), making it difficult to determine whether the underlying cause of low diversity is due to D gene usage bias, fewer untemplated nucleotide additions, or another mechanism.
Despite the diversity differences between shorter and longer junctions, using a separate threshold to partition sequences with different junction lengths did not improve performance. As precision-recall curves showed, the single threshold selected by analyzing the entire dataset as a whole almost always optimized the tradeoff between sensitivity and PPV for all junction lengths. Although a few repertoires did have an alternate threshold with slightly improved performance, these thresholds were not evident from the distance-to-nearest distributions. It is possible a method other than hierarchical clustering could better separate clones with shorter junctions, but this would be a minor improvement, as absolute performance of the single linkage hierarchical clustering with Hamming distance was high.
False-positive clonal assignments still occur among sequences with longer junctions, but these appear to have a different underlying cause. In this case the lack of nucleotide diversity can be explained, at least in part, by an overrepresentation of the IGHJ6 gene. This gene extends an extra 10 nt into the junction region (44), causing sequences to appear more similar than others using different J genes. It is possible that a separate analysis of these sequences may improve performance. One possibility for better separating clones that do not have sufficient diversity in the junction region is to require shared mutations in the V or J region, although this would penalize clones that have few mutations overall. Likelihood-based approaches, such as Cloanalyst (28) or partis (29), may help to increase confidence in clones with short junctions or those using IGHJ6, although these approaches are too computationally intensive to use on full AIRR-Seq datasets. Although it has been suggested that partis improves performance relative to hierarchal clustering (29), this study did not use dataset-specific distance thresholds and thus likely underestimated the performance of the clustering-based method.
The comparative analysis presented in the present study suggests clear tradeoffs in the choice of distance and linkage methods. However, it is possible that different tradeoffs would become apparent in data with different clone size distributions, mutation frequencies, and such. The simulation data used to measure sensitivity and PPV were based on lineage tree topologies drawn from only four underlying repertoires. The simulations also assume that Ig sequences maintain the same junction length during clonal evolution, an assumption that was also made in the clustering algorithm. However, recent research indicates that a small percentage of SHM events may lead to changes in junction length within a clone (59). Insertions/deletions may be present in the junction due to sequencing errors, but the inclusion of UMIs followed by computational approaches for sequencing error correction can reduce this impact (6, 60). Few clonal grouping methods deal with junction length differences, and although these effects are also not accounted for in the present study, their influence on performance is expected to be small. Another possible source of bias in the performance on experimental data is the potential presence of so-called “public clones,” or highly similar sequences across individuals. Such sequences may skew specificity estimates that were approximated on publicly available human experimental datasets based on the frequency of inferred groups that spanned individuals. Furthermore, this specificity measure depends on the frequency of highly similar sequences found across individuals, which may differ from the frequency of highly similar sequences found within an individual by chance. Public clones are especially prevalent in L chains due to the relative lack of diversity in the junction region, caused in part by the absence of the D gene segment. Clonal grouping using only L chain sequences is expected to have lower performance than the H chain results shown in this study. Future studies could benefit from using a larger number of AIRR-Seq datasets that span age, tissue, disease state, and others in addition to simulations based on a larger number of underlying experimental Ig repertoires.
In summary, computationally grouping Ig sequences into B cell clones is a critical part of AIRR-seq studies, and enables understanding the structure and affinity maturation of the Ig repertoire. In this study, we developed a framework for comparative analysis of clonal grouping approaches and determined that single linkage hierarchical clustering with length-normalized nucleotide Hamming distance performs well on both human experimental and simulated datasets. This algorithm is available as part of the Change-O and SHazaM packages (35) in our Immcantation tool suite (http://immcantation.readthedocs.io).
Disclosures
The authors have no financial conflicts of interest.
Acknowledgments
We thank Gur Yaari and Moriah Cohen for preprocessing the influenza vaccination data used to generate the simulated data and for helpful comments on the manuscript. We thank the High Performance Computing facilities operated by the Yale Center for Research Computing and Yale’s W. M. Keck Biotechnology Laboratory.
Footnotes
The sequences presented in this article have been submitted to the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) under accession number PRJNA349143.
S.H.K. was supported by National Institutes of Health Grant R01 AI104739, and N.T.G. was supported by National Institutes of Health Grant R01 AI104739, United States–Israel Binational Science Foundation Grant 2013395, and by Pharmaceutical Research and Manufacturers of America Grant R12886. The Yale High Performance Computing facilities are funded by National Institutes of Health Grants RR19895 and RR029676-01.
The online version of this article contains supplemental material.
Abbreviations used in this article:
- AIRR-Seq
- adaptive immune receptor repertoire sequencing
- hS1F
- human S1F
- hS5F
- human S5F
- PPV
- positive predictive value
- SHM
- somatic hypermutation
- UMI
- unique molecular identifier
- WNV
- West Nile virus.
- Received October 31, 2016.
- Accepted January 4, 2017.
- Copyright © 2017 by The American Association of Immunologists, Inc.