Skip to main content

Main menu

  • Home
  • Articles
    • Current Issue
    • Next in The JI
    • Archive
    • Brief Reviews
      • Neuroimmunology: To Sense and Protect
    • Pillars of Immunology
    • Translating Immunology
    • Most Read
    • Top Downloads
    • Annual Meeting Abstracts
  • COVID-19/SARS/MERS Articles
  • Info
    • About the Journal
    • For Authors
    • Journal Policies
    • Influence Statement
    • For Advertisers
  • Editors
  • Submit
    • Submit a Manuscript
    • Instructions for Authors
    • Journal Policies
  • Subscribe
    • Journal Subscriptions
    • Email Alerts
    • RSS Feeds
    • ImmunoCasts
  • More
    • Most Read
    • Most Cited
    • ImmunoCasts
    • AAI Disclaimer
    • Feedback
    • Help
    • Accessibility Statement
  • Other Publications
    • American Association of Immunologists
    • ImmunoHorizons

User menu

  • Subscribe
  • My alerts
  • Log in
  • Log out

Search

  • Advanced search
The Journal of Immunology
  • Other Publications
    • American Association of Immunologists
    • ImmunoHorizons
  • Subscribe
  • My alerts
  • Log in
  • Log out
The Journal of Immunology

Advanced Search

  • Home
  • Articles
    • Current Issue
    • Next in The JI
    • Archive
    • Brief Reviews
    • Pillars of Immunology
    • Translating Immunology
    • Most Read
    • Top Downloads
    • Annual Meeting Abstracts
  • COVID-19/SARS/MERS Articles
  • Info
    • About the Journal
    • For Authors
    • Journal Policies
    • Influence Statement
    • For Advertisers
  • Editors
  • Submit
    • Submit a Manuscript
    • Instructions for Authors
    • Journal Policies
  • Subscribe
    • Journal Subscriptions
    • Email Alerts
    • RSS Feeds
    • ImmunoCasts
  • More
    • Most Read
    • Most Cited
    • ImmunoCasts
    • AAI Disclaimer
    • Feedback
    • Help
    • Accessibility Statement
  • Follow The Journal of Immunology on Twitter
  • Follow The Journal of Immunology on RSS

Hierarchical Clustering Can Identify B Cell Clones with High Confidence in Ig Repertoire Sequencing Data

Namita T. Gupta, Kristofor D. Adams, Adrian W. Briggs, Sonia C. Timberlake, Francois Vigneault and Steven H. Kleinstein
J Immunol March 15, 2017, 198 (6) 2489-2499; DOI: https://doi.org/10.4049/jimmunol.1601850
Namita T. Gupta
*Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Kristofor D. Adams
†AbVitro, Boston, MA 02210;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Kristofor D. Adams
Adrian W. Briggs
†AbVitro, Boston, MA 02210;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Sonia C. Timberlake
†AbVitro, Boston, MA 02210;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Francois Vigneault
†AbVitro, Boston, MA 02210;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Francois Vigneault
Steven H. Kleinstein
*Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520;
‡Department of Immunobiology, Yale School of Medicine, New Haven, CT 06520; and
§Department of Pathology, Yale School of Medicine, New Haven, CT 06520
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • Article
  • Figures & Data
  • Info & Metrics
  • PDF + SI
  • PDF
Loading

Abstract

Adaptive immunity is driven by the expansion, somatic hypermutation, and selection of B cell clones. Each clone is the progeny of a single B cell responding to Ag, with diversified Ig receptors. These receptors can now be profiled on a large scale by next-generation sequencing. Such data provide a window into the microevolutionary dynamics that drive successful immune responses and the dysregulation that occurs with aging or disease. Clonal relationships are not directly measured, but they must be computationally inferred from these sequencing data. Although several hierarchical clustering-based methods have been proposed, they vary in distance and linkage methods and have not yet been rigorously compared. In this study, we use a combination of human experimental and simulated data to characterize the performance of hierarchical clustering-based methods for partitioning sequences into clones. We find that single linkage clustering has high performance, with specificity, sensitivity, and positive predictive value all >99%, whereas other linkages result in a significant loss of sensitivity. Surprisingly, distance metrics that incorporate the biases of somatic hypermutation do not outperform simple Hamming distance. Although errors were more likely in sequences with short junctions, using the entire dataset to choose a single distance threshold for clustering is near optimal. Our results suggest that hierarchical clustering using single linkage with Hamming distance identifies clones with high confidence and provides a fully automated method for clonal grouping. The performance estimates we develop provide important context to interpret clonal analysis of repertoire sequencing data and allow for rigorous testing of other clonal grouping algorithms.

Introduction

The capacity of B cells to modify their Abs or Ig receptors to adapt in response to pathogenic challenges is a key mechanism that protects us from infection. An initial diversity of ∼107 unique Ig molecules (1) stems from somatic recombination of gene segments in the B cell Ig gene locus compounded by stochastic nucleotide insertions and deletions at the junctions of these segments. Upon activation, these naive B cells diversify further by undergoing clonal expansion with somatic hypermutation (SHM) in the Ig gene (approximately one point mutation per 1000 bp per cell division) (2, 3) followed by selection for higher affinity B cells. This microevolutionary process known as affinity maturation results in B cells with diversified Ig receptors that are clonal relatives of the original activated B cell. In healthy human adults, class-switched memory B cells express Ig receptors that are ∼7% mutated (4). Our ability to profile this adaptive immune response has dramatically improved through the application of next-generation sequencing, which allows for measurement of tens to hundreds of millions of B cell receptors (5). However, the identification of sequences that belong to the same B cell clone in these data remains a significant challenge (6).

Adaptive immune receptor repertoire sequencing (AIRR-Seq) is being widely used for both basic science and clinical studies (7–9). Statistical properties of the repertoire, such as diversity or mutational load, are being used to gain insights into the dysregulation that occurs with aging or disease (4, 8, 10–21). Properly identifying clones is central to the calculation of many of these properties. For example, clone size distributions are the basis for several diversity measures, such as species richness, Shannon entropy, and the Gini–Simpson index (6) that parallel diversity measures in ecology (22). Diseases such as chronic lymphocytic leukemia are characterized by low diversity that is driven by the dominance of a small number of clones (23), and repertoire sequencing has been used to improve minimal residual disease detection for lymphoid cancers (8, 10). Responses to drugs such as rituximab have also been measured by changes in repertoire diversity in autoimmune disease (11, 12), characterizing treatment regimens that lead to successful remission or result in persistent clonal expansions. Decreases in repertoire diversity have been associated with aging (4, 13, 14). In subjects with seasonal allergies, the IgE repertoire is the least diverse compared with other isotypes in blood and nasal biopsies, indicating a focused immune response (16).

Analysis of diversity within a clonal lineage also has several applications. Reconstruction of B cell clonal lineages using methods such as maximum parsimony or likelihood (24) allows tracing somatic mutations through the Ig sequences and helps in understanding the evolution of neutralizing Abs (17, 18). Lineage relationships have also been used to gain insight into the mechanisms underlying isotype switching (25, 26) and to show that B cell clones in the CNS in subjects with multiple sclerosis are first activated in the periphery (15). Identifying clones that include sequences with known Ag specificities has also been used to reveal novel Ag-specific sequences (19, 20). Thus, clonal partitioning of AIRR-Seq data is central to a wide range of applications.

Despite its importance, there is no consensus on the best method for grouping Ig sequences into B cell clones. Most current approaches leverage the high diversity of the junction region (i.e., where the V, D, and J gene segments join) as a “fingerprint” to identify each B cell clone (27). Because it is unlikely that two separate recombination events would lead to identical junctions, sequences with junction regions that are similar enough are determined to share a common B cell ancestor (i.e., be clonally related) rather than to have arisen independently. Probabilistic models have been developed to calculate likelihood of sharing a B cell ancestor and subsequently infer clonal grouping (28, 29). However, these algorithms have run times that scale exponentially, which is computationally intractable for large sequencing datasets (29). In practice, most studies cluster sequences based on junction sequence similarity (13, 15, 18, 21, 30–32).

Although many clustering approaches exist, hierarchical clustering is the most widely used framework for grouping clonally related sequences. Hierarchical clustering requires a measure of distance between pairs of sequences and a choice of linkage to define the distance between groups of sequences. Because hierarchical clustering produces a tree defining the relationships between all sequences, it is also necessary to specify a method to cut the hierarchy to identify discrete clonal groups. In practice, most studies first split the sequences using some similarity requirement on the germline gene segments (e.g., identical V and J gene segments, junction length) and then apply hierarchical clustering on the junction sequence of these smaller groups (8, 15, 18, 21, 30–34). Several distance metrics have been proposed, including Hamming distance, which is simply the absolute count of differences between two amino acid (31, 33) or nucleotide (21, 32) sequences, normalized edit distance (30), and a metric that incorporates hot-/cold-spot biases in SHM targeting (15, 18). In addition to metrics defining distance between two sequences, linkage methods define how distance is calculated between groups of sequences. Different clonal grouping algorithms use single (15, 18, 21, 32), average (30), or complete (13) linkage. The threshold at which the hierarchy is cut to define clusters of clonally related sequences has also been determined in several ways. Chen et al. (30) propose a fixed threshold that is manually identified based on when the rate of cluster merging events changes for a gold standard dataset (30). Glanville et al. (31) introduced a method based on the observed bimodal distribution of distances from each sequence to its nearest neighbor. In this case, the first mode is assumed to represent sequences with clonal relatives in the data (near neighbors), whereas the second mode is taken to represent sequences without clonal relatives in the data (distant neighbors). The threshold is then selected to be the value that separates the two modes of this distribution (21, 31). As of yet, there has not been an in-depth evaluation of performance of hierarchical clustering–based clonal grouping algorithms including a comparison of the different distance and linkage methods on AIRR-Seq data.

In this study, we carry out a comparative analysis of distance metrics and linkage methods for hierarchical clustering–based clonal grouping of IgH sequences. A combination of experimental and simulation-based criteria is used to evaluate the performance of these algorithms, including estimates of specificity, sensitivity, and positive predictive value (PPV). Overall, we find that single-linkage hierarchical clustering with nucleotide Hamming distance has excellent performance, with specificity, sensitivity, and PPV all >99%. Implementations of all clonal grouping methods, along with extensive documentation, are available through the Change-O and SHazaM packages (35) as part of the Immcantation tool suite (http://immcantation.readthedocs.io) for AIRR-Seq analysis.

Materials and Methods

Human BCR repertoire sequencing data

Three BCR repertoire sequencing datasets (Healthy, Dengue and West Nile virus [WNV]) were used to measure the performance of clonal grouping methods. The Healthy dataset was composed of sequences from PBMCs isolated from healthy adult subjects (n = 27) as previously described (14). Sequences were filtered as described in Wang et al. (14), including removal of non-IGH artifacts, sequences with V-gene insertion or deletions, and chimeric sequences. The Dengue dataset was composed of sequences from PBMCs isolated from subjects with acute dengue infection (n = 42) as described previously (36). Sequences were filtered as described in Parameswaran et al. (36), including removal of reads containing insertions and deletions as determined by alignment against the V and J germline repertoires. The WNV dataset was composed of sequences from PBMCs and sorted plasma, memory, and naive B cells isolated from subjects recently infected with WNV (n = 7) as previously described (18). Sequences were filtered as described in Tsioris et al. (18), including filtering based on sequence quality using pRESTO (37) version 0.4. In each case, processed sequencing data were obtained from the authors. Germline gene segments were inferred for each sequence by using IMGT/HighV-QUEST (38). The Healthy dataset was run through IMGT/HighV-QUEST on December 21, 2014, the Dengue dataset was run through IMGT/HighV-QUEST on March 12, 2015, and the WNV dataset was run through IMGT/HighV-QUEST on March 21, 2014. Sequences identified as nonfunctional by IMGT/HighV-QUEST were removed using the Change-O toolkit version 0.2.0 (35).

Two additional BCR repertoire sequencing datasets from healthy adult subjects were used as a source of naive BCR sequences for the lineage simulations. The first was composed of sequences from PBMCs and sorted naive B cells isolated from healthy control subjects (n = 4) as part of a study of myasthenia gravis described in Vander Heiden et al. (39). The second was composed of sequences from total RNA isolated from blood samples of healthy adult subjects (n = 3) as part of an influenza vaccination study described in Laserson et al. (40). In this case the samples, which were originally sequenced using Roche 454, were resequenced using Illumina MiSeq and published for the first time in this article (see details in the next two sections of the Materials and Methods). Identical sequences from the same sample were counted once, but identical sequences from different samples were counted independently. Both datasets are available on the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) under BioProject accession numbers PRJNA349143 (influenza vaccination study) and PRJNA338795 (myasthenia gravis study).

Library preparation and BCR sequencing of healthy subject sequences

The blood samples collected in the influenza vaccination study by Laserson et al. (40) were resequenced using the Illumina MiSeq platform as previously described (18, 41). Briefly, RNA was reverse transcribed into cDNA using a biotinylated oligo(dT) primer. An adaptor sequence was added to the 3′ end of all cDNA, which contains the Illumina P7 universal priming site and a 17-nt unique molecular identifier (UMI). Products were purified using streptavidin-coated magnetic beads followed by a primary PCR reaction using a pool of primers targeting the IGHA, IGHD, IGHE, IGHG, IGHM, IGKC, and IGLC regions, as well as a sample-indexed Illumina P7C7 primer. The Ig-specific primers contained tails corresponding to the Illumina P5 sequence. PCR products were then purified using AMPure XP beads. A secondary PCR was then performed to add the Illumina C5 clustering sequence to the end of the molecule containing the C region. The number of secondary PCR cycles was tailored to each sample to avoid entering plateau phase, as judged by a prior quantitative PCR analysis. Final products were purified, quantified with Agilent TapeStation, and pooled in equimolar proportions, followed by high-throughput paired-end sequencing on the Illumina MiSeq platform. For sequencing, the Illumina 600 cycle kit was used with the modifications that 325 cycles were used for read 1, six cycles for the index reads, 300 cycles for read 2, and a 10% PhiX spike-in to increase sequence diversity.

Read processing of healthy subject sequences

MiSeq reads from the influenza vaccination study by Laserson et al. (40) were demultiplexed using Illumina software. Positions with less than Phred quality 5 were masked with Ns. Isotype-specific primers and unique molecular barcodes (UMI) were identified in the amplicon and cut using pRESTO (37) MaskPrimers-score. Read 1 and read 2 consensus sequences were generated separately for each mRNA from reads grouped by UMI, which represent PCR replicates arising from a single initiating mRNA molecule. UMI read groups were aligned with MUSCLE (42), and pRESTO was used to construct a consensus sequence using BuildConsensus, requiring that ≥60% of called PCR primer sequences agree for the read group, maximum nucleotide diversity of 0.1, using majority rule on indel positions, and masking alignment columns with low posterior (consensus) quality. Paired-end consensus sequences were then stitched in two rounds. First, ungapped alignment of each read pair’s consensus sequence termini was optimized using a Z-score approximation and scored with a binomial p value as implemented in pRESTO AssemblePairs-align. For read pairs failing to stitch this way, stitching was attempted using the human BCR germline V exons to scaffold each read prior to stitching or gapped read-joining, using pRESTO AssemblePairs-reference. Positions with posterior consensus quality less than Phred 5 were masked again with Ns. All pRESTO tools used were version 0.5.1 in conjunction with Python 3.4. Germline gene segments were inferred using IgBLAST version 1.4.0 (43) with the IMGT/GENE-DB (44) reference sequences from June 7, 2014 and output was parsed with Change-O (35) MakeDb version 0.3.0. Duplicate sequences were collapsed and only those H chain sequences with at least two reads supporting the sequence were retained for further analysis.

Simulation of B cell clonal lineages

Each simulated clone was generated by introducing mutations into an experimentally observed naive BCR sequence according to an observed lineage tree topology (i.e., branching pattern). Lineage tree topologies were previously derived based on sequencing data from lymph node samples collected as part of a published study of multiple sclerosis (15). The set of 7,103, 4,066, 8,244, and 14,782 lineage topologies from four subjects (referred to in this article as R1, R2, R3, and R4, respectively) were each used as the basis for 10 simulations resulting in 40 total simulated datasets. To generate a simulated dataset, the root of each lineage was randomly chosen (without replacement) from a large pool of unmutated sequences from healthy subjects obtained from Vander Heiden et al. (39) and Laserson et al. (40) (described in Human BCR repertoire sequencing data). Mutations were then added to the sequence to match the experimentally observed mutation counts of each branch in the lineage tree according to the human S5F (hS5F) targeting model (45). The simulated sequences then had germline gene segments inferred using IgBLAST version 1.4.0 (43) with the IMGT/GENE-DB (44) reference sequences from May 2, 2016 and output was parsed with Change-O (35) MakeDb version 0.3.3.

Distance metrics

Hamming distance was defined as the absolute count of letter changes between nucleotide junction sequences or amino acid junction sequences. The 5-mer distance metrics were all based on the hS5F targeting and substitution models described in Yaari et al. (45), which estimate 1): the relative probability of a nucleotide position being targeted for somatic mutation, and 2) the probability of mutating to each of the three other possible nucleotides, based on the two nucleotides up- and downstream. This probability (p) was transformed into a distance (d) using the formula: d = −log10(p). The distance between two junction sequences was defined to be the sum of distances between each nucleotide position. For a given mutation between two junction sequences, the hS5F-min model took its distance to be the minimum of mutating from nucleotide n1 to n2 and from n2 to n1 at that position. The hS5F-avg model took the distance of the mutated position to be the average of mutating from n1 to n2 and from n2 to n1.

The human S1F (hS1F) model is equivalent to hS5F-min, but used the human symmetric substitution matrix based on the single mutated nucleotide described in Yaari et al. (45). The m1n model is equivalent to hS5F-min, but used the mouse symmetric substitution matrix based on the single mutated nucleotide described in Shapiro et al. (46). Both the human and mouse substitution matrices are four-by-four matrices whose rows correspond to the conditional probabilities of each of the four nucleotides to mutate to any of the other nucleotides. Hence, the diagonal of each matrix is 0, and each row sums to 1. To estimate distances between sequences based on each single nucleotide substitution matrix, we first approximated each matrix by a symmetric matrix both with respect to time (i.e., equal likelihood of mutating from nucleotide n1 to n2 and from n2 to n1), and with respect to strand (e.g., equal likelihood of mutating from C to T and from G to A). The symmetric matrix has three free parameters that were estimated by minimizing the sum of squares between this matrix and the original substitution matrix. The fitted matrix was normalized to ensure that each row sums to 1. Using the symmetric substitution matrix, a distance between two sequences was computed by going over all substitution events and summing over the logarithms of the corresponding conditional probabilities from the symmetric substitution matrix.

When normalizing by length for any distance metric, these distances were divided by the length of the junction region.

Implementation of clonal grouping algorithms

Clonal grouping algorithms were implemented and are made available in the Change-O toolkit (35) (version 0.3.1 or newer). Sequences were first grouped by shared V gene, J gene, and junction length. Within these groups of sequences, hierarchical clustering was performed using the bygroup subcommand of DefineClones.py with the specified distance metric and linkage type. The resulting hierarchy was then trimmed into flat clusters at a fixed threshold determined using an automated method based on analyzing the “distance-to-nearest” profile. For each sequence, the distance to its nearest unidentical junction was calculated using the SHazaM R package (35) (version 0.1.3 or newer). The ideal bandwidth for the fourth derivative kernel density estimate of these distances was then estimated using the unbiased cross-validation method (47) of the fourth derivative of the kernel density estimate (48, 49) from the kedd R package (50) (version 1.0.3). This bandwidth was used to calculate a binned kernel density estimate of the distances with a Gaussian kernel using the KernSmooth R package (version 2.23-15). The minimum between the two modes of the resulting bimodal distribution of distances was then calculated by finding the first value at which the first derivative was 0 while the second derivative was positive, indicating a local minimum following a local maximum. If such a minimum were not found, an error would have been returned, but this was not the case for any of the analyses in this study.

Specificity, PPV, and sensitivity

Performance was characterized by considering the binary classification task of defining the relationship between all pairs of sequences (s1 and s2) with the same junction length. These classifications were then pooled together for the entire dataset. When s1 and s2 were known to be unrelated (termed condition negative) but were grouped into the same cluster (termed test positive), this was counted as a false positive. When they were grouped into different clusters (termed test negative), this was a true negative. When s1 and s2 were known to be related (termed condition positive) but were grouped into different clusters, this was counted as a false negative. When they were grouped into the same cluster, this was counted as a true positive. These relationships are outlined in Fig. 1.

In the case of experimental data, two sequences were known to be unrelated when they were derived from two separate individuals. Therefore, false positives were defined as sequences from different individuals being grouped together in a clone, whereas true negatives were defined as sequences from different individuals that were grouped into separate clones. Specificity was then calculated by dividing the number of true negative classifications by the number of condition negative classifications. In other words, specificity was defined as the fraction of pairs of unrelated sequences that were successfully inferred by the algorithm to be unrelated.

For the simulated datasets, the precise clonal membership of each sequence was known, yielding the intuitive definition of false positive and false negative classification. PPV was calculated by dividing true positive classifications by test positive classifications. In other words, PPV was the fraction of predicted clonal relationships that were actually true. Sensitivity was calculated by dividing true positive classifications by condition positive classifications. In other words, sensitivity was defined to be the fraction of actual clonal relationships that were successfully inferred by the algorithm.

Shannon entropy calculation

The Shannon entropy of clonally related sequences (within clones) was calculated for true clones having at least two members. Entropy was calculated for each of the first 24 nucleotide positions of the junction within each clone and averaged across clones having junction length <30 nt and 51 nt. To calculate Shannon entropy of clonally unrelated sequences (between clones), the most mutated sequence was selected from each true clone. These mutated sequences were then placed into groups sharing the same V gene, J gene, and junction length. Groups with only one sequence were discarded. Entropy was calculated for each of the first 24 nucleotide positions of the junction within each group and averaged across groups having junction length <30 nt and 51 nt. Error bars represent SEM. The calculations were made on all of the simulated datasets pooled together.

Results

The problem of clonal grouping takes a set of BCR sequences as input and returns a partition of that set into subsets (clonal groups) that each represent an independent clonal lineage. In this study, we investigated hierarchical clustering-based algorithms that infer a dendrogram based on pairwise sequence distances and then cut the dendrogram at a fixed distance (or threshold) to predict groups of clonally related sequences. To evaluate the performance of these clonal grouping algorithms, we consider three metrics: specificity, sensitivity, and PPV (Fig. 1).

FIGURE 1.
  • Download figure
  • Open in new tab
  • Download powerpoint
FIGURE 1.

Overview of method for calculating the performance measures: sensitivity, PPV, and specificity. Each node represents a sequence that belongs to either of two clones (gray or white clone fill color). In the top left panel, sequences that are truly from the same clone are connected with solid lines. In the top middle panel, sequences that are inferred to be from the same clone are connected with dashed lines. Both true and inferred relationships are shown in the top right panel. In the bottom left panel, sequences that are clonally unrelated are connected by solid lines. In the bottom middle panel, sequences that are inferred to be clonally unrelated are connected by dashed lines. The overlap of true and inferred unrelated sequences is shown in the bottom right panel. The true and inferred edges in the right panels representing clonally related (top) or unrelated (bottom) sequences are compared in order to assess performance. Sensitivity is defined by the number of true positive edges (dashed solid double lines in the top right panel) divided by the number of true edges (solid lines in the top left panel). PPV is the number of true positive edges (dashed solid double lines in the top right panel) divided by the number of inferred edges (dashed lines in the top middle panel). Specificity is number of true negative edges (dashed solid double lines in the bottom right panel) divided by the number of true non-edges (solid lines in the bottom left panel).

Specificity quantifies how frequently unrelated sequences are correctly separated into different clonal groups. In most experimental datasets, the exact clonal relationships between sequences are unknown. However, to estimate specificity we can take advantage of the fact that B cell clones cannot span multiple individuals (by definition). Using this knowledge, specificity is defined based on how frequently sequences from separate individuals are incorrectly inferred to be clonal relatives. This measure is used to quantify performance on three human Ig AIRR-Seq datasets (referred to as Healthy, Dengue, and WNV as detailed in Materials and Methods), each of which contains samples from multiple individuals.

Sensitivity represents inclusivity of an algorithm by measuring how often clonally related sequences are grouped together. PPV is a complementary metric that quantifies the precision of an algorithm by measuring how often inferred clonal relatives are truly clonally related. The calculations for sensitivity and PPV require knowledge of true clonal relationships and thus cannot be estimated from current human experimental datasets. For these measures, performance was evaluated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (referred to as R1–R4 as detailed in Materials and Methods). In the following sections, we use these performance metrics on the human experimental and simulation data to evaluate the choice of distance metrics, linkage methods, and threshold parameters in clustering-based clonal grouping algorithms.

Automated determination of clonal distance thresholds

A key step in hierarchical clustering-based clonal grouping involved choosing a threshold at which to cut the dendrogram, thus forming discrete groups of clonally related sequences. In previous work (21, 31), this threshold has often been fixed at a single value determined by manual inspection of a histogram of nearest neighbor distances (the so-called distance-to-nearest plot) (6). These histograms are typically bimodal and the threshold is selected to separate these two modes (Fig. 2A). This choice is motivated by the intuition that the smaller peak represents the distance between sequences within a clone (intraclonal distance), whereas the larger peak represents the distance between sequences in different clones (interclonal distance). Inspection of the nearest neighbor distance distributions for the Healthy, Dengue, and WNV experimental datasets used in this study showed that they are each clearly bimodal. However, they differed in the values that best separated the two modes (Fig. 2B). This result indicates that the distance threshold for clonal grouping is dataset-specific and must be recomputed for each study.

FIGURE 2.
  • Download figure
  • Open in new tab
  • Download powerpoint
FIGURE 2.

Analysis of the distance-to-nearest neighbor plot to define the distance threshold for partitioning clones. For each sequence, the length-normalized nucleotide Hamming distance to every other sequence was calculated, and the nearest (non-zero) neighbor was identified. (A) The histogram of nearest neighbor distances for a simulated dataset was fit using a density estimation of the distribution (solid line), and this fitting was then used to automatically infer a threshold that separated the two modes of the distribution (dashed vertical line). (B) Nearest neighbor distributions were calculated for the Dengue (dashed line), Healthy (dotted line), and WNV (dash-dot line) experimental datasets. Inferred thresholds for each of these human datasets are indicated by the vertical lines.

Manual determination of the clustering threshold is problematic because inspecting a distribution by eye is time-consuming and imprecise. We therefore sought to develop an automated analytic procedure for inferring the clustering threshold that mimics the widely used manual approach. Because the histograms generated from real data are rarely smooth, we first smooth the empirical distributions using a binned Gaussian kernel density estimator using a procedure that is well suited for bimodal distributions (49) (see Materials and Methods for details). Next, we computationally determine the minimum between the two peaks of the smoothened distribution and define this value to be the clustering threshold (Fig. 2A). This method placed the threshold at intuitive locations in the Healthy, Dengue, and WNV experimental datasets (Fig. 2B). This method for automated determination of the clustering threshold enables efficient application of clonal grouping algorithms under many parameter settings and on many different datasets.

We next applied the automated threshold to assess the performance of clonal grouping methods on experimental and simulated datasets. Hierarchical clustering using the nucleotide-based Hamming distance metric with single linkage was an effective approach. The mean specificity of the algorithm was >99% on experimental data (Fig. 3A), the mean sensitivity was ∼99% on simulated datasets (Fig. 3B), and PPV was >99% on simulated datasets (Fig. 3C). In contrast, using amino acid–based Hamming distance, which has been used in some previous studies (31, 33), had significantly worse sensitivity (Supplemental Fig. 1). One potential shortcoming of using the Hamming distance metric is that mutations in short junctions are penalized more heavily than mutations in longer junctions. Because junction regions vary widely in length (33–81 nt, 95% range from experimental datasets) and the clustering algorithm uses a fixed threshold, this bias could lead to suboptimal performance. In an attempt to address this issue, we and others have used a length-normalized Hamming distance metric, in which Hamming distance is divided by the length of the junction. This length normalization had minimal effect on specificity in the experimental data (Fig. 3A), but it significantly improved sensitivity (Fig. 3B) and PPV (Fig. 3C) in the simulated data (p < 10−4, paired t test). Thus, length normalization of the distance metric is an important step in clonal grouping algorithms.

FIGURE 3.
  • Download figure
  • Open in new tab
  • Download powerpoint
FIGURE 3.

Length normalization of the distance measure increases performance. Single linkage hierarchical clustering was used to identify clonally related sequences using a distance metric based on the absolute Hamming distance of the junction sequences (None), or the Hamming distance normalized by the length of the junction (Length). (A) Specificity was calculated using three human experimental datasets (Healthy, Dengue, and WNV). Sensitivity (B) and PPV (C) were calculated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Bars indicate mean performance. *p < 0.0001 by paired t test.

Single linkage has highest sensitivity with minimal compromise of PPV

Hierarchical and other agglomerative clustering algorithms require a method for determining the distance between two sets of points (in this case, sequences). The most common linkage methods include single, average, and complete linkage (51). Single linkage defines the interset distance as the minimum distance between all pairs of points from the given sets. This generally results in larger and more heterogeneous clusters (51). Complete linkage defines the interset distance as the maximum distance between all pairs of points from the given sets, and generally results in smaller and more homogeneous clusters (51). Average linkage defines the interset distance as the average distance between all pairs of points from the given sets, thus providing a compromise between single and complete linkage.

As expected, single linkage had the lowest specificity followed by average and then complete linkage (Fig. 4A). However, these differences were small, and specificity was >99% in all cases. A similar ranking was found for PPV, with complete and average linkage significantly improving performance relative to single linkage (p < 10−4, paired t test; Fig. 4C). Once again, however, the absolute performance differences were small, with all three approaches exhibiting a mean PPV of >99%. As specificity and PPV both reflect the accuracy of clonal grouping, we conclude that all of the linkage methods are accurate. In contrast, single linkage exhibited significantly higher sensitivity for clonal grouping relative to both average and complete linkage (p < 10−4, paired t test; Fig. 4B). In this case, the sensitivity differences were large, with single linkage having a mean sensitivity of 99% compared with 88% for average linkage and 60% for complete linkage. Overall, these results show that single linkage is significantly better at capturing the breadth of true clonal relationships, with only a modest reduction in accuracy.

FIGURE 4.
  • Download figure
  • Open in new tab
  • Download powerpoint
FIGURE 4.

Single linkage clustering provides the highest sensitivity, with minimal loss of specificity or PPV. Hierarchical clustering was used to identify clonally related sequences using length-normalized Hamming distance and single (Single), average (Avg), or complete (Compl) linkage. (A) Specificity was calculated using three human experimental datasets (Healthy, Dengue, and WNV). Sensitivity (B) and PPV (C) were calculated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Bars indicate mean performance. *p < 0.0001 by paired t test.

Incorporating SHM biases does not significantly improve clonal grouping

Although the Hamming distance between two sequences is quick and easy to compute, it does not account for the intrinsic targeting and substitution biases in SHM (45). It is well established that activation-induced cytidine deaminase and the error-prone DNA repair pathways that drive B cell diversification frequently target specific DNA motifs (termed hot spots), whereas others are rarely mutated (termed cold spots). There is also a substitution bias such that transition mutations are significantly more frequent than transversions. Weighing all mutations equally undervalues the less probable mutations because two sequences are less likely to be part of a clone when they differ by mutations that occur less frequently (i.e., transversion mutations at cold-spot positions).

To test whether accounting for the intrinsic biases of SHM could improve the performance of clonal grouping algorithms, we implemented four previously proposed SHM models that account for these biases in different ways (see Materials and Methods for details). The first two models (hS5F-min and hS5F-avg) incorporate the hS5F targeting (mutability and substitution) models that incorporate the effects of the two nucleotides up- and downstream of a mutation (45). For each pair of nucleotides (n1 and n2) that differ between two junctions being compared, the hS5F-avg metric assumes that each one has an equal probability of having been present in the most recent common ancestor. Thus, the distance is taken as the average of mutating from n1 to n2 and from n2 to n1. The hS5F-min metric assumes that the ancestral base is the one that leads to the most likely mutation, and thus uses the minimum distance at each nucleotide position. The second two models (hS1F and m1n) ignore mutability, but they account for substitution bias using a model that depends only on the targeted base (i.e., ignoring surrounding nucleotides). As these models are symmetric, there is no assumption of which nucleotide was ancestral. Surprisingly, we found no significant performance differences for any of the distance metrics in experimental (Fig. 5A) or simulated datasets (Fig. 5B, 5C). These results support the use of the more efficient nucleotide Hamming distance metric. Overall, we find that hierarchical clustering using length-normalized nucleotide Hamming distance with single linkage performs well with mean sensitivity, specificity, and PPV all >99%.

FIGURE 5.
  • Download figure
  • Open in new tab
  • Download powerpoint
FIGURE 5.

Including hot- and cold-spot biases in the distance measure does not significantly impact the performance of clonal grouping. Single linkage hierarchical clustering was used to identify clonally related sequences using length-normalized nucleotide Hamming distance (ham), as well as other distance metrics that incorporated varying SHM biases as described in Materials and Methods: hS5F-min, m1n, hS1F, and hS5F-avg. (A) Specificity was calculated using three human experimental datasets (Healthy, Dengue, and WNV). Sensitivity (B) and PPV (C) were calculated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Bars indicate mean performance.

Sequences with short junctions have high false-positive rates

We next investigated the dependence of performance on junction length to better understand the source of errors in clonal grouping. Junction length was minimally correlated with specificity in the experimental datasets (r = 0.1, Pearson correlation; Fig. 6A). Similarly, there was no correlation of junction length with sensitivity in the simulated datasets (r = 0.02, Pearson correlation; Fig. 6B). In contrast, there was a strong positive correlation of junction length with PPV in the simulated datasets (r = 0.4, Pearson correlation), with a mean PPV of 99.1% for sequences with shorter junctions (junctions <30 nt represented by at least 0.001% of the sequences in the repertoire) compared with a mean PPV of 99.8% for sequences with longer junctions (Fig. 6C). For sequences with shorter junction lengths, the right peak in the nearest neighbor distance distributions (interpreted as distances between unrelated sequences) begins to overlap the left peak (interpreted as distances between clonally related sequences). This pattern of decreasing interclonal distances as junction lengths decrease was also apparent considering nearest neighbors across individuals (Fig. 7). Thus, it appears that the distance threshold that effectively separates clonal members with longer junctions begins to group together unrelated sequences with shorter junctions. These results raise the possibility that using a single distance threshold to separate clonal groups across the entire dataset may not be optimal for sequences with shorter junctions.

FIGURE 6.
  • Download figure
  • Open in new tab
  • Download powerpoint
FIGURE 6.

PPV is decreased among sequences with smaller junction lengths. Single linkage hierarchical clustering was used to identify clonally related sequences using length-normalized nucleotide Hamming distance. (A) Specificity was calculated using three human experimental datasets (Healthy, Dengue, and WNV). Sensitivity (B) and PPV (C) were calculated using 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Horizontal dashed lines are shown at an arbitrary value in each panel to highlight trends.

FIGURE 7.
  • Download figure
  • Open in new tab
  • Download powerpoint
FIGURE 7.

Peaks in the distance-to-nearest neighbor distribution begin to converge at small junction lengths. Nucleotide Hamming distance to nearest neighbor distributions were calculated for sequences with junctions of length 24, 39, 51, 69, and 81 nt from the Healthy experimental dataset. Nearest neighbors were defined using sequences within the same subject (dark gray bars), or by using sequences from all other subjects (light gray bars). The single distance threshold inferred using normalized Hamming distance on all junction lengths in the Healthy dataset is shown by the dashed line in each distribution.

To determine whether using multiple distance thresholds could improve performance, we assessed precision (PPV) and recall (sensitivity) across a range of distance thresholds using sequences of varying junction lengths. We selected the shortest junction length represented by at least 0.001% of total sequences (24 nt), the overall average junction length (51 nt), and the longest junction length with a distinguishable spread in precision across distance thresholds (81 nt) as example junction lengths with which to assess performance. When considering all junction lengths as one group, the automated threshold appears close to optimal in trading off between PPV and sensitivity, with both >99% (Fig. 8A). The same holds true when considering the average (51 nt; Fig. 8C) and longer junction lengths (81 nt; Fig. 8D). Interestingly, the single threshold chosen on the entire dataset still provided a near optimal tradeoff in performance for sequences with shorter (24 nt) junctions, although peak sensitivity was lower for some of the simulated repertoires (Fig. 8B). Thus, using a junction length–specific threshold is unlikely to improve performance.

FIGURE 8.
  • Download figure
  • Open in new tab
  • Download powerpoint
FIGURE 8.

A single distance threshold is near optimal for all junction lengths. Single linkage hierarchical clustering with length-normalized nucleotide Hamming distance was used to identify clonally related sequences in 40 simulated datasets based on four experimentally observed sets of clonal lineage structures (R1–R4). Precision-recall curves were generated by varying the distance threshold from 0.10 to 0.20 at intervals of 0.01. The precision (PPV) and recall (sensitivity) of each run were averaged across the 10 simulations of each repertoire in (A) sequences of all lengths and in sequences with junctions of length (B) 24, (C) 51, and (D) 81 nt. The performance of the algorithm run with the inferred threshold is shown by filled points in all panels. Insets show the same data zoomed in on the upper right of the plot.

The inability to separate unrelated sequences with shorter junction lengths implies a lack of diversity between clones. Indeed, sequences with short junctions had lower nucleotide diversity than did sequences with longer junctions (Fig. 9). In other words, unrelated sequences with short junctions were more similar on a per nucleotide basis than unrelated sequences with longer junctions. This difference in diversity was not spread evenly across the junction region, but only became apparent after the first ∼7 nt of the junction region, which are generally derived directly from the V gene segment (44). As expected, this was in contrast to nucleotide diversity within clones, which was low across all junction lengths (Fig. 9). Overall, these results show that sequences with shorter junctions have a lower diversity than expected (given their length), making it difficult to separate clonally related and unrelated sequences.

FIGURE 9.
  • Download figure
  • Open in new tab
  • Download powerpoint
FIGURE 9.

Unrelated sequences with shorter junctions have lower entropy per nucleotide. Shannon entropy was calculated for each of the first 24 junction nucleotides of clonally related (▴) and unrelated (▪) sequences with junctions <30 nt in length (solid lines) and 51 nt in length (dashed lines). Error bars are SEM entropy across all sequences in the given group.

Although sequences with longer junctions can be grouped into clones with high sensitivity and PPV, false-positive assignments are still present. One reason underlying these errors is the use of the IGHJ6 gene, which is overrepresented in false positives with junctions at least 30 nt in length (p < 10−3, χ2 test). The IGHJ6 gene extends an extra 10 nucleotides into the junction region relative to all other IGHJ genes (44), and clones that use this J gene would thus be more similar to each other than clones using other J genes.

Discussion

AIRR-Seq enables large-scale characterization of the Ig repertoire with the potential for significant basic science and clinical insights. Effective population-level analysis of these data often relies on first identifying groups of clonally related sequences. Although hierarchal clustering-based approaches are widely applied in current studies, estimates of their performance and the tradeoffs inherent in the choice of distance or linkage method are lacking. In this study, we carry out an in-depth comparison of hierarchical clustering-based clonal grouping algorithms using an automated analysis pipeline, along with experimental and simulated validation datasets. The analysis pipeline has three stages. First, sequences are separated by V gene, J gene, and junction length. Second, sequences in these groups are assembled into a hierarchy as defined by the distance metric and linkage method. Finally, the hierarchy is partitioned into discrete clones at a fixed distance threshold. Whereas previous applications of this framework relied on a manual process to choose the distance threshold, we minimized human imprecision by developing an automated method to select a customized threshold for any dataset based on analysis of the distance-to-nearest distribution.

Quantitative evaluation of the clonal grouping methods was based on a combination of human experimental and simulated IgH data using the common performance measures of specificity, sensitivity, and PPV (29, 30, 34). Experimental data were used to estimate specificity based on the fact that, by definition, B cell clones cannot span different individuals. Sensitivity and related measures such as PPV cannot be estimated from human experimental data, because current approaches do not allow unequivocal identification of members of a clone. However, some murine model systems now allow identification of individual clones through Brainbow color labeling of individual B cells prior to affinity maturation (52). In this study, we used simulated data (where all clonal relationships are known explicitly) to estimate sensitivity and PPV along with specificity. Simulations have previously been used to validate performance of a probabilistic clonal grouping algorithm (29) and to benchmark other repertoire analysis tools (53, 54). Unlike previous approaches, our validation framework does not rely on an underlying model of clonal expansion and affinity maturation. Rather, the lineage tree topologies are taken from experimental datasets and are overlaid with new root sequences and somatic mutation patterns to more closely mimic observed repertoire structures. Furthermore, sensitivity and PPV were calculated on the dataset as a whole, which is less biased by clone size than the per-read averages calculated in previous studies (29).

Hierarchical clustering methods relate sequences to each other, but they do not split sequences into discrete clusters. Thus, a critical step in clonal grouping based on hierarchical clustering is determining the threshold used to partition the sequences into clusters (each representing a single clone). Previous approaches used a fixed threshold such as 80% nucleotide similarity (32), or relied on manual inspection of the distance-to-nearest plot to generate a study-specific threshold (31). Distance-to-nearest plots are generally bimodal, with the two peaks interpreted as clonally related (small distance peak) and unrelated (larger distance peak) sequences. Previously the clustering threshold has been determined manually by looking for the distance that best separates these two peaks. However, this process is time-consuming, can be subjective, and there are often multiple possible thresholds that provide equivalent separation between the peaks. To minimize human bias and to enable rapid evaluation of a range of parameter choices for this study, we developed an automated method to mimic the manual approach. Other general methods that have been used for determining the number of clusters in other types of data include the silhouette (55) or V-fold cross validation (56), but these require many rounds of clustering for optimization and are computationally intractable for the large size of AIRR-Seq datasets. The gap statistic (57) is also not applicable because it requires a null distribution of expected within-cluster dispersion, which is unknown and would require several assumptions to simulate for Ig sequences. Thus, the automated threshold inference based on the distance-to-nearest plot proved most feasible for the data type and is supported by biological intuition.

Hierarchical clustering is an agglomerative (or bottom up) method. Each sequence starts as its own cluster, and the closest pair of clusters is merged together until all sequences are connected. Closeness is defined by a distance metric. Many previous studies used Hamming distance (21, 32), which simply counts the number of differences between two junction sequences. Others attempted to incorporate the intrinsic biases of SHM to account for the presence of hot and cold spots (15, 18). In the present study, we found that incorporating the targeting and substitution biases of SHM into the distance metric did not significantly improve performance compared with nucleotide Hamming distance. It is possible that more sophisticated distance measures could play a more important role under conditions different from those investigated in this study. For example, when the mutation frequency is low, a different metric may better capture the importance of each individual mutation in determining the separation between clones. However, the current results suggest that the additional assumptions and computational cost of more complex distance metrics are unlikely to provide substantial performance improvements.

Although distance is measured between pairs of individual sequences, the linkage method defines how to calculate the closeness between clusters that contain multiple sequences. We evaluated the tradeoffs in the most common linkage methods: single, average, and complete. Single linkage is generally considered to be the most inclusive, and we found that it provides the best overall performance with specificity, sensitivity, and PPV all >99%. However, the appropriate choice of linkage may depend on the biological question being addressed. Complete linkage offers a higher PPV, but at the cost of a significant loss of sensitivity. This may be appropriate for research questions that are highly dependent on the accuracy of calling sequences as part of the same clone. For example, studies that attempt to link small numbers of Ag-specific sequences with clonal relatives or establish migration patterns between compartments with infrequent overlaps may benefit from the high confidence in each clonal connection provided by complete linkage. Nevertheless, the high absolute performance of single linkage should be acceptable for most studies.

The specificity, sensitivity, and PPV of single linkage clustering with Hamming distance are all >99%. However, the errors that are made by this algorithm are not random. We found that Ig sequences incorrectly grouped together as clonally related had disproportionately short junction regions (defined in this article as <30 nt). Because the V gene extends into the junction region by approximately seven nucleotides (44), a higher fraction of the nucleotides in short junctions would be expected to have limited diversity compared with longer junctions, potentially limiting the ability to distinguish between clones. This could be particularly problematic when using a length-normalized distance metric, which we showed was critical to achieve acceptable specificity. However, our analysis showed that the problem went beyond the V segment constituting a higher fraction of junction nucleotides. Clonally unrelated sequences with short junctions have less entropy on a per nucleotide basis compared with similar sequences with longer junctions. This lack of interclonal diversity could be due to a lower mutation frequency, the use of a restricted set of D genes, or fewer untemplated nucleotide additions between the germline gene segments. However, the entropy of clonally related sequences was comparable between short and longer junctions, suggesting that a uniformly lower mutation frequency is not responsible for the lower diversity in short junctions. Current algorithms for inferring germline gene segments still struggle with inference of the D gene (58), making it difficult to determine whether the underlying cause of low diversity is due to D gene usage bias, fewer untemplated nucleotide additions, or another mechanism.

Despite the diversity differences between shorter and longer junctions, using a separate threshold to partition sequences with different junction lengths did not improve performance. As precision-recall curves showed, the single threshold selected by analyzing the entire dataset as a whole almost always optimized the tradeoff between sensitivity and PPV for all junction lengths. Although a few repertoires did have an alternate threshold with slightly improved performance, these thresholds were not evident from the distance-to-nearest distributions. It is possible a method other than hierarchical clustering could better separate clones with shorter junctions, but this would be a minor improvement, as absolute performance of the single linkage hierarchical clustering with Hamming distance was high.

False-positive clonal assignments still occur among sequences with longer junctions, but these appear to have a different underlying cause. In this case the lack of nucleotide diversity can be explained, at least in part, by an overrepresentation of the IGHJ6 gene. This gene extends an extra 10 nt into the junction region (44), causing sequences to appear more similar than others using different J genes. It is possible that a separate analysis of these sequences may improve performance. One possibility for better separating clones that do not have sufficient diversity in the junction region is to require shared mutations in the V or J region, although this would penalize clones that have few mutations overall. Likelihood-based approaches, such as Cloanalyst (28) or partis (29), may help to increase confidence in clones with short junctions or those using IGHJ6, although these approaches are too computationally intensive to use on full AIRR-Seq datasets. Although it has been suggested that partis improves performance relative to hierarchal clustering (29), this study did not use dataset-specific distance thresholds and thus likely underestimated the performance of the clustering-based method.

The comparative analysis presented in the present study suggests clear tradeoffs in the choice of distance and linkage methods. However, it is possible that different tradeoffs would become apparent in data with different clone size distributions, mutation frequencies, and such. The simulation data used to measure sensitivity and PPV were based on lineage tree topologies drawn from only four underlying repertoires. The simulations also assume that Ig sequences maintain the same junction length during clonal evolution, an assumption that was also made in the clustering algorithm. However, recent research indicates that a small percentage of SHM events may lead to changes in junction length within a clone (59). Insertions/deletions may be present in the junction due to sequencing errors, but the inclusion of UMIs followed by computational approaches for sequencing error correction can reduce this impact (6, 60). Few clonal grouping methods deal with junction length differences, and although these effects are also not accounted for in the present study, their influence on performance is expected to be small. Another possible source of bias in the performance on experimental data is the potential presence of so-called “public clones,” or highly similar sequences across individuals. Such sequences may skew specificity estimates that were approximated on publicly available human experimental datasets based on the frequency of inferred groups that spanned individuals. Furthermore, this specificity measure depends on the frequency of highly similar sequences found across individuals, which may differ from the frequency of highly similar sequences found within an individual by chance. Public clones are especially prevalent in L chains due to the relative lack of diversity in the junction region, caused in part by the absence of the D gene segment. Clonal grouping using only L chain sequences is expected to have lower performance than the H chain results shown in this study. Future studies could benefit from using a larger number of AIRR-Seq datasets that span age, tissue, disease state, and others in addition to simulations based on a larger number of underlying experimental Ig repertoires.

In summary, computationally grouping Ig sequences into B cell clones is a critical part of AIRR-seq studies, and enables understanding the structure and affinity maturation of the Ig repertoire. In this study, we developed a framework for comparative analysis of clonal grouping approaches and determined that single linkage hierarchical clustering with length-normalized nucleotide Hamming distance performs well on both human experimental and simulated datasets. This algorithm is available as part of the Change-O and SHazaM packages (35) in our Immcantation tool suite (http://immcantation.readthedocs.io).

Disclosures

The authors have no financial conflicts of interest.

Acknowledgments

We thank Gur Yaari and Moriah Cohen for preprocessing the influenza vaccination data used to generate the simulated data and for helpful comments on the manuscript. We thank the High Performance Computing facilities operated by the Yale Center for Research Computing and Yale’s W. M. Keck Biotechnology Laboratory.

Footnotes

  • The sequences presented in this article have been submitted to the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) under accession number PRJNA349143.

  • S.H.K. was supported by National Institutes of Health Grant R01 AI104739, and N.T.G. was supported by National Institutes of Health Grant R01 AI104739, United States–Israel Binational Science Foundation Grant 2013395, and by Pharmaceutical Research and Manufacturers of America Grant R12886. The Yale High Performance Computing facilities are funded by National Institutes of Health Grants RR19895 and RR029676-01.

  • The online version of this article contains supplemental material.

  • Abbreviations used in this article:

    AIRR-Seq
    adaptive immune receptor repertoire sequencing
    hS1F
    human S1F
    hS5F
    human S5F
    PPV
    positive predictive value
    SHM
    somatic hypermutation
    UMI
    unique molecular identifier
    WNV
    West Nile virus.

  • Received October 31, 2016.
  • Accepted January 4, 2017.
  • Copyright © 2017 by The American Association of Immunologists, Inc.

References

  1. ↵
    1. Volpe J. M.,
    2. T. B. Kepler
    . 2008. Large-scale analysis of human heavy chain V(D)J recombination patterns. Immunome Res. 4: 3.
    OpenUrlCrossRefPubMed
  2. ↵
    1. Kleinstein S. H.,
    2. Y. Louzoun,
    3. M. J. Shlomchik
    . 2003. Estimating hypermutation rates from clonal tree data. J. Immunol. 171: 4639–4649.
    OpenUrlAbstract/FREE Full Text
  3. ↵
    1. McKean D.,
    2. K. Huppi,
    3. M. Bell,
    4. L. Staudt,
    5. W. Gerhard,
    6. M. Weigert
    . 1984. Generation of antibody diversity in the immune response of BALB/c mice to influenza virus hemagglutinin. Proc. Natl. Acad. Sci. USA 81: 3180–3184.
    OpenUrlAbstract/FREE Full Text
  4. ↵
    1. Wu Y.-C. B.,
    2. D. Kipling,
    3. D. K. Dunn-Walters
    . 2012. Age-related changes in human peripheral blood IGH repertoire following vaccination. Front. Immunol. 3: 193.
    OpenUrlCrossRefPubMed
  5. ↵
    1. Boyd S. D.,
    2. S. A. Joshi
    . 2014. High-throughput DNA sequencing analysis of antibody repertoires. Microbiol. Spectr. 2: 1–13.
    OpenUrl
  6. ↵
    1. Yaari G.,
    2. S. H. Kleinstein
    . 2015. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 7: 121.
    OpenUrl
  7. ↵
    1. Weinstein J. A.,
    2. N. Jiang,
    3. R. A. White III.,
    4. D. S. Fisher,
    5. S. R. Quake
    . 2009. High-throughput sequencing of the zebrafish antibody repertoire. Science 324: 807–810.
    OpenUrlAbstract/FREE Full Text
  8. ↵
    1. Boyd S. D.,
    2. E. L. Marshall,
    3. J. D. Merker,
    4. J. M. Maniar,
    5. L. N. Zhang,
    6. B. Sahaf,
    7. C. D. Jones,
    8. B. B. Simen,
    9. B. Hanczaruk,
    10. K. D. Nguyen,
    11. et al
    . 2009. Measurement and clinical monitoring of human lymphocyte clonality by massively parallel V-D-J pyrosequencing. Sci. Transl. Med. 1: 12ra23.
    OpenUrlAbstract/FREE Full Text
  9. ↵
    1. Benichou J.,
    2. R. Ben-Hamo,
    3. Y. Louzoun,
    4. S. Efroni
    . 2012. Rep-Seq: uncovering the immunological repertoire through next-generation sequencing. Immunology 135: 183–191.
    OpenUrlCrossRefPubMed
  10. ↵
    1. Logan A. C.,
    2. H. Gao,
    3. C. Wang,
    4. B. Sahaf,
    5. C. D. Jones,
    6. E. L. Marshall,
    7. I. Buño,
    8. R. Armstrong,
    9. A. Z. Fire,
    10. K. I. Weinberg,
    11. et al
    . 2011. High-throughput VDJ sequencing for quantification of minimal residual disease in chronic lymphocytic leukemia and immune reconstitution assessment. Proc. Natl. Acad. Sci. USA 108: 21194–21199.
    OpenUrlAbstract/FREE Full Text
  11. ↵
    1. Hershberg U.,
    2. W. Meng,
    3. B. Zhang,
    4. N. Haff,
    5. E. W. St Clair,
    6. P. L. Cohen,
    7. P. D. McNair,
    8. L. Li,
    9. M. C. Levesque,
    10. E. T. Luning Prak
    . 2014. Persistence and selection of an expanded B-cell clone in the setting of rituximab therapy for Sjögren’s syndrome. Arthritis Res. Ther. 16: R51.
    OpenUrlCrossRefPubMed
  12. ↵
    1. Boletis J. N.,
    2. S. Marinaki,
    3. C. Skalioti,
    4. S. S. Lionaki,
    5. A. Iniotaki,
    6. P. P. Sfikakis
    . 2009. Rituximab and mycophenolate mofetil for relapsing proliferative lupus nephritis: a long-term prospective study. Nephrol. Dial. Transplant. 24: 2157–2160.
    OpenUrlAbstract/FREE Full Text
  13. ↵
    1. Ademokun A.,
    2. Y.-C. C. Wu,
    3. V. Martin,
    4. R. Mitra,
    5. U. Sack,
    6. H. Baxendale,
    7. D. Kipling,
    8. D. K. Dunn-Walters
    . 2011. Vaccination-induced changes in human B-cell repertoire and pneumococcal IgM and IgA antibody at different ages. Aging Cell 10: 922–930.
    OpenUrlCrossRefPubMed
  14. ↵
    1. Wang C.,
    2. Y. Liu,
    3. L. T. Xu,
    4. K. J. Jackson,
    5. K. M. Roskin,
    6. T. D. Pham,
    7. J. Laserson,
    8. E. L. Marshall,
    9. K. Seo,
    10. J.-Y. Lee,
    11. et al
    . 2014. Effects of aging, cytomegalovirus infection, and EBV infection on human B cell repertoires. J. Immunol. 192: 603–611.
    OpenUrlAbstract/FREE Full Text
  15. ↵
    1. Stern J. N. H.,
    2. G. Yaari,
    3. J. A. Vander Heiden,
    4. G. M. Church,
    5. W. F. Donahue,
    6. R. Q. Hintzen,
    7. A. J. Huttner,
    8. J. D. Laman,
    9. R. M. Nagra,
    10. A. Nylander,
    11. et al
    . 2014. B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes. Sci. Transl. Med. 6: 248ra107.
    OpenUrlAbstract/FREE Full Text
  16. ↵
    1. Wu Y. C. B.,
    2. L. K. James,
    3. J. A. Vander Heiden,
    4. M. Uduman,
    5. S. R. Durham,
    6. S. H. Kleinstein,
    7. D. Kipling,
    8. H. J. Gould
    . 2014. Influence of seasonal exposure to grass pollen on local and peripheral blood IgE repertoires in patients with allergic rhinitis. J. Allergy Clin. Immunol. 134: 604–612.
    OpenUrlCrossRef
  17. ↵
    1. Zhu J.,
    2. G. Ofek,
    3. Y. Yang,
    4. B. Zhang,
    5. M. K. Louder,
    6. G. Lu,
    7. K. McKee,
    8. M. Pancera,
    9. J. Skinner,
    10. Z. Zhang,
    11. et al,
    12. NISC Comparative Sequencing Program
    . 2013. Mining the antibodyome for HIV-1-neutralizing antibodies with next-generation sequencing and phylogenetic pairing of heavy/light chains. Proc. Natl. Acad. Sci. USA 110: 6470–6475.
    OpenUrlAbstract/FREE Full Text
  18. ↵
    1. Tsioris K.,
    2. N. T. Gupta,
    3. A. O. Ogunniyi,
    4. R. M. Zimnisky,
    5. F. Qian,
    6. Y. Yao,
    7. X. Wang,
    8. J. N. H. Stern,
    9. R. Chari,
    10. A. W. Briggs,
    11. et al
    . 2015. Neutralizing antibodies against West Nile virus identified directly from human B cells by single-cell analysis and next generation sequencing. Integr. Biol. 7: 1587–1597.
    OpenUrl
  19. ↵
    1. Zhu J.,
    2. X. Wu,
    3. B. Zhang,
    4. K. McKee,
    5. S. O’Dell,
    6. C. Soto,
    7. T. Zhou,
    8. J. P. Casazza,
    9. J. C. Mullikin,
    10. P. D. Kwong,
    11. et al,
    12. NISC Comparative Sequencing Program
    . 2013. De novo identification of VRC01 class HIV-1–neutralizing antibodies by next-generation sequencing of B-cell transcripts. Proc. Natl. Acad. Sci. USA 110: E4088–E4097.
    OpenUrlAbstract/FREE Full Text
  20. ↵
    1. Trück J.,
    2. M. N. Ramasamy,
    3. J. D. Galson,
    4. R. Rance,
    5. J. Parkhill,
    6. G. Lunter,
    7. A. J. Pollard,
    8. D. F. Kelly
    . 2015. Identification of antigen-specific B cell receptor sequences using public repertoire analysis. J. Immunol. 194: 252–261.
    OpenUrlAbstract/FREE Full Text
  21. ↵
    1. Jiang N.,
    2. J. He,
    3. J. A. Weinstein,
    4. L. Penland,
    5. S. Sasaki,
    6. X.-S. He,
    7. C. L. Dekker,
    8. N.-Y. Zheng,
    9. M. Huang,
    10. M. Sullivan,
    11. et al
    . 2013. Lineage structure of the human antibody repertoire in response to influenza vaccination. Sci. Transl. Med. 5: 171ra19.
    OpenUrlAbstract/FREE Full Text
  22. ↵
    1. Hill M.
    1973. Diversity and evenness: a unifying notation and its consequences. Ecology 54: 427–432.
    OpenUrlCrossRef
  23. ↵
    1. van Dongen J. J. M.,
    2. A. W. Langerak,
    3. M. Brüggemann,
    4. P. A. S. Evans,
    5. M. Hummel,
    6. F. L. Lavender,
    7. E. Delabesse,
    8. F. Davi,
    9. E. Schuuring,
    10. R. García-Sanz,
    11. et al
    . 2003. Design and standardization of PCR primers and protocols for detection of clonal immunoglobulin and T-cell receptor gene recombinations in suspect lymphoproliferations: report of the BIOMED-2 Concerted Action BMH4-CT98-3936. Leukemia 17: 2257–2317.
    OpenUrlCrossRefPubMed
  24. ↵
    1. Nei M.,
    2. S. Kumar
    . 2000. Molecular Evolution and Phylogenetics. Oxford University Press, Oxford, U.K.
  25. ↵
    1. Horns F.,
    2. C. Vollmers,
    3. D. Croote,
    4. S. F. Mackey,
    5. G. E. Swan,
    6. C. L. Dekker,
    7. M. M. Davis,
    8. S. R. Quake
    . 2016. Lineage tracing of human B cells reveals the in vivo landscape of human antibody class switching. eLife 5: e16578.
    OpenUrl
  26. ↵
    1. Looney T. J.,
    2. J.-Y. Lee,
    3. K. M. Roskin,
    4. R. A. Hoh,
    5. J. King,
    6. J. Glanville,
    7. Y. Liu,
    8. T. D. Pham,
    9. C. L. Dekker,
    10. M. M. Davis,
    11. S. D. Boyd
    . 2016. Human B-cell isotype switching origins of IgE. J. Allergy Clin. Immunol. 137: 579–586.e7.
    OpenUrl
  27. ↵
    1. Hershberg U.,
    2. E. T. Luning Prak
    . 2015. The analysis of clonal expansions in normal and autoimmune B cell repertoires. Philos. Trans. R. Soc. Lond. B Biol. Sci. 370: 20140239.
    OpenUrlAbstract/FREE Full Text
  28. ↵
    1. Kepler T. B.
    2013. Reconstructing a B-cell clonal lineage. I. Statistical inference of unobserved ancestors. F1000 Res. 2: 103.
    OpenUrl
  29. ↵
    1. Ralph D.,
    2. F. A. Matsen
    . 2016. Likelihood-based inference of B-cell clonal families. PLoS Comput. Biol. 12: e1005086.
    OpenUrl
  30. ↵
    1. Chen Z.,
    2. A. M. Collins,
    3. Y. Wang,
    4. B. A. Gaëta
    . 2010. Clustering-based identification of clonally-related immunoglobulin gene sequence sets. Immunome Res. 6(Suppl. 1): S4.
    OpenUrl
  31. ↵
    1. Glanville J.,
    2. T. C. Kuo,
    3. H.-C. von Büdingen,
    4. L. Guey,
    5. J. Berka,
    6. P. D. Sundar,
    7. G. Huerta,
    8. G. R. Mehta,
    9. J. R. Oksenberg,
    10. S. L. Hauser,
    11. et al
    . 2011. Naive antibody gene-segment frequencies are heritable and unaltered by chronic lymphocyte ablation. Proc. Natl. Acad. Sci. USA 108: 20066–20071.
    OpenUrlAbstract/FREE Full Text
  32. ↵
    1. Jiang N.,
    2. J. A. Weinstein,
    3. L. Penland,
    4. R. A. White III.,
    5. D. S. Fisher,
    6. S. R. Quake
    . 2011. Determinism and stochasticity during maturation of the zebrafish antibody repertoire. Proc. Natl. Acad. Sci. USA 108: 5348–5353.
    OpenUrlAbstract/FREE Full Text
  33. ↵
    1. Wu Y.-C.,
    2. D. Kipling,
    3. H. S. Leong,
    4. V. Martin,
    5. A. A. Ademokun,
    6. D. K. Dunn-Walters
    . 2010. High-throughput immunoglobulin repertoire analysis distinguishes between human IgM memory and switched memory B-cell populations. Blood 116: 1070–1078.
    OpenUrlAbstract/FREE Full Text
  34. ↵
    1. Briney B.,
    2. K. Le,
    3. J. Zhu,
    4. D. R. Burton
    . 2016. Clonify: unseeded antibody lineage assignment from next-generation sequencing data. Sci. Rep. 6: 23901.
    OpenUrlCrossRefPubMed
  35. ↵
    1. Gupta N. T.,
    2. J. A. Vander Heiden,
    3. M. Uduman,
    4. D. Gadala-Maria,
    5. G. Yaari,
    6. S. H. Kleinstein
    . 2015. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31: 3356–3358.
    OpenUrlAbstract/FREE Full Text
  36. ↵
    1. Parameswaran P.,
    2. Y. Liu,
    3. K. M. Roskin,
    4. K. K. L. Jackson,
    5. V. P. Dixit,
    6. J.-Y. Lee,
    7. K. L. Artiles,
    8. S. Zompi,
    9. M. J. Vargas,
    10. B. B. Simen,
    11. et al
    . 2013. Convergent antibody signatures in human dengue. Cell Host Microbe 13: 691–700.
    OpenUrlCrossRefPubMed
  37. ↵
    1. Vander Heiden J. A.,
    2. G. Yaari,
    3. M. Uduman,
    4. J. N. H. Stern,
    5. K. C. O’Connor,
    6. D. A. Hafler,
    7. F. Vigneault,
    8. S. H. Kleinstein
    . 2014. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 30: 1930–1932.
    OpenUrlAbstract/FREE Full Text
  38. ↵
    1. Alamyar E.,
    2. P. Duroux,
    3. M. P. Lefranc,
    4. V. Giudicelli
    . 2012. IMGT® tools for the nucleotide analysis of immunoglobulin (IG) and T cell receptor (TR) V-(D)-J repertoires, polymorphisms, and IG mutations: IMGT/V-QUEST and IMGT/HighV-QUEST for NGS. Methods Mol. Biol. 882: 569–604.
    OpenUrlCrossRefPubMed
  39. ↵
    1. Vander Heiden J. A.,
    2. P. Stathopoulos,
    3. J. Q. Zhou,
    4. L. Chen,
    5. T. J. Gilbert,
    6. C. R. Bolen,
    7. R. J. Barohn,
    8. M. M. Dimachkie,
    9. E. Ciafaloni,
    10. T. J. Broering
    et al. 2017. Dysregulation of B cell repertoire formation in myasthenia gravis patients revealed through deep sequencing. J. Immunol. 198: 1460–1473.
    OpenUrlAbstract/FREE Full Text
  40. ↵
    1. Laserson U.,
    2. F. Vigneault,
    3. D. Gadala-Maria,
    4. G. Yaari,
    5. M. Uduman,
    6. J. A. Vander Heiden,
    7. W. Kelton,
    8. S. Taek Jung,
    9. Y. Liu,
    10. J. Laserson,
    11. et al
    . 2014. High-resolution antibody dynamics of vaccine-induced immune responses. Proc. Natl. Acad. Sci. USA 111: 4928–4933.
    OpenUrlAbstract/FREE Full Text
  41. ↵
    1. Di Niro R.,
    2. S.-J. Lee,
    3. J. A. Vander Heiden,
    4. R. A. Elsner,
    5. N. Trivedi,
    6. J. M. Bannock,
    7. N. T. Gupta,
    8. S. H. Kleinstein,
    9. F. Vigneault,
    10. T. J. Gilbert,
    11. et al
    . 2015. Salmonella infection drives promiscuous B cell activation followed by extrafollicular affinity maturation. Immunity 43: 120–131.
    OpenUrlCrossRefPubMed
  42. ↵
    1. Edgar R. C.
    2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32: 1792–1797.
    OpenUrlAbstract/FREE Full Text
  43. ↵
    1. Ye J.,
    2. N. Ma,
    3. T. L. Madden,
    4. J. M. Ostell
    . 2013. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 41: W34–W40.
    OpenUrlAbstract/FREE Full Text
  44. ↵
    1. Giudicelli V.,
    2. D. Chaume,
    3. M.-P. Lefranc
    . 2005. IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes. Nucleic Acids Res. 33: D256–D261.
    OpenUrlAbstract/FREE Full Text
  45. ↵
    1. Yaari G.,
    2. J. A. Vander Heiden,
    3. M. Uduman,
    4. D. Gadala-Maria,
    5. N. Gupta,
    6. J. N. H. Stern,
    7. K. C. O’Connor,
    8. D. A. Hafler,
    9. U. Laserson,
    10. F. Vigneault,
    11. S. H. Kleinstein
    . 2013. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Front. Immunol. 4: 358.
    OpenUrlPubMed
  46. ↵
    1. Shapiro G. S.,
    2. M. C. Ellison,
    3. L. J. Wysocki
    . 2003. Sequence-specific targeting of two bases on both DNA strands by the somatic hypermutation mechanism. Mol. Immunol. 40: 287–295.
    OpenUrlCrossRefPubMed
  47. ↵
    1. Wand M. P.,
    2. M. C. Jones
    . 1995. Kernel Smoothing, Vol. 60 of Monographs on Statistics and Applied Probability. Chapman and Hall, London.
  48. ↵
    1. Darlington R.B.
    1970. Is kurtosis really “peakedness?”. Am. Stat. 24: 19–22.
    OpenUrlCrossRef
  49. ↵
    Hansen, B. 2004. Bandwidth selection for nonparametric distribution estimation. Available at: http://www.ssc.wisc.edu/∼bhansen/papers/smooth.pdf. Accessed: March 22, 2016.
  50. ↵
    Guidoum, A. C. 2015. kedd: Kernel Estimator and Bandwidth Selection for Density and Its Derivatives. R Package version 1.0.3. Available at: https://rdrr.io/cran/kedd/. Accessed: January 4, 2016.
  51. ↵
    1. Jain A. K.,
    2. R. C. Dubes
    . 1988. Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River, NJ.
  52. ↵
    1. Tas J. M. J.,
    2. L. Mesin,
    3. G. Pasqual,
    4. S. Targ,
    5. J. T. Jacobsen,
    6. Y. M. Mano,
    7. C. S. Chen,
    8. J.-C. Weill,
    9. C.-A. Reynaud,
    10. E. P. Browne,
    11. et al
    . 2016. Visualizing antibody affinity maturation in germinal centers. Science 351: 1048–1054.
    OpenUrlAbstract/FREE Full Text
  53. ↵
    1. Safonova Y.,
    2. A. Lapidus,
    3. J. Lill
    . 2015. IgSimulator: a versatile immunosequencing simulator. Bioinformatics 31: 3213–3215.
    OpenUrlAbstract/FREE Full Text
  54. ↵
    1. Yaari G.,
    2. M. Uduman,
    3. S. H. Kleinstein
    . 2012. Quantifying selection in high-throughput immunoglobulin sequencing data sets. Nucleic Acids Res. 40: e134.
    OpenUrlAbstract/FREE Full Text
  55. ↵
    1. Rousseeuw P. J.
    1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20: 53–65.
    OpenUrlCrossRef
  56. ↵
    StatSoft, Inc. 2013. Electronic Statistics Textbook. StatSoft, Tulsa, OK. Available at: http://www.statsoft.com/textbook/. Accessed: April 14, 2016.
  57. ↵
    1. Tibshirani R.,
    2. G. Walther,
    3. T. Hastie
    . 2001. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B Stat. Methodol. 63: 411–423.
    OpenUrlCrossRef
  58. ↵
    1. Munshaw S.,
    2. T. B. Kepler
    . 2010. SoDA2: a Hidden Markov Model approach for identification of immunoglobulin rearrangements. Bioinformatics 26: 867–872.
    OpenUrlAbstract/FREE Full Text
  59. ↵
    1. Yeap L.-S.,
    2. J. K. Hwang,
    3. Z. Du,
    4. R. M. Meyers,
    5. F.-L. Meng,
    6. A. Jakubauskaitė,
    7. M. Liu,
    8. V. Mani,
    9. D. Neuberg,
    10. T. B. Kepler,
    11. et al
    . 2015. Sequence-intrinsic mechanisms that target AID mutational outcomes on antibody genes. Cell 163: 1124–1137.
    OpenUrlCrossRefPubMed
  60. ↵
    1. Shugay M.,
    2. O. V. Britanova,
    3. E. M. Merzlyak,
    4. M. A. Turchaninova,
    5. I. Z. Mamedov,
    6. T. R. Tuganbaev,
    7. D. A. Bolotin,
    8. D. B. Staroverov,
    9. E. V. Putintseva,
    10. K. Plevova,
    11. et al
    . 2014. Towards error-free profiling of immune repertoires. Nat. Methods 11: 653–655.
    OpenUrlCrossRefPubMed
View Abstract
PreviousNext
Back to top

In this issue

The Journal of Immunology: 198 (6)
The Journal of Immunology
Vol. 198, Issue 6
15 Mar 2017
  • Table of Contents
  • Table of Contents (PDF)
  • About the Cover
  • Advertising (PDF)
  • Back Matter (PDF)
  • Editorial Board (PDF)
  • Front Matter (PDF)
Print
Download PDF
Article Alerts
Sign In to Email Alerts with your Email Address
Email Article

Thank you for your interest in spreading the word about The Journal of Immunology.

NOTE: We only request your email address so that the person you are recommending the page to knows that you wanted them to see it, and that it is not junk mail. We do not capture any email address.

Enter multiple addresses on separate lines or separate them with commas.
Hierarchical Clustering Can Identify B Cell Clones with High Confidence in Ig Repertoire Sequencing Data
(Your Name) has forwarded a page to you from The Journal of Immunology
(Your Name) thought you would like to see this page from the The Journal of Immunology web site.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Citation Tools
Hierarchical Clustering Can Identify B Cell Clones with High Confidence in Ig Repertoire Sequencing Data
Namita T. Gupta, Kristofor D. Adams, Adrian W. Briggs, Sonia C. Timberlake, Francois Vigneault, Steven H. Kleinstein
The Journal of Immunology March 15, 2017, 198 (6) 2489-2499; DOI: 10.4049/jimmunol.1601850

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
Share
Hierarchical Clustering Can Identify B Cell Clones with High Confidence in Ig Repertoire Sequencing Data
Namita T. Gupta, Kristofor D. Adams, Adrian W. Briggs, Sonia C. Timberlake, Francois Vigneault, Steven H. Kleinstein
The Journal of Immunology March 15, 2017, 198 (6) 2489-2499; DOI: 10.4049/jimmunol.1601850
del.icio.us logo Digg logo Reddit logo Twitter logo CiteULike logo Facebook logo Google logo Mendeley logo
  • Tweet Widget
  • Facebook Like

Jump to section

  • Article
    • Abstract
    • Introduction
    • Materials and Methods
    • Results
    • Discussion
    • Disclosures
    • Acknowledgments
    • Footnotes
    • References
  • Figures & Data
  • Info & Metrics
  • PDF + SI
  • PDF

Related Articles

Cited By...

More in this TOC Section

  • Conserved IFN Signature between Adult and Pediatric Eosinophilic Esophagitis
  • Network Analysis Reveals a Distinct Axis of Macrophage Activation in Response to Conflicting Inflammatory Cues
  • Maternal and Cord Blood 25-Hydroxyvitamin D3 Are Associated with Increased Cord Blood and Naive and Activated Regulatory T Cells: The Barwon Infant Study
Show more SYSTEMS IMMUNOLOGY

Similar Articles

Navigate

  • Home
  • Current Issue
  • Next in The JI
  • Archive
  • Brief Reviews
  • Pillars of Immunology
  • Translating Immunology

For Authors

  • Submit a Manuscript
  • Instructions for Authors
  • About the Journal
  • Journal Policies
  • Editors

General Information

  • Advertisers
  • Subscribers
  • Rights and Permissions
  • Accessibility Statement
  • FAR 889
  • Privacy Policy
  • Disclaimer

Journal Services

  • Email Alerts
  • RSS Feeds
  • ImmunoCasts
  • Twitter

Copyright © 2021 by The American Association of Immunologists, Inc.

Print ISSN 0022-1767        Online ISSN 1550-6606