It is becoming increasingly apparent that many of the genes in the class III region of the human MHC encode proteins involved in the immune and inflammatory responses. Furthermore, genetic studies have indicated that genes within the class III region, particularly the telomeric segment containing the TNF gene, could contribute to susceptibility to diseases of immune-related etiology. We have sequenced an 82-kb segment of DNA around the TNF gene to identify candidate disease susceptibility genes in this region. The 10 known genes in this region have been precisely positioned with the order allograft inflammatory factor 1, G1, 1C7, leukocyte-specific transcript 1 (B144), lymphotoxin B, TNF, lymphotoxin A, NB6, IKBL, BAT1 (centromere to telomere), and their genomic structures have been defined. Comparison of the G1 genomic region with previously described cDNA and genomic sequences, together with the results of reverse transcriptase-PCR, indicates that three alternative transcripts, G1, allograft inflammatory factor 1, and IFN-γ-responsive transcript, are all derived from this gene. The completion of the sequence of 1C7 (D6S2570) has revealed that this gene encodes a putative novel member of the Ig superfamily. A number of alternatively spliced transcripts of 1C7 were identified by reverse transcriptase-PCR, all of which are expressed in immune-related cell lines. Alternative splicing within the Ig domain-encoding region was seen to result in possible set switching between an IgV domain and an IgC2 domain. Lastly, a previously unidentified gene, homologous to a number of V-ATPase G subunits, has been located 1 kb telomeric of IKBL.
The human MHC spans ∼4 megabases (Mb) of DNA in the chromosome band 6p21.3. Of this, the central ∼1.1 Mb is termed the class III region (1). It is becoming increasingly apparent that, in common with the MHC class I and II regions (for review see 2), the class III region contains many genes that encode proteins involved in immune and inflammatory responses (3). These include the C2, C4, and factor B components of the complement system, members of the 70-kDa heat shock protein family, the cytokines TNF, lymphotoxin α (LTα) and LTβ (3), and the inflammatory mediator lysophosphatidic acid acyltransferase (4). Furthermore, genetic studies have indicated that genes within the MHC contribute to immune-related diseases, such as insulin-dependent diabetes mellitus (IDDM), rheumatoid arthritis, ankylosing spondylitis, myasthenia gravis, common variable immunodeficiency (CVID), and IgA deficiency (5, 6). Strong associations have been found between these diseases and alleles of genes in the MHC class II region. However, because linkage disequilibrium extends across the whole of the MHC, disease susceptibility determinants may also exist within the MHC class I and class III regions. In addition, a detailed study using polymorphic microsatellite markers has provided strong evidence for the involvement of genes in the class III region in the development of IDDM (R. E. March, unpublished observations), and a recent linkage study by Schroeder et al. (7) has localized a disease susceptibility locus for IgA deficiency and CVID to the telomeric end of the MHC class III region between G1 and the class I gene HLA-B.
The significance and molecular bases of the MHC-linked disease associations described in the literature are still unclear, but the identification and characterization of all coding and regulatory sequences will be invaluable, and even necessary, for the elucidation of the role of the MHC in the progression of autoimmune diseases. The segment of DNA at the telomeric end of the MHC class III region defined by Schroeder et al. (7) has already been extensively studied (8, 9, 10, 11, 12, 13, 14). The two most extensive of these studies are those by Shiina et al. (13) and Guillaudeux et al. (14). The study by Shiina et al. (13) involved the complete sequence analysis of 146 kb of DNA between the IKBL and MICA genes from the HLA haplotype A2, B62, Cw10, DR4, whilst the recent study by Guillaudeux et al. (14) involved the sequence analysis of 424 kb of DNA between the TNF gene cluster and a newly identified gene ∼20 kb telomeric of Otf-3 at the centromeric end of the class I region. However, gaps remained in the sequence data available from the G1-IKBL region. In this study, to complete the analysis of this region and to complement the sequence data already available with data from a different haplotype, we have sequenced an ∼82-kb segment of DNA encompassing genes between G1 and BAT1. Transcripts previously mapped to this region included the allograft inflammatory factor 1 (AIF1), G1 (D6S50E), leukocyte-specific transcript 1 (LST-1 (B144)) (D6S49E), 1C7, lymphotoxin β (LTB), TNF, lymphotoxin α (LTA), NB6, IKBL (NFKBKIL1), and BAT1 (D6S81E). The AIF1 gene encodes a cytokine-responsive macrophage-specific protein (15), while the cytokines TNF, LTα, and LTβ are involved in the inflammatory response (for full review see 16). IKBL is a putative member of the IκB family of proteins that regulate the NF-κB family of transcription factors (17) and may be involved in regulating the expression of cytokine genes. BAT1 encodes a putative nuclear RNA helicase of the DEAD family (18).
The products of the G1, LST-1, and 1C7 genes have not yet had a function or protein family membership assigned to them. However, they all appear to be expressed exclusively in immune-related cell lines, suggesting their involvement in the immune response (19, 20, 21). We now report the positioning of the first exon of the 1C7 gene, which had not been identified previously (21) (allocated the name D6S2570E in the Human Gene Nomenclature Database). The characterization of this first exon has allowed us to establish that the 1C7 protein is a putative novel member of the Ig superfamily, and analysis by RT-PCR has shown that it is expressed at the RNA level in a number of alternatively spliced forms. We also report a new gene, 1 kb telomeric of IKBL, that encodes a putative V-ATPase G subunit (allocated the name ATP6G in the Human Gene Nomenclature Database). This gene also exhibits alternative splicing and variation in the length of its 3′ untranslated region (UTR). Finally, we report the characterization of a number of immune-related transcripts that are encoded within the G1 genomic region and provide evidence to show that these transcripts result from the alternative splicing of a single gene.
Materials and Methods
The two overlapping cosmid clones covering the G1-BAT1 region, TN62 and TN82 (homozygous for the HLA haplotypes A2, B7, DR2, C2C, BFS, C4A3, C4BQ0), were sequenced using an M13 shotgun strategy (22) with fluorescent dye primer and dye terminator sequencing chemistries (Amersham, Little Chalfont, U.K.). Cosmid DNA was sonicated and fragments of 0.5–1 kb were selected for cloning into M13mp18. Recombinant M13mp18 phage DNA was purified from culture supernatants using a Vistra DNA Labstation and cycle sequenced using ThermoSequenase (Amersham) in a 96-well format on a Hybaid Omnigene thermocycler (95°C for 5 min followed by 20 cycles of 95°C for 30 s; 60°C for 30 s) in the presence of the fluorescent dye-labeled M13 universal primer (5′-TGACCGGCAGCAAAATG-3′). The sequencing reactions were run on an Applied Biosystems 377 automated DNA sequencer (Applied Biosystems, Foster City, CA), and sequence data were analyzed with the ABI377-dedicated software. Individual sequence traces were processed and reassembled using the programs PREGAP and GAP v4.0-β4 from the Staden suite of software (Medical Research Council Laboratory of Molecular Biology, Cambridge, U.K.).
Ambiguities within the sequence were resolved, and the sequences across areas of single orientation read were confirmed with dye terminator sequencing chemistries, while gaps between contigs were closed by either sequencing the reverse strand of long clones (over 800 nucleotides (nt)) that extended into the gap or by the sequencing of PCR products covering the gaps.
The expression of transcripts was investigated by RT-PCR using total RNA and the Promega reverse transcription system (Promega, Chilworth Research Centre, Southampton, U.K.) according to the manufacturer’s protocol (gene-specific primers can be found in Table I). The cell lines used were: Raji (B cell), Jurkat 6 (T cell), Molt 4 (T cell), HL60 (monocyte), U937 (macrophage), HepG2 (hepatocyte), HeLa (epithelial), HT1080 (epithelial), and SW620 (adenocarcinoma). PCR primers were designed to give products containing more than one exon so that amplification products arising from genomic DNA contamination were easily discernible. The first-round cDNA synthesis was performed in a final volume of 20 μl with 1 μg of total RNA; 10 μl of this reaction mix was used in a 50 μl PCR reaction with the transcript-specific primers and amplification conditions listed in Table I. Each transcript-specific RT-PCR reaction was performed in at least triplicate to allow for any variation between reactions. Control amplification reactions with primers derived from β-actin were conducted for each first-round cDNA synthesis reaction. The identities of PCR products were confirmed, either by direct dye-terminator sequencing or, when multiple products were obtained, by cloning the RT-PCR products into the pGEM-T EasyVector System (Promega) following the manufacturer’s protocol, then isolating and sequencing the cloned DNA.
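The reasoning behind exon-spanning primers can be illustrated with a short sketch. The exon and intron lengths used here are hypothetical placeholders and are not taken from Table I; the function name is likewise illustrative. The point is only that a product amplified from contaminating genomic DNA retains the intervening intron(s) and is therefore longer than the product amplified from spliced cDNA.

```python
def expected_product_sizes(exon_parts, intron_lengths):
    """Return (cDNA product size, genomic product size) in nucleotides.

    exon_parts:     lengths of the exon segments covered by the amplicon,
                    in 5'->3' order (primer sequences included).
    intron_lengths: lengths of the introns lying between those segments.
    """
    if len(intron_lengths) != len(exon_parts) - 1:
        raise ValueError("need exactly one intron between adjacent exon segments")
    cdna = sum(exon_parts)                 # introns are spliced out of the mRNA
    genomic = cdna + sum(intron_lengths)   # genomic DNA still carries the introns
    return cdna, genomic


# Hypothetical amplicon: 120 nt of one exon, 180 nt of the next, 950-nt intron between.
cdna_size, genomic_size = expected_product_sizes([120, 180], [950])
print(cdna_size, genomic_size)  # 300 1250
```

On a gel, the two products in this hypothetical case would differ by the intron length and so would be readily told apart.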
The Wisconsin Package Version 9-UNIX (Genetics Computer Group), maintained at the University of Oxford Molecular Biology Data Centre, was used for the majority of the sequence analysis and database interrogation. The DNA sequence generated was screened against the EMBL, SwissProt, PDB, EMBL-EST, and TIGR-EST ([email protected]) databases to position known genes and identify possible new coding regions. Repetitive elements were identified with the aid of the RepeatMasker server (A.F.A. Smit and P. Green, RepeatMasker at http://ftp.genome.washington.edu/RH/RepeatMasker.html), and potential coding regions were defined using the NIX exon prediction program (http://www.hgmp.mrc.ac.uk/Registered/Webapp/nix/) from the Human Genome Mapping Project Resource Centre (Hinxton, U.K.). Predictions of protein secondary structure, solvent accessibility, and transmembrane regions were conducted using the JPred consensus secondary structure prediction server (http://circinus.ebi.ac.uk:8081/) or PredictProtein program ([email protected]). The GCG program SIGCLEAVE and the SMART (Simple Modular Architecture Research Tool) server (http://coot.embl-heidelberg.de/SMART/) were used to identify leader peptides. Sequence motifs and protein domains were identified using a combination of the GCG program MOTIF, the Prosite Profilescan server (http://ulrec3.unil.ch/software/PFSCAN_form.html), and the SMART server. Multiple alignments of amino acid sequences were performed using the Clustalx software (National Center for Biotechnology Information, Bethesda, MD), making use of protein structure information from sequences within the PDB database wherever possible. Alignments were hand-edited using the GCG9 SeqLab multiple alignment editor.
Sequence assembly and alignment of known genes
The complete nucleotide sequences of cosmids TN62 and TN82 were determined from a combined total of 1757 templates. A single contig of 81,800 nt in length was obtained with an overlap of 5,341 nt between the two cosmids and an average depth of ∼9.6 reads per nucleotide sequenced. There were no discrepancies between the sequences generated for the two cosmids across the region of overlap. The complete genomic DNA sequences of cosmids TN62 and TN82 have been deposited in the EMBL database under the accession number HSY14768. Exon positions and other sequence features reported here are included in this database entry.
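As a rough consistency check on these assembly figures, the mean read length implied by the reported depth can be estimated. This is a minimal sketch under two simplifying assumptions that are not stated in the text: that each template contributed one read, and that depth equals total read bases divided by contig length. Templates that were re-sequenced to close gaps would lower the true per-read average.

```python
CONTIG_LENGTH_NT = 81_800   # length of the single contig
TEMPLATES = 1757            # combined number of templates
MEAN_DEPTH = 9.6            # approximate reads per nucleotide


def implied_mean_read_length(contig_nt, n_reads, depth):
    """Mean read length (nt) implied by depth = n_reads * read_length / contig_nt."""
    return depth * contig_nt / n_reads


read_len = implied_mean_read_length(CONTIG_LENGTH_NT, TEMPLATES, MEAN_DEPTH)
print(f"{read_len:.0f} nt")  # roughly 447 nt under the assumptions above
```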
The precise location and genomic structures of the nine known genes in the region analyzed have been determined and the order of these genes has been shown to be G1/AIF1, 1C7, LST-1 (B144), LTB, TNF, LTA, NB6, IKBL, BAT1, centromere to telomere (Fig. 1). The G1 (19) and AIF1 (23) transcripts were found to be derived from the same genomic region, while the 1C7 gene has been shown to lie on the opposite side of LST-1 (B144) to that previously published (21). Approximately 52 kb of the sequence reported here had previously been deposited in the EMBL database (accession nos. U00921, L11016, Z15026, X02910, U42625, X59350, X02911, M55913, X59351, Z15027, AC004181, and AB000876). The database entries AC004181 and AB000876 contain the sequence data generated by Guillaudeux et al. (14) and Shiina et al. (13), respectively. Both of these sequences overlap the telomeric end of our contig with 99.8% identity, AB000876 over 24,702 nt and AC004181 over 31,438 nt. Further investigation will determine whether any of the nucleotide differences between the generated sequences are haplotype specific.
The G1 genomic region.
A total of 2 kb of sequence encompassing the G1 gene was screened against the DNA sequence databases. Two different human, four different rat, and one pig EMBL cDNA database entries, as well as 13 human, one mouse, and two fish expressed sequence tags (ESTs), aligned over this genomic region with significant similarity. The two human EMBL entries were the AIF1 gene (accession no. U19713) (23) and the IFN-γ-responsive transcript (IRT-1) (accession no. U95213). The 13 human EST entries (accession nos. W58116, W67117, R71716, N47817, W67362, T79488, T69387, N32593, N32605, W67118, W21034, AA091585, and THC167143(TIGR)) fell into two groups: ESTs W21034, AA091585, and THC167143(TIGR) matched the G1 cDNA (19) exactly, while the remaining 10 ESTs were all found to be partial transcripts of AIF1. The G1, AIF1, and IRT-1 cDNA sequences are 500, 639, and 1235 nt in length and encode polypeptides of 93, 147, and 132 amino acids, respectively (19, 23, and accession no. U95213). Fig. 2 shows the genomic organization of G1, AIF1, and IRT-1 plus the exon usage of the three transcripts, which all appear to be splice variants of the same gene.
The five sequences from other organisms within the databases that closely matched the G1 genomic region were AIF1 homologues from pig (P81076), mouse (AA016714), rat (P55009) (24), zebrafish (AA495202; partial cDNA sequence) and carp (C88427; partial cDNA sequence); a rat transcript varying from rat AIF1 over its first 279 nt and 8 amino acids (D82069); and the rat balloon angioplasty responsive transcript (BART-1) (U33471) (25) that differs yet again in its splicing pattern. The latter two rat transcripts (D82069 and U33471) appear to share exonic sequence, and possibly coding sequence, 5′ of the first exon of AIF1. An equivalent sequence has not yet been identified in humans. Furthermore, the BART-1 transcript contains an additional 436 nt of internal sequence: a 158-nt insert at AIF1 position 121; and a 278-nt insert at position 251. The former encodes a putative transmembrane region 5′ of the sequence aligning with G1 and IRT-1. However, this database entry was found to contain a number of sequence discrepancies when compared with the rat AIF1 transcripts (U17919, AB000818, U10894, and I22424), including three frame shift errors within the proposed coding region. After editing the BART-1 database sequence to correct for the identifiable frame shift errors, the length of the predicted BART-1 polypeptide deduced by us is 15 residues shorter (215 amino acids) than U33471 and differs from the database sequence over its first 26 amino acids, although it still encodes a putative transmembrane region. No assessment can be made of the accuracy of the 436 nt of sequence specific to the BART-1 transcript.
Encoded within sequence from exons 4 and 5 of AIF1, which is shared by all the alternative transcripts detected, is a putative EF hand calcium-binding motif. This motif was previously reported by Olavesen et al. (19) and Utans et al. (23) for G1 and AIF1, respectively, and was also detected with the Prosite Profilescan server. The splicing of the IRT-1 transcript causes an insertion of 66 amino acids into the conserved loop of the EF hand domain. A multiple alignment of the splice variants of human AIF1 and the orthologous proteins identified in other species is presented in Fig. 3.
Expression profiles were determined for G1 and AIF1 by RT-PCR using transcript-specific primers (Fig. 2C). The results show that G1 is expressed at high levels in the cell lines Jurkat 6, Molt 4, HL60, and U937, and to a slightly lesser extent in the cell line Raji. Although AIF1 expression was detectable in the same cell lines as G1, it was detected at high levels only in the cell lines HL60 and U937. This is consistent with the findings of Utans et al. (23).
In addition to the published 1C7 cDNA sequence (21), two I.M.A.G.E. consortium cDNA clones, 685808 (EST AA262074), and 683963 (ESTs AA237100 and AA236886), both of B cell origin, were found to align within the genomic region immediately centromeric of the LST-1 (B144) gene. Sequence analysis showed that these clones contained inserts of 1120 nt (683963) and 721 nt (685808), and both aligned with the genomic sequence over 4 exons (Fig. 4A). Both clones extend the published 1C7 cDNA sequence (21, 26) by 306 nt at the 5′ end and contain an additional 1C7 exon. This additional coding sequence lies 2.8 kb upstream of that previously published and contains an in-frame AUG codon preceded by in-frame stop codons. Clone 683963 also extends the 1C7 sequence by 141 nt at the 3′ end to include a polyadenylation signal at position 30,678 and a polyadenylation site at position 30,711. This 1C7 polyadenylation signal lies only 47 nt centromeric of the polyadenylation signal for LST-1, the two genes being transcribed in opposite directions. The three transcripts derived from the 1C7 gene have been named 1C7a (clone 683963), 1C7b (the partial 1C7 sequence already published; 21), and 1C7c (clone 685808). These all differ in their last exon (Fig. 4, exons 4I, 4II, and 4III). The 1C7a and 1C7b transcripts, although sharing their last 276 nt, differ in the 5′ splice site of their last exon, with the fourth exon of 1C7b (exon 4II) extending into intron 3 of 1C7a by 55 nt. In contrast, the fourth exon of 1C7c (exon 4I) lies entirely within intron 3 of 1C7a and 1C7b and has a separate stop codon at position 30,298, a polyadenylation signal at position 30,319, and a polyadenylation site at position 30,337 (Fig. 4).
Transcript profiles were generated by RT-PCR using primer pairs specific for 1C7a and 1C7b (Fig. 4B, panel I) or for 1C7c (Fig. 4B, panel II). A total of nine different PCR products were generated, and sequence analysis confirmed that these were all splice variants of 1C7. The exon usage of each variant is indicated alongside the appropriate band in Fig. 4B. Band 1 corresponds to the splice variant 1C7b, band 2 to 1C7a, and band 7 to 1C7c. 1C7a and 1C7b (Fig. 4B, panel I) show similar expression patterns, being expressed predominantly in the immune-related cell lines Raji, Molt 4, and HL60, and to a lesser extent in Jurkat 6. In contrast, 1C7c consistently shows a more specific expression pattern, being detected exclusively in the lymphocyte cell lines Raji and Molt 4 (Fig. 4B, panel II).
Apart from the variation in exon 4 usage described above, RT-PCR identified 1C7 transcripts where exon 2 is divided into two exons (exons 2I and 2II in Fig. 4A). In the smaller PCR products obtained (Fig. 4B, bands 5, 6, and 9), only the 3′ segment of exon 2 (2II) is spliced in. These transcripts encode a truncated polypeptide of only 15 amino acids, due to the incorporation of a stop codon as a result of the splicing of exon 1 to exon 2II, and are probably nonfunctional. In the RT-PCR products 3, 4, and 8 (Fig. 4B), both exons 2I and 2II are used, resulting in a 75-nt deletion in the center of exon 2. Despite this deletion, the open reading frame is retained, resulting in a transcript 75 nt shorter and an encoded polypeptide 25 amino acids shorter than in the corresponding species using the complete exon 2. This variation in usage of exon 2 results in three novel splice variants: 1C7d, 1C7e, and 1C7f (Fig. 4B). Therefore, the six detected transcripts of 1C7 that contain an open reading frame of significant length comprise either four or five exons, including one of three alternatively spliced last exons, and either a single exon 2 or two exons from the exon 2 region. All exons are in phase 1 (27) at their 3′ end except where exon 2 is split into two, in which case the boundary between exons 2I and 2II is of phase 0 (Fig. 5B). The transcripts 1C7a to 1C7f encode six putative protein isoforms of 201, 178, 190, 176, 153, and 165 amino acids, respectively (Fig. 5A).
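The relationship between the 75-nt deletion and the reported isoform lengths can be checked with a few lines of arithmetic. In this sketch the pairing of each shortened variant with its full-length counterpart (1C7d with 1C7a, 1C7e with 1C7b, 1C7f with 1C7c) is an assumption inferred only from the order in which the lengths are listed; it is not stated explicitly in the text.

```python
DELETION_NT = 75  # in-frame deletion in the center of exon 2

# Isoform lengths (amino acids) as reported for transcripts using the complete exon 2.
full_length = {"1C7a": 201, "1C7b": 178, "1C7c": 190}
# Reported lengths for the variants using exons 2I and 2II.
reported = {"1C7d": 176, "1C7e": 153, "1C7f": 165}
# Assumed pairing of full-length and shortened variants (see lead-in).
paired_variant = {"1C7a": "1C7d", "1C7b": "1C7e", "1C7c": "1C7f"}


def residues_lost(deletion_nt):
    """Residues removed by a deletion; only defined if the frame is kept."""
    if deletion_nt % 3 != 0:
        raise ValueError("deletion is not a multiple of 3 and would shift the frame")
    return deletion_nt // 3


lost = residues_lost(DELETION_NT)
predicted = {paired_variant[name]: length - lost
             for name, length in full_length.items()}
print(lost)                   # 25
print(predicted == reported)  # True
```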
The amino acid sequence encoded by exon 2 of 1C7 was found to have significant similarity with diverse Ig domain-containing proteins. The highest levels of similarity were observed with the TCR α-chain variable domain (accession no. M27351) for isoforms 1C7a, b, and c (33% identity and 45% similarity over 117 amino acids) and with an IgC2 domain from Perlecan (accession no. Q05793) for the truncated isoforms 1C7d, e, and f (35% identity and 46% similarity over 77 amino acids). Structure-based multiple sequence alignments of the putative Ig domain from 1C7a, b, and c with other Ig variable domains and of the putative Ig domain from 1C7d, e, and f with other IgC2 domains are shown in Fig. 6. It can be seen that all the key residues characteristic of Ig domains (as outlined by Harpaz and Chothia, 28) are conserved within the putative Ig domains of 1C7. Of particular interest is the result of the deletion in the central region of exon 2 in 1C7d, e, and f. This changes the features of the Ig domain in 1C7 from those characteristic of an IgV domain to those of an IgC2 domain, i.e., having many of the key residues of an Ig variable domain, but missing the Ig β-strand D and having a significantly shorter β-strand C′ (28).
In addition to an Ig domain, the 1C7 protein isoforms are predicted to have a signal peptide comprising residues 1–18 and a helical transmembrane region spanning residues 138–165 of 1C7a, b, and c and residues 112–140 of 1C7d, e, and f (Fig. 5B). Two N-linked glycosylation sites at Asn residues 42 and 121 are also present in all splice variants. In addition, a proline-rich region encoded in exon 4III and, therefore, exclusive to the C-termini of 1C7a and 1C7e contains a number of potential Src homology 3 (SH3) domain binding motifs (29, 30) and a potential protein kinase C phosphorylation motif at residues 168–171.
In addition to the published LST-1 cDNA sequence, 13 single entry ESTs were found to match with the LST-1 genomic region. These were compiled into seven distinctly different putative transcripts comprising a minimum of nine exons in total (full details are presented and discussed in Neville and Campbell, 31).
The 1.6-kb IKBL transcript (accession no. X77909) encodes a polypeptide of 381 amino acids. The structure of exons 1 and 2 has been previously reported (17), but the remaining genomic structure had not been characterized. Comparison of the IKBL cDNA sequence with the genomic sequence determined here showed that the IKBL gene spans 11,176 nt and comprises four exons of 807, 201, 276, and 125 nt in length. Exons 2 and 3 are separated by a particularly large intron (intron 2) of 9189 nt in length, which contains 15 Alu and 4 mammalian-wide interspersed repeat (MIR) SINE elements. There is one conflict in the coding region of exon 4 between the cDNA database entry and the genomic sequence reported here. A change from C–G in the database entry to G–C in the genomic sequence (positions 61,431 and 61,432) results in two adjacent codon changes from CAC–GAG to CAG–CAG and two corresponding amino acid changes: His-238-Gln and Glu-239-Gln.
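The effect of this two-nucleotide conflict on the encoded residues can be traced with a minimal sketch. It takes the database codon pair to be CAC–GAG, the reading that is consistent with a change at two adjacent positions and with residue Glu-239; that reading is an assumption of the sketch. Only the four codons needed are included, with assignments from the standard genetic code, and the function names are illustrative.

```python
# Subset of the standard genetic code covering the codons discussed here.
STANDARD_CODE = {"CAC": "His", "GAC": "Asp", "GAG": "Glu", "CAG": "Gln"}


def translate(codons):
    """Translate a list of DNA codons into three-letter residue names."""
    return [STANDARD_CODE[codon] for codon in codons]


def substitutions(before, after):
    """Return (1-based offset, old base, new base) for every differing base."""
    seq_before, seq_after = "".join(before), "".join(after)
    return [(i + 1, old, new)
            for i, (old, new) in enumerate(zip(seq_before, seq_after))
            if old != new]


database_codons = ["CAC", "GAG"]  # cDNA database entry (assumed reading)
genomic_codons = ["CAG", "CAG"]   # genomic sequence

print(substitutions(database_codons, genomic_codons))  # [(3, 'C', 'G'), (4, 'G', 'C')]
print(translate(database_codons))                      # ['His', 'Glu']
print(translate(genomic_codons))                       # ['Gln', 'Gln']
```

The two differing bases fall at the last position of the first codon and the first position of the second, which is why a single dinucleotide change alters two adjacent residues.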
New potential coding regions
An ATPase G subunit homolog.
The program GRAIL predicted the presence of 3 exons ∼1 kb telomeric of the IKBL gene. The I.M.A.G.E. cDNA clone 726424 (ESTs AA401769 and AA399356) and the American Type Culture Collection cDNA clone 124837 (EST AA324358; Manassas, VA) were found to align with this genomic region and when sequenced were also found to span three exons. However, the clones were found to have different 3′ splice sites for exon 1 and to use different polyadenylation signals (Fig. 7, A and B). Clones 726424 and 124837 contained inserts of 663 and 1320 nt, encoding polypeptides of 77 and 118 amino acids, respectively (Fig. 8). The longer protein shows significant sequence similarity with the vacuolar-ATPase G subunit of Bos taurus (82%), Manduca sexta (75%), Caenorhabditis elegans (78%), Neurospora crassa (60%), and Saccharomyces cerevisiae (54%) over the entire lengths of these proteins. Therefore, the novel gene described here has been named ATP6G; the 6 denoting a vacuolar-type H+ ATPase subunit following the GDB nomenclature and the G denoting a G subunit. The level of conservation between human ATP6G and the orthologous proteins in other species is particularly high over the first 50 amino acids (Fig. 8). The truncated human protein (ATP6Galt) encoded by the splice variant represented by cDNA clone 726424 lacks the first 41 of these residues because the putative initiation AUG codon encoded in clone 124837 is spliced out of clone 726424. The second AUG, and putative initiation AUG of clone 726424, is located in exon 2 (Fig. 7, A and B).
Expression analysis was conducted by RT-PCR using primers from the two alternative 3′ UTRs of ATP6G (Table I). Two PCR products were detected in each profile, and the exon usage of these species was determined by sequence analysis (Fig. 7B, panels I and II). The reverse primer used to generate the PCR products shown in panel I of Fig. 7B is specific for the species with the longer 3′ UTR, as found in clone 124837 (Fig. 7A), while the products in panel II of Fig. 7B were generated using a primer that does not discriminate between the alternative 3′ UTRs as it is positioned within sequence common to both 3′ UTRs upstream of the first polyadenylation signal. It has been confirmed that the upper band in both profiles corresponds to transcripts containing exon 1I (Fig. 7A) (as found in clone 124837), while the lower band of both profiles corresponds to transcripts containing the alternatively spliced exon 1II (as found in clone 726424). Both variants of exon 1 splicing were detected with either the long or short 3′ UTR. However, their expression varies across the nine cell lines examined. In Fig. 7B, panel I (amplification specific for the long 3′ UTR), transcripts containing exon 1I and exon 1II show a similar pattern of expression, with the highest levels being detected in the lymphoid cell lines and little or no expression in the monocyte and macrophage-like cell lines (HL60 and U937, respectively). In contrast, in Fig. 7B, panel II (amplification of both the short and long 3′ UTRs), the expression patterns of transcripts containing exon 1I and exon 1II differ. The transcript with exon 1I (upper band) shows a similar pattern of expression to those in Fig. 7B, panel I. However, expression of the transcript with exon 1II (lower band) appears less specific, being detectable at significant levels in all the cell lines tested, the greatest level of expression being seen in the cell line HepG2.
Further potential coding regions.
ESTs from cDNA clones 120912 (American Type Culture Collection) and 1143343 (I.M.A.G.E.) were found to align around the NB6 genomic region; these ESTs were sequenced. Both contained Alu repeats and were, therefore, considered to be derived from genomic DNA contamination.
AIF1 and G1
The comparison of our genomic sequence from the G1 region with the DNA databases indicates that three alternative human transcripts may exist, derived from differential splicing of a total of nine exons. Of these, AIF1 contains six exons, while G1 contains an alternatively spliced form of exon 4 together with exons 5 and 6 only. The distinctly different patterns of expression shown here for G1 and AIF1, together with the fact that three ESTs have been identified that correspond exactly to G1, indicate that G1 and AIF1 are genuine splice variants of the same gene, rather than G1 being an incomplete cDNA. AIF1 is an IFN-γ-inducible molecule expressed within cells of the monocyte lineage, including CD86+ macrophages (23) and dendritic and microglial cells (24). AIF1 has been associated with processes involved in chronic allograft rejection in both humans and rats (23). It has also been shown to both inhibit and enhance insulin secretion and has been found at high levels in macrophages isolated from prediabetic rat pancreatic extracts, but not in normal pancreatic extracts (24). Thus an involvement of AIF1 in the progression of IDDM has been suggested. Furthermore, the BART-1 splice variant from rat is selectively and transiently expressed in response to vascular trauma and has been suggested to play a role in the early to middle stages of vascular restenosis (25).
A number of different features have been identified within the encoded amino acid sequence of AIF1 that may help to define the biological activity of this protein. Firstly, a 44-amino acid segment shared by all the splice variants and homologues of AIF1 contains a cluster of paired basic residues, KR-KK-(G)KR (24), which are characteristic cleavage motifs for peptide hormone precursors (32). The last potential cleavage site is preceded by a glycine residue, which is a characteristic amidation signal (24), and is also a feature of hormone precursor proteins. Interestingly, the IRT-1 splice variant contains a further five paired basic residues (KR-KK-(G)RR-RK-RR) encoded within exon 4III, with the middle site being another possible amidation signal.
Residues 48–108 of AIF1 show strong similarity to an EF hand calcium binding domain (15, 19, 33), which consists of two helix-loop-helix subdomains each containing an EF hand motif across their loop region (34). The calcium-binding domain in AIF1, like many such domains, contains one highly conserved EF hand motif (residues 58–79) and one degenerate or “ancestral” EF hand motif (residues 91–108) that is no longer able to bind calcium (Fig. 3). The conserved EF hand motif of AIF1 (shaded in Fig. 3) deviates from the consensus for this motif at its twelfth (−Z) position, which is almost exclusively an acidic residue, but in AIF1 is a serine. This does not necessarily rule out the binding of calcium by AIF1, as there are a number of examples of functional EF hand domains that deviate from the consensus at this twelfth residue (35, 36, 37). However, the EF hand motif may not be functional in IRT-1 and G1. Although they contain the conserved EF hand loop, they are truncated at their N-termini, resulting in the absence of most of the first helix of this EF hand subdomain, which may affect calcium binding. More significantly though, a 66-amino acid insertion within the IRT-1 splice variant falls within the conserved loop of the EF hand (Fig. 3). Also, if the dibasic endoprotease cleavage motifs are active, the whole EF hand domain of IRT-1 would be disrupted.
Work undertaken so far suggests that the products of the alternative transcripts of AIF1 are involved in the inflammatory response and in a number of disease processes. In addition, the high degree of sequence conservation at the amino acid level between the AIF1 proteins in species as divergent as human and fish (Fig. 3) suggests that AIF1 plays a functionally important role in the cell. Taken together, these two observations highlight AIF1 as a possible candidate disease susceptibility gene for further study. Future work will no doubt shed light on the significance of the prohormone cleavage motifs and the calcium binding ability of the EF hand motif in the different splice variants.
1C7, a putative novel member of the Ig superfamily
All members of the Ig superfamily share a common structure, comprising two β-sheets with a core consisting of β-strands A, B, and E in one sheet and G, F, and C in the other (38). The V, C1, C2 and I Ig sets are distinguished on the basis of other features (38). Harpaz and Chothia (28) defined a group of 46 residues occupying 20 key sites that form the characteristic folds of the IgV set, which they called the “V frame.” Alignment of the 1C7a, b, and c Ig-like domain with other IgV domains (Fig. 6A) highlights the conservation of all the key “V frame” residues in 1C7 (indicated with an asterisk). This alignment was based on a secondary structure prediction that defined the approximate position of all the β-strands in 1C7. In addition, the Ig-like domain is encoded by an exon with a phase 1 splice junction at either end (Fig. 5B), another feature common to Ig domains. This provides convincing evidence that 1C7 is a new member of the Ig superfamily. The expression pattern of this gene further suggests its involvement in the immune response (Fig. 4B). Of particular interest is the existence of the transcripts 1C7d, 1C7e, and 1C7f (Fig. 4B), where the Ig-like domain is split over two exons that together encode a truncated form of the Ig domain, containing features characteristic of the IgC2 set. It retains the core β-strands characteristic of all Ig domains (38) and the conserved EF loop characteristic of the IgV set, but lacks the central region of the domain containing β-strand D and most of β-strand C′ (Fig. 6B). These are all defining features of the IgC2 set (39). As mentioned above, one of the features of Ig domains is that the majority are encoded by a single exon. However, there are exceptions to this rule, and examples of Ig domains that are divided over two exons include CD4, Po, PolyIgR, and NCAM (38). Significantly, in nearly all cases where an intron exists within an Ig domain, the splice junction is in a phase other than phase 1, as is the case with 1C7d, e, and f (Fig. 5B). This would preclude the functional splicing of just half the domain within the same gene.
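The splice-phase argument can be made concrete with a short Python sketch. It uses the usual convention that an intron's phase is the number of nucleotides of the interrupted codon lying upstream of the intron (0, 1, or 2). An exon can then be included or skipped without shifting the reading frame only when its coding length is a multiple of three, that is, when the introns on either side share the same phase. The exon lengths and function names below are hypothetical and are not the 1C7 exon sizes.

```python
# Illustrative sketch only: exon lengths are hypothetical, not those of 1C7.
from typing import List


def intron_phases(exon_cds_lengths: List[int]) -> List[int]:
    """Phase (0, 1, or 2) of the intron that follows each exon except the last."""
    phases = []
    coding_nt_so_far = 0
    for length in exon_cds_lengths[:-1]:
        coding_nt_so_far += length
        # Nucleotides of the split codon that lie upstream of the intron.
        phases.append(coding_nt_so_far % 3)
    return phases


def skipping_keeps_frame(exon_cds_lengths: List[int], index: int) -> bool:
    """True if removing the exon at `index` leaves downstream codons in frame."""
    return exon_cds_lengths[index] % 3 == 0


# A single domain exon flanked by phase 1 junctions on both sides.
single_exon_gene = [100, 330, 200]
# The same region split over two exons by an intron in a different phase.
split_domain_gene = [100, 160, 170, 200]

print(intron_phases(single_exon_gene))
print(skipping_keeps_frame(single_exon_gene, 1))
print(intron_phases(split_domain_gene))
print(skipping_keeps_frame(split_domain_gene, 1))
```

In the second hypothetical gene, the internal intron is in phase 2 while the flanking introns are in phase 1, so either half-domain exon spliced out alone would shift the downstream reading frame.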
The mature 1C7 peptide is predicted to be a type I integral membrane protein with no cysteine residues N-terminal of its membrane-spanning region, other than the two within its single Ig-like domain. This suggests that the 1C7 protein exists as a monomeric membrane-bound molecule (Fig. 5C). Other Ig superfamily members with a similar structure that exist as monomers include THY-1, CD83, CD7, and CD79a (40). 1C7 also exhibits a complex pattern of alternative splicing. However, transcripts 5, 6, and 9 (Fig. 4B) are unlikely to encode a functional protein, and further work is needed to assess whether all of the other 1C7 splice variants are functional. Exon 4III in transcripts 1C7a and 1C7e encodes a number of potential SH3 domain binding motifs. Such motifs are thought to interact with SH3 domain-containing proteins during tyrosine kinase receptor activation (30) as part of a signal transduction cascade. Thus, at least this variant of exon 4 contains features suggesting that it is associated with the expression of a functional protein.
Clearly, further work is needed to confirm the expression of the various isoforms of 1C7 at the protein level and to resolve their tertiary structures as proof of their Ig superfamily membership. However, if both of the alternative Ig-like domains of 1C7 do turn out to be functional, then this would appear to be the first identified example of a protein undergoing Ig set switching. This would provide a valuable insight into the evolution of the various sets within the Ig superfamily, particularly the C2 and I sets, which have many of the “V frame” features. Finally, the identification of 1C7 as a putative novel member of the Ig superfamily, in a region where genetic studies have suggested the existence of disease susceptibility loci for conditions such as IDDM and CVID (7) (R. E. March, unpublished observations), highlights this as a potentially important candidate disease susceptibility gene in relation to both autoimmune disorders and immunodeficiencies.
A putative human homolog of the tobacco hornworm V-ATPase G subunit
The analysis of the ∼7-kb gap between IKBL and BAT1 has led to the identification of a new gene that encodes a protein homologous to V-ATPase G subunits. The vacuolar-type ATPases are H+-translocating ATPases found in most organelles and are involved in a broad range of functions, including bone resorption, glycosylation in the Golgi, degradation of cellular debris in lysosomes, and the processing of endocytosed receptor-ligand complexes (41). The G subunit has so far been characterized in tobacco hornworm, rat, chicken, and cow and is one of two known peripheral components of the V1 catalytic ATPase complex involved in the catalysis of ATP hydrolysis (41, 42), the second peripheral subunit being the H subunit. Both of these subunits have been shown to participate in ATPase activity, but rather than acting together it seems that either subunit alone can initiate ATP hydrolysis (41). The G and H subunits appear to be expressed in different tissue types and are 64% identical at the amino acid level (41). Thus, it is likely that they confer the same activity, but in different locations. Therefore, the absence of RT-PCR products corresponding to the longer form of the ATP6G protein in the monocyte and macrophage cell lines (Fig. 7B, panel I) may suggest that in these cell lines the H rather than the G subunit is constitutively expressed. The level of amino acid sequence conservation between the G and H subunits and between the G subunits of different species is particularly high over the first 50 amino acids (see Fig. 8), suggesting that this region is an important functional domain. This would preclude the truncated isoform of human ATP6G from being functional.
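A figure such as the 64% amino acid identity quoted above is normally derived from a pairwise alignment. The Python sketch below shows one common way to compute it, counting identical residues over the aligned columns in which neither sequence has a gap. This denominator is an assumption, since other conventions (for example, dividing by the full alignment length) are also in use, and the source does not state which was applied. The two aligned fragments are invented and are not the G or H subunit sequences.

```python
# Illustrative sketch only: the aligned fragments are invented, and the
# choice of denominator (gap-free columns) is an assumption.
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * identical / compared


fragment_a = "MASQ-GIQQL"  # hypothetical aligned fragment
fragment_b = "MASQSGIAQL"  # hypothetical aligned fragment

print(round(percent_identity(fragment_a, fragment_b), 1))
```

Restricting the same calculation to the first 50 alignment columns would give a regional identity of the kind discussed for the N-terminal portion of the subunits.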
A role in the inflammatory and immune responses has been suggested for V-ATPases (43, 44), and IL-1 has been shown to modulate ATPase activity in a dose-dependent manner (43). This is of particular interest in view of the location of ATP6G in the MHC class III region.
It is becoming increasingly apparent from information already available on genes such as LTB, TNF, and LTA, and from the results described here, that many of the genes located at the telomeric end of the class III region are involved in the immune and/or inflammatory responses and are thus good candidate genes for susceptibility to diseases such as IDDM and CVID. The evidence generated from screening the EST databases and the results of RT-PCR have also highlighted the complexity of this region, with many of the genes having alternatively spliced forms (AIF1, 1C7, LST-1, and possibly ATP6G). Finally, the class III region remains the most gene-dense region of the human genome, with an average of 1 gene per 10 kb of DNA. Indeed, the gene density of the 82-kb region discussed here is now 1 gene per 8 kb of DNA.
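The density figure for the 82-kb segment can be checked with simple arithmetic, sketched here in Python. The gene count is an assumption: the text does not state whether the figure counts only the 10 previously known genes or also the newly identified ATP6G (11 in total), so both are shown. The function name is hypothetical.

```python
# Illustrative sketch only: the gene counts are assumptions (10 known genes,
# or 11 including the newly identified ATP6G).
def kb_per_gene(region_kb: float, gene_count: int) -> float:
    """Average number of kilobases of DNA per gene in a region."""
    if gene_count <= 0:
        raise ValueError("gene_count must be positive")
    return region_kb / gene_count


print(round(kb_per_gene(82, 10), 2))
print(round(kb_per_gene(82, 11), 2))
```

Either count gives a value close to 1 gene per 8 kb, which is denser than the class III average of 1 gene per 10 kb.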
We thank Helene Jones for invaluable technical assistance relating to DNA sequencing and analysis. We thank Dr. Ruth March, Dr. Begona Aguado, Dr. John Broxholme, and Suzanne Jenkins for helpful advice and discussions. We also thank Drs. Neil Barclay and Alex Bateman for helpful discussions in relation to the Ig superfamily.
↵1 M.J.N. is supported by a studentship from the Medical Research Council (U.K.).
↵2 Address correspondence and reprint requests to Dr. R. Duncan Campbell, U.K. Human Genome Mapping Project Resource Centre, Hinxton, Cambridge CB10 1SB, U.K. E-mail address:
↵3 Abbreviations used in this paper: Mb, megabases; LT, lymphotoxin; IDDM, insulin-dependent diabetes mellitus; CVID, common variable immunodeficiency; AIF, allograft inflammatory factor; LST, leukocyte-specific transcript; UTR, untranslated region; EST, expressed sequence tag; IRT, IFN-γ-responsive transcript; BART, balloon angioplasty response transcript; nt, nucleotide(s); SH3, Src homology 3.
- Received October 13, 1998.
- Accepted January 25, 1999.
- Copyright © 1999 by The American Association of Immunologists