Genomic organization of the TRB and TRG loci
The current highly contiguous reference genome ARS1 (BioProject ID: PRJNA290100) was used to identify the TRB and the TRG loci in goats (Capra hircus). Pre-existing data on physical mapping assigned these loci to goat chromosome 4 [27, 31]. Thus, using the corresponding sheep genomic sequences as reference, we retrieved from the Capra hircus chromosome 4 genomic scaffold (GenBank ID: CM004565) the TRB and TRG genomic regions and established their structures.
The TRB locus
To determine the genomic organization of the goat TRB locus, a sequence approximately 560 kb in length was retrieved from the whole chromosome 4 contig, comprising the IMGT 5′ borne (MOXD2 gene) and the IMGT 3′ borne (EPHB6 gene) that flank all mammalian TRB loci studied to date. We identified and annotated all TRB genes using the sheep sequence as a reference [11, 12]. The general structural organization of the goat TRB locus is conserved with respect to the other artiodactyl species [11,12,13,14, 16, 17], with a library of TRBV genes positioned at the 5′ end of three D-J-C clusters, that is followed by a single TRBV gene located at the 3′ end in an inverted transcriptional orientation (Fig. 1). Moreover, we identified and annotated the trypsin-like serine protease (TRY) genes that are typically interspersed among the mammalian TRB genes. Four TRY genes are located downstream of TRBV1, and one TRY gene is located upstream of the D-J-C region (Fig. 1). The classification, position and predicted functionality of all related and unrelated TRB genes are reported in Additional file 1: Table S1 and Additional file 2: Table S2, respectively.
Classification and phylogenetic analysis of the TRBV genes
We annotated a pool of 91 TRBV genes grouped into 27 distinct subgroups according to the criterion that sequences with nucleotide identity above 75% belong to the same subgroup [32]. Five subgroups are multimembers with a massive expansion of the TRBV5 (30 genes), TRBV6 (29 genes) and TRBV21 (6 genes), while both the TRBV7 and TRBV12 subgroups consist of two genes. Out of the 91 TRBV genes, 54 (approximately 60%) are predicted to be functional genes as defined by the IMGT rules (see “Methods”), and 37 (40%) are not functional (pseudogenes and ORF) (Additional file 3: Table S3). The deduced amino acid sequences of the potential functional germline TRBV genes, ORF and in-frame pseudogenes are shown in Additional file 4: Figure S1, where they are aligned according to IMGT unique numbering for the V-REGION [33] to maximize the percentage of identity.
To validate the classification and the membership of the goat TRBV genes to the subgroups, the evolutionary relationship of these genes was investigated in the context of the artiodactyl order by comparing all goat gene sequences with the corresponding sheep, pig and dromedary sequences, adopting two selection criteria: (i) only potential functional genes and in-frame pseudogenes were included, and (ii) only one gene was selected per subgroup. Thus, the V-REGION nucleotide sequences of all selected TRBV genes were combined in the same alignment, and an unrooted phylogenetic tree was constructed using the NJ method [34] (Fig. 2). The tree shows that each of the 27 goat subgroups forms a monophyletic group, when present, with corresponding sheep, pig and dromedary genes, consistent with the birth of distinct subgroups prior to the divergence of the different species.
Moreover, in accordance with the accepted phylogeny of these species, the goat and sheep orthologous genes are more closely related to each other than to their pig and dromedary counterparts in all groupings but one (TRBV29). As in sheep, the TRBV23 subgroup is missing in the goat locus. Hence, the phylogenetic clustering confirmed the classification of each goat TRBV subgroup as orthologous to the corresponding artiodactyl subgroup.
A second phylogenetic tree was constructed to investigate the evolutionary origin of the goat multimember TRBV subgroups. Thus, the V-REGION nucleotide sequences of all goat TRBV5, TRBV6 and TRBV21 subgroup genes were compared with the sequences of all genes included in the corresponding sheep, pig and dromedary TRBV subgroups to obtain an NJ phylogenetic tree [34] (Fig. 3). The tree shows three major monophyletic groupings (A-C), corresponding to the three TRBV subgroups. Taking into account that the same subgroups are also expanded in the sheep locus, it is evident that within each grouping, goat and sheep genes intermingle. In particular, in branch A, each goat TRBV21 gene is tightly related to a corresponding sheep gene, indicating the existence of a common ancestor of the two species in which the duplication events occurred. On the other hand, the grouped dromedary TRBV21 genes appear to have evolved in a species-specific manner. Likewise, within branches B and C, very often a goat TRBV5 or TRBV6 gene forms a monophyletic clade with a corresponding sheep gene (as shown by the alternated yellow and green circles in Fig. 3), indicating that most duplication events were shared between the two species, whereas a minor number of duplication events affected the goat and sheep lineages independently. However, no such striking gene duplications have affected swine or dromedary species.
Genomic architecture of the TRBV-cluster
The goat TRBV region was aligned against itself with the PipMaker program to identify, through identity lines, the model of duplications that have generated the expansion of the TRBV gene subgroups (Additional file 5-6: Figure S2A-B). At first glance, an inspection of the dot-plot matrix reveals a background spread throughout the region probably due to intergenic sequences involved in the massive duplicative events of the TRBV genes.
However, the matrix confirms the high level of nucleotide identity between TRBV genes as indicated by dots and diagonal lines that correspond with gene location. In particular, we observed parallel lines forming two square shapes (shown in red in Additional file 5: Figure S2A) that identify multiple tandem duplications.
The first and wider square occupies an area of approximately 182 kb where the TRBV5 and TRBV6 genes have given rise to a complex series of tandem duplicative events (Additional file 6: Figure S2B). The tandem duplication seems to have involved a single TRBV5 or TRBV6 gene, or to consist of a duplicated TRBV5-TRBV6 unit. Furthermore, the presence of parallel lines of different sizes indicates the presence of higher order repetitive units, which include various arrangements (alternations) of the TRBV5 and TRBV6 genes, such as the 4x(TRBV5-TRBV6) unit contained in the regions from TRBV5–9 to TRBV6–10, from TRBV5–13 to TRBV6–14, from TRBV5–20 to TRBV6–23 and from TRBV5–14 to TRBV6–15, as indicated by longer parallel lines.
The other square shape identifies the tandem duplications of the TRBV21 subgroup genes (Additional file 5: Figure S2A).
Description of the D-J-C region
The goat TRBD, TRBJ and TRBC genes were distributed within three in-tandem D-J-C clusters positioned at the 3′ end of the TRB locus, with D-J-C cluster 3 positioned between D-J-C clusters 1 and 2 (Fig. 1).
The nucleotide and deduced amino acid sequences of the three TRBD genes identified in the region are shown in Additional file 7: Figure S3A. Their length is identical to the corresponding sheep genes, with the exception of the TRBD1, which has one fewer nucleotide in goat. The three TRBD genes consist of 13 bp (TRBD1), 17 bp (TRBD3) and 16 bp (TRBD2) G-rich stretches that can be productively read through their three coding phases, and they encode till 4 glycine residues, depending on the phase. The RS sequences that flank the 5′ and 3′ sides of the coding region are well conserved with respect to the consensus sequence.
The nucleotides and deduced amino acid sequences of all the TRBJ genes identified in the region are shown in Additional file 8: Figure S3B. The TRBJ genes were classified, based on the international nomenclature (IMGT®, http://www.imgt.org, [38, 39]), according to the subsequent TRBC gene and numbered in agreement with the corresponding sheep TRBJ clusters. All TRBJ genes are between 44 bp and 53 bp in length and most contain the canonical FGXG amino acid motif whose presence defines the functionality of the J genes. The exceptions were TRBJ1–3 and TRBJ3–6, which lack the FGXG J-MOTIF. Each TRBJ gene is flanked by a 12 RS sequence at the 5′ end and by a donor splice site at the 3′ end. All the RSs are well conserved with respect to the consensus sequence, except for the non-canonical RS of TRBJ3–6 and the TRBJ1–3 and TRBJ2–7 heptamers, where the TG located in the last two nucleotide positions is mutated to GC. The TRBJ3–6 gene also contains a non-canonical donor-splice site.
Like the TRBC genes of all known mammalian species, the three goat TRBC genes are nearly identical and are composed of four exons and three introns. The goat TRBC genes encode a similar protein of 178 amino acids (AA). The difference of 16 nucleotides between them results in seven AA changes, with six clustered in the extracellular region (C-domain) and one in the cytoplasmic region (Additional file 9: Figure S3C). As in sheep, the C-domain encoded by exon (EX) 1 is 130 AA in length due to the identical length of the FG loop compared to the other artiodactyl species [14, 16]. The C region also comprises a connecting region of 21 AA (encoded by EX2 and the 5′ part of EX3) with a cysteine involved in interchain disulphide binding, a transmembrane region of 21 AA (encoded by the 3′ part of EX3 and the first codon of EX4) and a cytoplasmic region of 5 AA (encoded by EX4).
Analysis of cDNAs
To assess the quality of the genome assembly used to characterize the goat TRB locus and to evaluate the in silico predicted functional competency of the genes, we analysed the goat cDNA clone dataset available at IMGT/LIGM-DB (http://www.imgt.org). Due to the absence of a characterized goat germline reference sequence, each cDNA sequence in the database was annotated with a provisional name, indicated by the letter S, that identifies the TRBV subgroups (“IMGT gene” in Additional file 10: Table S4). Hence, the comparison between germline and cDNA sequences allowed us to evaluate the participation of the germline TRBV and TRBJ genes in the generation of the expressed repertoire and to update the IMGT classification of the genes.
Only five cDNAs perfectly matched the corresponding germline TRBV genes sequences (Additional file 10: Table S4). Twenty expressed TRBV genes showed 97 to 99% nucleotide identity to their corresponding germline genes. We referred to these as alleles based on the assumption that sequences sharing > 97% of nucleotide identity represent the same gene [14, 16]. In contrast, the highest percentage of nucleotide identity between the TRBV gene sequences of two cDNA clones (TCRVB1 and TCRVB53 in Additional file 10: Table S4) and the TRBV5 and TRBV7 germline gene sequences, respectively, was < 97%, indicating that these expressed genes may represent products of additional genes absent in the current assembly, even if polymorphism conditions cannot be ruled out. However, the functional competency of 21 TRBV genes belonging to 17 subgroups was confirmed. Among them was the TRBV3 gene, which has a stop-codon at position 108 (Additional file 4: Figure S1A) that disappears during somatic recombination, resulting in a productive transcript (TCRVB54 clone in Additional file 10: Table S4). As expected, no corresponding cDNAs were found for the germline TRBV genes classified as pseudogenes.
All the TRBJ genes were unambiguously classified with respect to their corresponding genomic sequences. A total of 10 out of 17 functional TRBJ sequences were detected. The TRBJ2 cluster is the most highly represented (19/27), while six clones retain a member of the TRBJ3 cluster and three clones retain the TRBJ1 gene set. Notably, TRBJ2–7, classified as an ORF for a non-canonical heptamer, was found to be rearranged in five different cDNA clones. Furthermore, all three TRBC genes were identified among the cDNAs, each spliced to the rearranged TRBJ of its own D-J-C cluster.
For the determination of the TRBD genes, the nucleotide sequences of the CDR3-IMGT of each expressed product, from codon 105 to 117 according to the unique numbering [33], were excised and analysed in detail (Fig. 4). The deduced amino acid sequences of the CDR3 loop reveal that it is heterogeneous in regard to amino acid composition and length without specific differences in relation to TRBV or TRBJ gene usage. The mean length of the CDR3 loop was 12.5 AA (with a range of 9 to 16 AA), which is in line with the previous analyses of sheep [40], pig [16] and dromedary [14] TRB transcripts.
Compared to genomic TRBD, the sequences located in the CDR3 regions were considered to belong to a TRBD gene if they constituted a stretch of at least five consecutive nucleotides. Under this criterion, the TRBD gene was unambiguously identified in 16 out of 27 sequences (59%), with TRBD1 present in 7 clones, TRBD3 in six clones and TRBD2 in three clones. The remaining 11 sequences either did not have an identifiable TRBD gene (9 clones), or it was not possible to distinguish between TRBD3 and TRBD2 (2 clones) because of their similar germline sequences (Fig. 4). The absence of a TRBD region could be interpreted as the presence of a direct V-J junction. However, it is also possible that nucleotide trimming masked the initial participation of the TRBD gene during rearrangement. Four cDNA clones within the D region exhibited a nucleotide substitution with respect to the germline sequences.
Because the genomic organization of the D-J-C region is known, a formal interpretation of the D-J-C arrangement was also possible. Intra-cluster rearrangements are appreciable in seven clones, with three TRBD2-TRBJ2, two TRBD1-TRBJ1 and two TRBD3-TRBJ3 recombination. Nine rearrangements can be interpreted by direct 5′ to 3′ joining across the clusters (inter-cluster rearrangements), with four TRBD1-TRBJ2 and one TRBD1-TRBJ3, and four TRBD3-TRBJ2 recombinations.
TRG1 and TRG2 genomic organization
The genomic structure of the goat TRG loci was determined with an approach similar to the one used to identify the TRB locus. Using the sheep TRG1 (GenBank accession number DQ992075) and TRG2 (GenBank accession number DQ992074) sequences as reference, we identified two distinct genomic regions on the Capra hircus chromosome 4 genomic scaffold. The first region (pos: c37490000–37,659,887) corresponds to the TRG1 locus, and the second (pos.: 70113000–70,256,000) is representative of the TRG2 locus, confirming previous FISH mapping data of two TRG regions in Bovidae (i.e., sheep, goat, cattle and river buffalo [27]). Moreover, the availability of the whole sequence of goat chromosome 4 has allowed us to measure the exact distance between the two paralogous TRG loci (approximately 32 Mb) and to realize that they lie in an opposite transcriptional orientation on the chromosome.
The TRG1 locus extends for approximately 170 kb and comprises 10 TRGV, 7 TRGJ and 3 TRGC genes distributed in three V-J-J-C cassettes which we have classified, proceeding from the 5′ to 3′ end of the locus, as TRGC5, TRGC3 and TRGC4 in accordance with the corresponding sheep cassettes (Fig. 5). Six TRGV genes are located in the TRGC5 cassette, two in the TRGC3 cassette and one in the TRGC4 cassette. Three TRGV (TRGV11, TRGV3–2 and TRGV10) and two TRGJ (TRGJ5–2 and TRGJ3–2) genes are predicted to be pseudogenes, whereas TRGC5 appears to be an ORF due to its lack of a stop codon (Additional file 11: Table S5). However, by analysing the TRG genes in a different goat genome assembly (CHIR_2.0, GenBank ID: CM001713.2), the predicted functionality of each TRG1 gene was the same, with the exception of TRGV4 and TRGC5, which were both predicted to be functional in the CHIR_2.0 genome assembly (data not shown). Although we cannot rule out that these differences are due to allelic polymorphisms, it is possible that sequencing errors may account for the disparity. In both assemblies, as in other mammalian TRG loci [23], the Amphiphysin (AMPH) gene represents the IMGT 5′ borne of the goat TRG1 locus, while, the U6 small nuclear RNA-associated (LSM8) gene, found and at the 3′ end, may be a candidate gene for the IMGT 3′ borne (Fig. 5).
The TRG2 locus is 110 kb long and contains four V-J-J-C cassettes (Fig. 5), one cassette more than the corresponding sheep locus [26, 41]. We classified the four cassettes, proceeding from the 5′ to 3′ end of the locus as TRGC1, TRGC2A TRGC2B and TRGC6 based on sequence analysis and on homology with the corresponding sheep cassettes. Notably, while the last cassette is clearly homologous to the sheep TRGC6 for the exact correspondence between the genes, less obvious is the correspondence of the first three goat cassettes. A sequence comparison by the Kalign program (at https://www.ebi.ac.uk/), showed that these three cassettes conserve a high level of nucleotide identity (> 96%) as result of recent duplication events, with the highest percentage between the second and the third cassettes (97,32%). These data are also supported by the phylogenetic tree shown in Additional file 12: Figure S4, in which the goat TRGC1, TRGC2A TRGC2B are compared to each other and to the corresponding sheep TRGC1 and TRGC2 cassettes. The tree shows that the goat and sheep sequences form two distinct monophyletic groups in accordance with the phylogeny. Moreover, the goat TRGC2A and TRGC2B cassettes are tightly related and form a paraphyletic group with respect to the TRGC1 cassette in the same species. For this reason, we classified the second and the third cassettes as TRGC2A and TRGC2B, whereas the first cassette is named TRGC1 in accordance with the sheep locus. Unfortunately, the fragmented nature at the 5′ region of the TRG2 locus in CHIR_2.0 prevented us to confirming the presence of the additional cassette in this genomic assembly.
The four cassettes have a typical organization, consisting of one TRGV, two TRGJ and one TRGC genes. All genes are classified as functional, with the exception of the TRGJ2A-1 ORF and the TRGC2A pseudogene (Additional file 11: Table S5). Three out of four TRGV genes belong to a same subgroup (TRGV5). Located 15 kb from the last TRGC gene is the Related to steroidogenic acute regulatory protein D3-N-terminal like (STARD3NL) gene. The IMGT 5′ borne is not detectable since no gene has been annotated that could be used as a marker associated with the TRG2 locus. The cortactin binding protein 2 (CTTNBP2) gene was annotated in position 69,617,142..69789671, approximately 495 kb from TRG2.
The classification, position and predicted functionality of all TRG genes are reported in Additional file 11: Table S5.
Classification of the TRG genes
Taking into account both TRG loci, a total of 14 germline TRGV genes were annotated and assigned to 11 distinct subgroups on the basis of nucleotide identity. All subgroups consist of a single TRGV gene, with the exception of the TRGV3 and TRGV5 subgroups, which consist of two and three genes, respectively. The subgroups were classified by comparing the nucleotide sequences of the goat genes and sheep genes (Additional file 13: Figure S5). Therefore, according to the high percentage of nucleotide identity (> 95%), we classified each goat TRGV subgroup in accordance with its corresponding sheep subgroup.
The deduced amino acid sequences of all goat TRGV genes, except the TRGV11 pseudogene, are shown in Additional file 14: Figure S6A with the corresponding sheep genes. The structure of each goat gene is highly similar to that of the corresponding sheep gene with few amino acid variations. As in sheep, the goat TRGV1 and TRGV5 subgroups exhibit leucine 41 instead of tryptophan 41 in FR2-IMGT. Similarly, the hydrophobic amino acid 89 is leucine, phenylalanine or methionine within FR3-IMGT of the corresponding genes.
Additional file 15: Figure S6B reports the nucleotides and deduced amino acid sequences of all TRGJ genes identified in the region. All the functional genes are 50 or 60 bp long, except for TRGJ5–1 and TRGJ4–1, which are 53 and 47 bp long, respectively. The functional genes conserve the canonical FG(N or K)XG(A) amino acid motif, whose presence defines the functionality of J genes. The only exceptions are the TRBJ5–2 and TRGJ2A-1 genes, which have a different amino acid in the last position of the motif and were therefore classified as ORFs (Additional file 11: Table S5). The TRGJ3–2 is the only pseudogene, based on the presence of a stop codon. Each TRGJ gene is flanked by a 12 RS at the 5′ end and by a donor splice site at the 3′ end. The RS are less conserved in various positions with respect to the consensus sequence, such as the third nucleotide of the “cac” sequence of the heptamer or the poly-A tract of the TRGJ6–1 nonamer (Additional file 15: Figure S6B). However, a comparison between J-RS of the goat and the sheep functional TRGJ genes shows similar nucleotide sequences in heptamers and nonamers (Additional file 17: Figure S7). The third nucleotide of the “cac” sequence shows a similar frequency pattern between goat and sheep J-heptamers. Similarly, the poly-A tract of the nonamers is comparable.
The structure of the goat TRGC genes is similar to that of the orthologous sheep genes (Additional file 16: Figure S6C). The first exon (EX1) encodes the C-domain, while the first part of the connecting region is encoded by one (EX2A), two (EX2A and EX2C) or three (EX2A, EX2B and EX2C) exons. As in the sheep gene sequences, the EX2A of all TRGC genes (excluding the TRGC5 gene) contains a TTE(K or G)P(A) P motif in single, duplicate or triplicate form. The only difference between the goat and sheep sequences is the presence of a duplicate motif within the goat TRGC1 gene where the corresponding sheep gene contains a single motif (Additional file 16: Figure S6C). The remaining portion of the connecting region, the transmembrane region, and the cytoplasmatic region are encoded by EX3.
Comparison of the goat and sheep TRG loci
We compared the genomic structure of the goat and sheep TRG loci by aligning the whole goat sequences (from the first TRGV gene to the last TRGC gene located at the 3′ end) with their sheep TRG1 (Additional file 18: Figure S8A) and TRG2 (Additional file 19: Figure S8B) counterparts using the PipMaker program and evaluating their alignment, which is expressed as a dot-plot sequence comparison graph.
The resulting goat/sheep TRG1 locus dot-plot matrix shows an almost perfect main diagonal line, with only a few small interruptions, indicating a high level of identity for which each base matches with itself. The largest break is due to a deletion in the sheep TRGV2-TRGV9 region (blue triangle in Additional file 18: Figure S8A). Moreover, parallel lines of similarity were found in correspondence with each of the J-J-C regions (red boxes), which comprise the enhancer-like (En) elements that were found at the 3′ end of each sheep TRGC gene (green rectangles). Reduced lines of similarity with the J-J-C-En regions are also present between the TRGV2 and TRGV9 genes (intermittent red line box in Additional file 18: Figure S8A), likely due to a reminiscence of a complete cassette. Furthermore, the V-cluster of the goat and sheep TRGC5 cassettes show extensive collinearity between them with no evident similarity lines with the TRGV gene sequences of the TRGC3 and TRGC4 cassettes. This is in accordance with the hypothetical evolutionary model for the ovine TRG loci described by Vaccarelli et al. [26] that designates the V genes of the sheep TRGC5 cassette as the result of ancestral duplicative events involving only these V genes. Conversely, dashes of homology (blue boxes in Additional file 18: Figure S8A) highlight the similarity between the TRGV genes belonging to the goat and sheep TRGC3 and TRGC4 cassettes.
Additionally, the comparison of the genomic structure of the goat and sheep TRG2 loci shows an evident similarity confirmed by the length of the uninterrupted lines corresponding to the homologous gene cassettes (Additional file 19: Figure S8B). It is possible to trace along the matrix a continuous main pseudo-diagonal starting from the continuous homology line between the TRGC1 cassette (arrow 1 in Additional file 19: Figure S8B) across the two parallel lines of homology between the goat TRGC2A and TRGC2B and the single sheep TRGC2 cassette (red rectangle in Additional file 19: Figure S8B), to the corresponding homologous line between the TRGC6 cassettes (arrow 3 in Additional file 19: Figure S8B). The only interruptions are due to the missing EX2 in the TRGC2A gene with respect to the sheep TRGC1 and TRGC2 genes and to a deletion in the sheep TRGJ6-TRGC6 region (violet triangles in Additional file 19: Figure S8B).
Finally, the homology of each cassette in the TRG2 locus extends in correspondence of the sheep enhancer sequences (green rectangles in Additional file 19: Figure S8B).