Phylogeny of teleost connexins reveals highly inconsistent intra- and interspecies use of nomenclature and misassemblies in recent teleost chromosome assemblies

Background: Based on an initial collection of database sequences from the gap junction protein gene family (also called connexin genes) in a few teleosts, the naming of these sequences appeared variable. The reasons could be (i) that the structure of this family is variable across teleosts, or (ii) unfortunate naming. Rather clear rules for the naming of genes in fish and mammals have been outlined by nomenclature committees, including the naming of orthologous and ohnologous genes. We therefore analyzed the connexin gene family in teleosts in more detail. We covered the range of divergence times in teleosts (eel, Atlantic herring, zebrafish, Atlantic cod, three-spined stickleback, Japanese pufferfish and spotted pufferfish; listed from earliest to latest divergence).

Results: The gene family pattern of connexin genes is similar across the analyzed teleosts. However, (i) several nomenclature systems are used, (ii) specific orthologous groups contain genes that are named differently in different species, (iii) several distinct genes have the same name in a species, and (iv) some genes have incorrect names. The latter includes a human connexin pseudogene, annotated as GJA4P, which in reality is Cx39.2P (a delta subfamily gene often called GJD2like). We point out the ohnologous pairs of genes in teleosts, and we suggest a more consistent nomenclature following the rules outlined by the nomenclature committees. We further show that connexin sequences can indicate some errors in two high-quality chromosome assemblies that became available very recently.

Conclusions: Minimal consistency exists in the present practice of naming teleost connexin genes. A consistent and unified nomenclature would be an advantage for future automatic annotations and would make various types of subsequent genetic analyses easier. Additionally, roughly 5% of the connexin sequences point to misassemblies in the new high-quality chromosome assemblies from herring and cod.

The sequences are ordered according to the Greek nomenclature. Both the Greek nomenclature and the size nomenclature are indicated, and the GenBank accession number is given for each entry.
Yellow: Conserved domains as defined by Cruciani and Mikalsen (2007). Green: Conserved cysteine codons (cysteine signature). Grey: 15 nt added at the ends of the conserved domains. Turquoise: Splice site. Other colors are explained where necessary.
>Fr-gjc1-45-XM_003964814 Modified according to the prediction in Ensembl (which has omitted other parts of the sequence). In July 2019 this sequence was made obsolete and replaced by XM_029836267 (and other transcript variants). There is only one nucleotide difference between our modification of XM_003964814 and the new XM_029836267, marked in purple. As we used the accession number XM_003964814 in all analyses (which were done before July 2019), we keep the obsolete accession number for the purpose of this manuscript.
>Fr-gje1-XM_011611785 This prediction is probably erroneous with regard to the first exon. We have replaced it with the more likely first exon, which is separated from exon 2 by an intron of approx. 235 nt.

As far as possible, the names of the sequences are taken from the Ensembl predictions. Where there is a prediction without a name (although we might have modified the prediction), we include NN (no name) as a prefix, use the most common name of the orthologous sequence (usually from zebrafish), and end the name with an abbreviated Ensembl gene prediction number. Where there is no prediction in Ensembl and no named predicted (or experimentally found) sequence in GenBank, we include NP (not predicted) as a prefix and use the most common name of the orthologous sequence (usually from zebrafish). The Ensembl gene identifier is abbreviated as follows: ENSNIG00000015676 = G15676.
Yellow: Conserved domains as defined by Cruciani and Mikalsen (2007). Green: Conserved cysteine codons (cysteine signature). Grey: 15 nt added at the ends of the conserved domains. Turquoise: Splice site. Other colors are explained where necessary.
>Gm-NN-gja1-G09844 Our modification.Underlined: Predicted as intron by Ensembl; here included as part of cds. Lower case letters: Located on an unplaced contig (in the scaffold, there is just a row of Ns). The chromosomal cod assembly in GenBank (GCF_902167405) and the subsequent gene prediction XM_030362165 confirmed that the lower case letter sequence indeed is a likely part of the cds. In fact, XM_030362165 predicts that the lower case sequence should be extended approx 90 nt in 5´-direction. Splice site. ATGGGAGACTGGAGCGCTCTGGGGAAACTGCTGGACAAAGTCCAGGCCTACTCCACAGCC  GGAGGCAAGGTATGGCTCTCCGTCCTCTTCATCTTCCGTATCCTGGTCATCGGTACTGCG  GTGGAGTCTGCGTGGGGCGACGAGCAGTCGGCCTTCAAGTGCAACACCGCCCAGCCGGGC  TGTGAGAACGTCTGCTACGACAGCTCCTTCCCCATCAGCCACGCACGCTTCTGGGTCCTG  CAGATCATCTTCGTCTCCACGCCAACGCTGCTCTACCTCTGCCACATCTTCTACCTCATC  CACAAGGAGGAAAAGatgaagtacggcatcgagaagaacggaaaggtgaagatgaaagga  gctctgctcaggacctacatcttcagcatcctgctcaagtccttctttgagGTGGGCTTC  CTACTGCTGCAGTGGCACATCTACGGCTTCAGCCTGGCGTCGCGCTACGAGTGCGAGGCG  TACCCCTGCCCCCACCGCACCGACTGCTTCCTGTCGCGGCCCACCGAGAAGACCATCTTC  ATCGTCTTCATGCTGGTGGTCTCCCTGGTGTCCCTGCTGCTCAACCTCATCGAGCTCTTC  TACGTCACCTACAAGTGGGTCAAGGACACCATGAGGGCGTCCGAGGGCCAGCAGCTCCAC  CCCCGCCTCCGCCTGCTGCCGGGGGCCGGAGGAGGAGTGGGAGGAGGAGGAGTGGGAGGA  GAAGAAGGAGGAGCGCCGTACCACTACTGCAACGGCTGCCCCCCCCCCTCCGCCCCTGTC  TACAACCTGGATGCCACGGCGACGGTGGCAAGGGGCGACTCGGTGAACCACTACAACAAG  ACGGCGAGCGAACAGAACTGGACCAACTTCAGCACGGAGCAGAACCAGCTGGGCCGCTCC  CCGCCCCGTCGCCACGGCAGCCAGCGCAGTACCGGCAAGAACAACAACAACAACAATAAC  CACAACAACAAGGCCGGCGCCAACGCCAGTGACTGCAACCGCGACAGCCCTGCCTTCCTG  GGCCCCGCCTTCGTAGGCCCCGCCCACGGCCAGCCACCGCCGTCGGACAAGGTGGAGACC  AAGGAGCTTCACCTCCTCCGGGGGCTGGAGCCGCGGCCCGGCAGCCGCGCCTCCAGCCGC  GCCCGCACTGACGACCTGGACATCTGA >Gm-cx43-G20304 Our modification. Extended in 3'-direction. No reasonable stop codon in frame, but in other reading frames, there are translated sequences that become reasonable similar with other GJA1 orthologs. Hence, potential small intron or sequencing error towards 3'-end . 
 ATGGGTGACTGGAGTGCTCTGGGCCGCCTGCTGGACAAGGTCCAGGCCTACTCCACCGCT  GGGGGGAAGGTGTGGCTCTCCGTCCTCTTCATCTTCAGGATCCTGGTCCTTGGGACGGCC  GTGGAGTCCGCCTGGGGCGACGAGCAGTCGGCCTTCAACTGCAACACTCAGCAGCCCGGC  TGCGAGAACGTATGCTATGACAAATCCTTCCCCATCTCCCATGTGCGCTTCTGGGTGCTG  CAGATCATCTTCGTGTCCACGCCCACGCTGCTGTACCTGGCCCACGTCTTCTACCTGATG  AGGAAGGAGCAGAAGCTGAACAGGAAGGAGGAAATGCTGAAGGCCGTGCAGAACGATGGC  GGCGACGTTGACATCCCGCTGAGGAAGATCGAGATGAAGAAGCTGAAGCACGGCCTGGAG  GAGCACGGCAAGGTGAAGATGAAGGGCGCCCTGCTGAGAACCTACATCGTCAGCATCTTC  TTCAAGTCCATGTTCGAGGTGGGCTTCCTGGTCATCCAGTGGTACATATACGGCTTCAGT  CTGGCAGCGGTGTACACCTGCGAGAGAGAACCCTGTCCCCACAGGGTGGACTGTTTCCTG  TCTCGGCCCACAGAGAAGACGGTGTTCATCATCTTCATGCTGGTGGTGTCGCTGGTGTCC  CTGCTGCTCAACGTCATCGAGCTCTTCTACGTGTTCTTCAAGAGGATCAAGGACCGTGTG  AAGGGCCGCCAGCCGCCCACCCTCTACCCCAGCGCTGGCACCCTGAGCCATACCCCCAAA  GATCTTTCCACAGCCAAGTACGCCTACTACAATGGCTGCTCCTCCCCCACCGCCCCGCTC  TCGCCCATGTCCCCGCCGGGCTACAAGCTGGCCACGGGCGAGCGCGGTACCGGCTCATGT  CGCAACTACAACAAGCAAGCCACCGAGCAGAACTGGACCAACTATTCCACGGAGCAGAAC  CAGCTGGGCCAGCACGGCGCGGGCAGCACTATCTCAAACTCCCACGCGCAGGCTTTTGAT  TTCCCCGACGATACGCACGAGCATAAGAAACTGACGTCATCCGCAGCTGCACACGAGATG >Gm-NN-gja3-G09100-2 Our modification. Splice sites. This Ensembl prediction contains two separate and unique connexins sequences, the present and a cx30.3 sequence. 
atgggtgactggagctttctgggacgccttctggagaatgctcaggaacactcaactgtg atcggcaaggtgtggctgaccgtcctcttcatcttccgcattctggtgctgggcgcggcc gcagaggaggtgtggggagacgagcagtcggacttcacctgcaacacgcagcagcccggt tgcgagaacgtctgctacgaccaggccttccccatctcccacgtgcgcttctgggtgctg cagatcatcttcgtgtccacgcccacgctcatctacctgggccacgtgctgcacatcgtg cgcatggaggagaagcggcgtgagaaggaggaggagctgcggaaggcgggctggcgcagc gaggagctcctcgggcaNNNNGGAGGCGGGAAGAAGGAGAGGCCGCCGATCCGCGACGAG CACGGGAAGATCCGCATCCGCGGGGCGCTGCTCCGGACCTACGTCTTCAACATCATCTTC AAGACCCTTCTGGAGGTGGGCTTCATCCTGGGCCAGTACTCCCTCTACGGCTTCCGCCTC AAGCCGCTGTACAAGTGCGGCCGCTGGCCTTGCCCCAACACGGTGGACTGCTTCATCTCC AGGCCCACTGAGAAAACCATCTTCATCATCTTCATGCTGGTGGTGGCCTGCATCTCCCTG CTGCTCAACCTGCTAGAGATGTACCACCTGGGCTGGAAGAAGGTCAAACACAGCGTCACC CACAAGTTCGCGGCTGACTGCGGGTCCCTGCGGCTGGGCCCCGGCGACGACGCCGGCGAC CCCCGGGCGGTCCCCGAGTGCGCCACCCTGGTTTCGGACCACTGCCTGCAAGGCTACACC GGCAGGAGCACCATGGAGCGGGTCCGCTACCTGCCCGTCCAGAACTCCTC >Gm-gja3-G04087 Our modification. Ensembl-predicted introns are included (underlined). There is probably an intron or something wrong in the 3'-end (after the conserved domain), but we have not tried to solve the problem here. In the first conserved domain at the position indicated by lower case "ga", the Ensembl sequence indicates a row of approx. 100 Ns. "ga" has been found by Blast against GenBank cod wgs. 
ATGGGCGACTGGAGCTTTCTGGGCCGGCTTCTTGAGAACGCGCAGGAGCACTCGACGGTG ATCGGCAAGGTCTGGCTCACCGTCCTCTTCATCTTCCGCATCCTAGTGCTGGGTGCCGCA  GCAGAGGAGGTGTGGGGCgaCGAGCAGTCGGACTTCACCTGCAACACGCAGCAGCCCGGT  TGCGAGAACGTCTGCTATGACCAGGCCTTCCCCATCTCCCACATCCGCTTCTGGGTGCTG  CAGATCATCTTTGTGTCCACTCCCACGCTCATCTACCTGGGCCACGTGCTGCACATCGTG  CGCATGGAGGAGAAGCGCAAGGAGAAGGAGGAGGAGCACCGCAAGGTCAGCGGGTTCCCC  GATGACAAGGAGCTGCCGTACCGGAACGGGGGCGGCGGTAAAAAGGTGAAGCCGCCGATC  AGAGACGAGCACGGCAAAATCCGCATCCGCGGGGCCTTGCTGCGTACCTACGTGTTCAAC  ATCATCTTCAAGACTCTGTTTGAGGTGGGCTTCATCCTGGGCCAGTACTTCCTGTACGGC  TTCTCGCTGCGGCCGCTCTACAAGTGCTCCCGTTGGCCGTGCCCCAACACGGTGGACTGC  TTTATCTCCAGGCCCACGGAGAAGACTATCTTCATCATATTCATGCTTGTTGTGGCTTGT  GTGTCGCTTTTACTCAACCTGCTGGAGATCTACCACCTGGGCTGGAAGAAGCTGAAGCAG  GGCGTGTACCACCCCGACCACCTGCTGCGGGCCGCCGGCCAGCTGGCCACGCCGGAGGGC  GTGGCCTCGCTAGGGGCCCCGGCTCTCCTCAACTACCCCCCCACCTACAGCCACATAGCG  GCCGGCATGGGGTCCCCCACCGACGCCGAGTTCAAGATGGAGGAGCTCCAGCGGGAGGAG  GGGGCGCGGACGCCTCCCCCGACTCCCCCGGCCGCCCACTACTACATCAGCAGCAACAAC  AACCACCGTCTGGCCGCAGAGCAGAACTGGGCCAACCTGGCCACCGAGCAGCACACCCGC  CAGATGAAGGCCACCTCCCCCACCCCCACGTCCTTCTCCTCCTCAAGCAGTGAAGCGGCC  CCGCCCTGCTCAACTAGCCCCACCCCCTTAATGGCAACCCCGGGCAACGCTGCAGCCCCC  GGTGATGTGGCGACCAGCGGCGACGGAGCCGGCCTGACCCCCGAGCCGGGCCAGCGGGAG  GAAGAGGATGTCACCATGGCGACGGTGGAGATGCACCTGGAGGGGGTGTTCCCGGACCCC  CGGCGTCTTAGCAGAGCCAGTAGAAGCAGCATCCGCGCCCGGCACGATGACCTCGCCATC  We here use the Japanese eel linkage groups (essentially equals a chromosome level assembly) as identification in addition to the naming of each sequence.
Note that some entries present in GenBank have the wrong subfamily designation. In human and koala the sequence is said to belong to the alpha subfamily, in Egyptian rousette it is said to belong to the gamma subfamily, while in black flying fox it is said to belong to the delta subfamily (which is correct). The corresponding opossum sequence (Md-GJD2like-39.2-XM_001376506), which was the first cx39.2 sequence found in mammals (Cruciani and Mikalsen, 2005), is depicted in Suppl. Fig. 3. Several of these sequences are (supposed) pseudogenes (indicated by GJA4P, cx39.2P). To show the exact alignments of these sequences as used in the phylogenetic analyses, which showed that they belong to a single orthologous group, we have indicated the gaps (-). The gaps have been adjusted to fit the codon borders as far as possible. Be aware that most alignment tools remove gaps before performing an alignment. The previously non-predicted sequences (indicated by "NP") were found by blasting other cx39.2 sequences against Ensembl genomes or GenBank wgs (using Placentalia, marsupials, bats, or Afrotheria as species groups).

Suppl. Fig. 13. Comparisons of human "GJA4P" against connexin39.2 and GJA4.
The cx39.2 sequences given in Suppl. Fig. 12 were translated to protein and aligned. Among the pseudogenes, only the human sequence is included, as aligning several pseudogenes strongly decreases the total number of identities (*) or similarities (: or .). The corresponding sequence from eel (cx39.2) was also included. ? corresponds to a codon that contains one or more unknown nucleotides or a gap. < corresponds to a stop codon. n: the first conserved domain is N-terminal to n, and the second conserved domain is C-terminal to n; thus n corresponds largely to the intracellular loop. The Muscle (https://www.ebi.ac.uk/Tools/msa/muscle/) identity matrix is found in Suppl. Table 8.

Suppl. Fig. 13B. Alignment of conserved domains in human "GJA4P" (NG_026166) against GJA4 (connexin37) from human and eel at protein level.
The human GJA4P cx39.2 sequence given in Suppl. Fig. 12 was translated to protein and aligned with eel cx39.2, human GJA4, and the two eel gja4 (cx39.4) sequences. Identities (*) or similarities (: or .) are indicated below the alignment. ? corresponds to a codon that contains one or more unknown nucleotides or a gap. < corresponds to a stop codon. n: the first conserved domain is N-terminal to n, and the second conserved domain is C-terminal to n; thus n corresponds largely to the intracellular loop. The Muscle (https://www.ebi.ac.uk/Tools/msa/muscle/) identity matrix is shown in Suppl. Table 9.

Suppl. Fig. 16. Searching for positions of connexins lacking in chromosomal assemblies.
Suppl. Fig. 16A. Problem in cod assembly of chromosome 20 at assumed position of gja5. Cod scaffold HE571867 contains gja5 in position 173000-174000. This scaffold was aligned with cod chromosome 20 assembly LR633962 position 0 to 2,000,000 using the alignment option in Blast and word size 32. Dot plot is one of the options on the results page. The position of gja5 on HE571867 is indicated by the red dotted line. There is an obvious lack of alignment between the scaffold and the chromosomal assembly in the area where gja5 was expected, and there is an inversion in the sequence corresponding to the scaffold.
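The principle behind such a dot plot can be illustrated with a minimal Python sketch. It is not the BLAST algorithm used for the figure: it only records exact shared words between two sequences, on the forward strand ("+") and on the reverse-complement strand ("-"), so that a collinear region shows up as a run of "+" hits, an inversion as a run of "-" hits, and a missing region as query positions without hits. The function names are hypothetical, and the approach is meant for short toy sequences rather than megabase-sized chromosome segments.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def word_matches(query: str, subject: str, word_size: int = 32):
    """Return (query_pos, subject_pos, strand) for every exact shared word.

    Positions are 0-based start coordinates of the word in each sequence.
    Strand '-' means the reverse complement of the query word occurs in
    the subject, as expected inside an inverted segment.
    """
    # Index every word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)

    hits = []
    for i in range(len(query) - word_size + 1):
        word = query[i:i + word_size]
        for j in index.get(word, ()):
            hits.append((i, j, "+"))
        for j in index.get(reverse_complement(word), ()):
            hits.append((i, j, "-"))
    return hits


if __name__ == "__main__":
    # Toy example with a short word size; real data would use word size 32.
    for hit in word_matches("GATTACAGGC", "CCGATTACATT", word_size=7):
        print(hit)
```

Plotting the query position against the subject position of each hit gives the dot plot; a gap along the query axis around the expected gene position corresponds to the lack of alignment described above.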

Suppl. Fig. 17. Legend:
The following annotation is used for the compressed branches: UPPER CASE, mammals. The classic size nomenclature for humans is also shown in parentheses. Lower case, teleosts. For the teleosts, some of the previous or commonly used names in the group are given first, followed after the dash by our suggested Greek nomenclature name. If we find ohnologs in the group, this is indicated by a/b after the suggested name (e.g., gja1a/b). Note that ohnology may only apply to some of the investigated species (i.e., ohnologs may not be found in all species). For example, for gja4, ohnology has only been established in eel. In one case (gjd5), only herring supports ohnology, which potentially means that the other member of the pair must have been lost three times: in eel (diverged before herring), in zebrafish (diverged together with herring), and in the line leading to the later-diverging fishes (the remaining species in this investigation). In two cases, we indicate that duplicated genes within the group have probably been generated by tandem gene duplication (gja12.1/2 and gja13.1/2). The tree was made using the Neighbor-Joining method at amino acid level. The substitution model was JTT, and the rate variation among sites was modelled with a gamma distribution (parameter = 1.0). To simplify the tree, all sequences within the GJE1 group were excluded, together with the pseudogenes within the Cx39.2 (GJD5) group, except for the human pseudogene with accession number NG_026166. Additionally, a single sequence that often branched off from the stem of the corresponding group was excluded (Aj-NN-32.3b-BEWY01000019). Similarly, sequences that disturbed a clear dichotomy for GJA1/gja1 (Mm-gja6-NM_01001496, Dr-gja1like-XM_NM_688906, Ch-gja1like-XM_012836783, Gm-NN-gja1-G09844, Aj-CXA1-BEWY01000007) were excluded. This gave a total of 347 sequences in this tree. The root branches of the gjd subfamily have been fused using the root function in the MEGA Tree Explorer.