Comparative genomic analysis of eutherian fibroblast growth factor genes

Premzl, Marko

doi:10.1186/s12864-020-06958-4

Research article
Open access
Published: 05 August 2020

Marko Premzl ORCID: orcid.org/0000-0002-3362-689X¹

BMC Genomics volume 21, Article number: 542 (2020)

Abstract

Background

The eutherian fibroblast growth factors were implicated as key regulators in developmental processes. However, there were major disagreements in descriptions of comprehensive eutherian fibroblast growth factor gene data sets, which included either 18 or 22 homologues. The present analysis attempted to revise and update comprehensive eutherian fibroblast growth factor gene data sets, and to address and resolve major discrepancies in their descriptions, using the eutherian comparative genomic analysis protocol and 35 public eutherian reference genomic sequence data sets.

Results

Among 577 potential coding sequences, the tests of reliability of eutherian public genomic sequences annotated the most comprehensive curated eutherian third-party data gene data set of fibroblast growth factor genes, including 267 complete coding sequences. The present study first described 8 superclusters including 22 eutherian fibroblast growth factor major gene clusters, proposing their updated classification and nomenclature.

Conclusions

The integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis argued that comprehensive eutherian fibroblast growth factor gene data set classifications included 22 rather than 18 homologues.

Background

The eutherian fibroblast growth factors or FGFs were implicated as key developmental regulators [1,2,3]. First, the 15 paradigmatic paracrine or canonical fibroblast growth factors FGF1–10, FGF16–18, FGF20 and FGF22 were described as ligands to single-chain receptor tyrosine kinases named FGF receptors or FGFRs [2,3,4,5,6,7,8,9,10,11]. After paracrine FGF ligand and heparan sulphate glycosaminoglycan binding, the dimerized FGFRs become activated through autophosphorylation, interacting with cytosolic adaptor proteins and intracellular signaling cascades. Such transmembrane signal transduction was implicated in regulation of embryogenesis, implantation, gastrulation, body plan formation, branching morphogenesis and organogenesis, as well as in pathogeneses of human hereditary diseases including deafness, Kallmann syndrome, lacrimo-auriculo-dentodigital syndrome and different skeletal syndromes, and in tumorigenesis. Second, there were 3 endocrine fibroblast growth factors FGF19, FGF21 and FGF23 binding FGFRs and klotho protein cofactors [2, 3, 7, 12]. The endocrine FGFs were implicated in metabolism regulation including phosphate and vitamin D homeostasis, cholesterol and bile acid homeostasis and glucose and lipid homeostasis, as well as in pathogenesis of autosomal dominant hypophosphataemic rickets. Third, the 4 intracellular fibroblast growth factors named fibroblast growth factor homologous factors or FHFs included FGF11 or FHF3, FGF12 or FHF1, FGF13 or FHF2 and FGF14 or FHF4 [1, 3, 13,14,15,16]. The intracellular FGFs were described as regulators of nervous system development and function including integration and encoding of complex synaptic inputs into action potential outputs in central nervous system neurons, and implicated in pathogenesis of early-onset spinocerebellar ataxia.
The molecular evolution and protein structure analyses indicated that eutherian FGFs folded into β-trefoil protein tertiary structures including 11 or 12 β-strands [1,2,3, 7, 12, 13, 17,18,19,20,21,22,23,24,25,26,27,28]. However, there were major disagreements in descriptions of comprehensive eutherian FGF gene data sets. Specifically, Belov and Mohammadi [2] and Beenken and Mohammadi [7] argued that bona fide eutherian FGF homologues included 18 secreted paracrine and endocrine FGFs. On the other hand, the eutherian FGF classifications by Goldfarb [1] and Ornitz and Itoh [3] included both 18 secreted FGFs and 4 intracellular FGFs.

Undoubtedly, the public eutherian reference genomic sequence data sets advanced biological and medical sciences [29,30,31,32,33,34]. Indeed, the comparative genomics momentum was maintained by considerable international efforts in production and analysis of public eutherian reference genomic sequence data sets. For example, the initial sequencing and analysis of human genome attempted to revise and update human genes, and uncover potential new drugs, drug targets and molecular markers in medical diagnostics [35, 36]. Nevertheless, due to the incompleteness of eutherian reference genomic sequence assemblies [35, 37] and potential genomic sequence errors [36, 38], future updates and revisions of public eutherian reference genomic sequence data sets were expected. Inevitably, the potential genomic sequence errors including analytical and bioinformatical errors (erroneous gene annotations, genomic sequence misassemblies) and Sanger DNA sequencing method errors (artefactual nucleotide deletions, insertions and substitutions) could compromise unquestionable utility of public eutherian reference genomic sequence data sets. For example, Gajer et al. [39] described so-called lexicographical bias in some genomic sequence assemblers. In addition, the potential genomic sequence errors affecting phylogenetic analyses [40] were observed more frequently in reference genomic sequence assemblies including lower genomic sequence redundancies [41,42,43]. Thus, the eutherian comparative genomic analysis protocol was established as guidance in protection against potential genomic sequence errors in public eutherian reference genomic sequence data sets [44,45,46]. Using public eutherian reference genomic sequence data sets, the protocol published new test of reliability of public eutherian genomic sequences using genomic sequence redundancies, and new test of protein molecular evolution using relative synonymous codon usage statistics. 
The protocol revised and updated 12 eutherian gene data sets implicated in major physiological and pathological processes, including 1853 published complete coding sequences. Of note, genomic sequence redundancies of the 35 public eutherian reference genomic sequence data sets correlated positively with the numbers of published complete coding sequences [46].

Therefore, the present analysis attempted to revise and update comprehensive eutherian FGF gene data sets, and address and resolve major disagreements in their descriptions using eutherian comparative genomic analysis protocol and 35 public eutherian reference genomic sequence data sets.

Results

Gene annotations

The tests of reliability of eutherian public genomic sequences annotated 267 FGF complete coding sequences among 577 FGF potential coding sequences (Fig. 1). The most comprehensive curated eutherian FGF third-party data gene data set was deposited in European Nucleotide Archive under accessions: LR130242-LR130508 [47, 48] (Additional file 1).

The present study first described 8 superclusters FGF1–8 including 22 major gene clusters of eutherian FGF genes, proposing their updated nomenclature (Fig. 1). The supercluster FGF1 included 4 major gene clusters FGF1A (11 FGF12 or FHF1 genes), FGF1B (9 FGF14 or FHF4 genes), FGF1C (11 FGF13 or FHF2 genes) and FGF1D (15 FGF11 or FHF3 genes) (Additional file 2A-D). The supercluster FGF2 included 2 major gene clusters FGF2A (8 FGF2 genes) and FGF2B (20 FGF1 genes) (Additional file 2E-F). The supercluster FGF3 included 1 major gene cluster FGF3A (17 FGF5 genes) (Additional file 2G). The supercluster FGF4 included 3 major gene clusters FGF4A (11 FGF20 genes), FGF4B (16 FGF9 genes) and FGF4C (14 FGF16 genes) (Additional file 2H-J). The supercluster FGF5 included 4 major gene clusters FGF5A (14 FGF10 genes), FGF5B (16 FGF7 genes), FGF5C (7 FGF3 genes) and FGF5D (9 FGF22 genes) (Additional file 2K-N). The supercluster FGF6 included 3 major gene clusters FGF6A (5 FGF18 genes), FGF6B (12 FGF17 genes) and FGF6C (7 FGF8 genes) (Additional file 2O-Q). The supercluster FGF7 included 2 major gene clusters FGF7A (8 FGF4 genes) and FGF7B (17 FGF6 genes) (Additional file 2R-S). Finally, the supercluster FGF8 included 3 major gene clusters FGF8A (12 FGF19 genes), FGF8B (12 FGF23 genes) and FGF8C (16 FGF21 genes) (Additional file 2T-V).

The present study included new genomics tests of contiguity of eutherian public genomic sequences that analysed numbers of coding exons in FGF genes and their relative orientation (Additional files 1 and 2). The analysis including 903 FGF coding exons indicated that there were no coding exon misassemblies among 267 eutherian genomic sequences harbouring FGF complete coding sequences. The eutherian FGF genes included either 5 coding exons (5 major gene clusters FGF1A-D and FGF6A) or 3 coding exons (17 other major gene clusters). The eutherian FGF coding exon numbers were constant within major gene clusters, and there was no evidence of differential gene expansions, indicating that each of the 22 eutherian FGF major gene clusters included orthologues. For example, whereas the human FGF1A gene included 5 coding exons along 264,215 bp (Additional file 2A), the human FGF7A gene included 3 coding exons along 1776 bp (Additional file 2R).
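The reported totals can be cross-checked with simple arithmetic. The following minimal sketch transcribes the per-cluster gene counts and the 5-versus-3 coding exon rule from the text above, and recomputes the numbers of superclusters, major gene clusters, complete coding sequences and coding exons. The names `CLUSTER_COUNTS`, `FIVE_EXON_CLUSTERS`, `summarize` and `accession_span` are hypothetical helpers introduced for illustration only and are not part of the published protocol.

```python
# Gene counts per major gene cluster, transcribed from the Results text.
CLUSTER_COUNTS = {
    "FGF1": {"FGF1A": 11, "FGF1B": 9, "FGF1C": 11, "FGF1D": 15},
    "FGF2": {"FGF2A": 8, "FGF2B": 20},
    "FGF3": {"FGF3A": 17},
    "FGF4": {"FGF4A": 11, "FGF4B": 16, "FGF4C": 14},
    "FGF5": {"FGF5A": 14, "FGF5B": 16, "FGF5C": 7, "FGF5D": 9},
    "FGF6": {"FGF6A": 5, "FGF6B": 12, "FGF6C": 7},
    "FGF7": {"FGF7A": 8, "FGF7B": 17},
    "FGF8": {"FGF8A": 12, "FGF8B": 12, "FGF8C": 16},
}

# Major gene clusters reported with 5 coding exons; all others have 3.
FIVE_EXON_CLUSTERS = {"FGF1A", "FGF1B", "FGF1C", "FGF1D", "FGF6A"}


def summarize(counts):
    """Recompute data set totals from the per-cluster gene counts."""
    clusters = {name: n for sc in counts.values() for name, n in sc.items()}
    exons = sum(
        n * (5 if name in FIVE_EXON_CLUSTERS else 3)
        for name, n in clusters.items()
    )
    return {
        "superclusters": len(counts),
        "clusters": len(clusters),
        "genes": sum(clusters.values()),
        "exons": exons,
    }


def accession_span(first, last, prefix="LR"):
    """Number of accessions in an inclusive range such as LR130242-LR130508."""
    return int(last[len(prefix):]) - int(first[len(prefix):]) + 1


if __name__ == "__main__":
    print(summarize(CLUSTER_COUNTS))
    print(accession_span("LR130242", "LR130508"))
```

Under these transcribed counts, the sums agree with the 8 superclusters, 22 major gene clusters, 267 complete coding sequences and 903 coding exons stated in the text, and the accession range LR130242-LR130508 spans the same number of entries as there are complete coding sequences.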

Therefore, the present study annotating 22 eutherian FGF major gene clusters agreed with Goldfarb [1] and Ornitz and Itoh [3] but disagreed with Belov and Mohammadi [2] and Beenken and Mohammadi [7].

Phylogenetic analysis

The present minimum evolution phylogenetic tree calculations (Fig. 1) and calculations of pairwise nucleotide sequence identity patterns (Additional file 3) first classified 22 eutherian FGF major gene clusters among 8 superclusters FGF1–8. The clustering of major gene clusters FGF1A-D within supercluster FGF1 agreed with subfamily FGF11 descriptions [3, 23], Smallwood et al. [13], Ornitz and Itoh [21], subfamily Fgf11/12/13/14 description [25] and Nam et al. [28]. The clustering of major gene clusters FGF2A-B within supercluster FGF2 agreed with subfamily FGF1 descriptions [3, 23], Smallwood et al. [13], Coulier et al. [17], Ornitz and Itoh [21], subfamily Fgf1/2 description [25] and Nam et al. [28]. The supercluster FGF3 description including 1 major gene cluster FGF3A agreed with Nam et al. [28] but disagreed with phylogenetic analyses of Ornitz and Itoh [3, 21], Coulier et al. [17] and Itoh and Ornitz [23, 25]. The clustering of major gene clusters FGF4A-C within supercluster FGF4 agreed with subfamily FGF9 descriptions [3, 23], Ornitz and Itoh [21] and subfamily Fgf9/16/20 description [25] but disagreed with Nam et al. [28]. The clustering of major gene clusters FGF5A-D within supercluster FGF5 disagreed with phylogenetic analyses of Ornitz and Itoh [3, 21], Itoh and Ornitz [23, 25] and Nam et al. [28]. The clustering of major gene clusters FGF6A-C within supercluster FGF6 agreed with subfamily FGF8 descriptions [3, 23], Ornitz and Itoh [21], subfamily Fgf8/17/18 description [25] and Nam et al. [28]. The clustering of major gene clusters FGF7A-B within supercluster FGF7 agreed with Smallwood et al. [13], Coulier et al. [17], Ornitz and Itoh [21] and Nam et al. [28] but disagreed with Ornitz and Itoh [3] and Itoh and Ornitz [23, 25]. Finally, the clustering of major gene clusters FGF8A-C within supercluster FGF8 agreed with Ornitz and Itoh [21] but disagreed with Ornitz and Itoh [3], Itoh and Ornitz [23, 25] and Nam et al. [28].

Indeed, the calculations of pairwise nucleotide sequence identity patterns confirmed present phylogenetic classification of eutherian FGF genes (Additional file 3). The eutherian FGF gene data set included average pairwise nucleotide sequence identity ā = 0.3 (a_max = 1, a_min = 0.115, ā_ad = 0.094) [1,2,3, 7, 12, 13, 17, 21, 23, 25,26,27,28]. Among 22 eutherian FGF major gene clusters respectively, there were nucleotide sequence identity patterns of very close eutherian orthologues (FGF1A-B, FGF4B), close eutherian orthologues (FGF1C-D, FGF2A-B, FGF4A, FGF4C, FGF5B, FGF6A, FGF7B), typical eutherian orthologues (FGF3A, FGF5A, FGF5C-D, FGF6B-C, FGF7A, FGF8A, FGF8C) and distant eutherian orthologues (FGF8B). In comparisons between eutherian FGF major gene clusters within superclusters, there were nucleotide sequence identity patterns of very close eutherian homologues (superclusters FGF1–2, FGF4, FGF7), very close and close eutherian homologues (supercluster FGF6), close and typical eutherian homologues (supercluster FGF5) and typical eutherian homologues (supercluster FGF8). Finally, in comparisons between eutherian FGF major gene clusters between superclusters, there were nucleotide sequence identity patterns of close, typical, distant and very distant eutherian homologues.
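The summary statistics of pairwise nucleotide sequence identities can be illustrated with a short sketch. It assumes pre-aligned sequences of equal length, skips alignment columns in which either sequence has a gap, and reads ā_ad as the average absolute deviation from the mean; none of these choices is specified in the text, so the calculation actually used by the protocol may differ. The function names `pairwise_identity` and `identity_summary` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical positions among columns without gaps.

    Assumption: gapped columns are skipped (pairwise deletion).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [
        (x, y) for x, y in zip(a.upper(), b.upper()) if x != gap and y != gap
    ]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def identity_summary(aligned):
    """Mean, maximum, minimum and average absolute deviation of identities."""
    values = [pairwise_identity(a, b) for a, b in combinations(aligned, 2)]
    avg = mean(values)
    return {
        "mean": avg,
        "max": max(values),
        "min": min(values),
        "avg_abs_dev": mean(abs(v - avg) for v in values),
    }


if __name__ == "__main__":
    # Toy alignment, not real FGF data.
    toy = ["ATGC", "ATGA", "AT-A"]
    print(identity_summary(toy))
```

Grouping the resulting values into classes such as very close, close, typical and distant would require the identity thresholds of the protocol, which are not given here.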

Therefore, the present phylogenetic analysis proposed updated classification of eutherian FGF genes.

Protein molecular evolution analysis

The protein molecular evolution analysis used protein primary structure features as major alignment landmarks in eutherian FGF protein amino acid sequence alignments, including common cysteine amino acid residues, common exon-intron splice site amino acid sites and common predicted N-glycosylation sites (Fig. 2) (Additional file 4). There were between 1 and 9 common cysteine amino acid residues included among eutherian FGF major protein clusters respectively. For example, whereas the major protein cluster FGF5D included 1 common cysteine amino acid residue, major protein cluster FGF5A included 9 common cysteine amino acid residues. There were either 4 common exon-intron splice site amino acid sites (5 major protein clusters FGF1A-D and FGF6A) or 2 common exon-intron splice site amino acid sites (17 other major protein clusters) among eutherian FGF major protein clusters respectively. Finally, there were between 0 and 2 common predicted N-glycosylation sites among eutherian FGF major protein clusters respectively.
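As an illustration of one alignment landmark, the sketch below scans a protein sequence for candidate N-glycosylation sites. The text does not state which predictor was used, so the sketch assumes the common sequon rule N-x-S/T, where x is any residue except proline; the function name `sequon_positions` and the toy sequence are hypothetical.

```python
import re

# Lookahead pattern so that overlapping sequons are all reported.
# Assumption: N-x-S/T with x != P; the actual predictor may be stricter.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(protein):
    """Return 1-based positions of asparagines in N-x-S/T sequons."""
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]


if __name__ == "__main__":
    # Toy sequence, not a real FGF protein.
    print(sequon_positions("MNGSANPTQNCT"))
```

A site would count as common to a major protein cluster only if the same alignment column carried a sequon in every member of that cluster, which requires the protein alignments in Additional file 4.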

Next, the tests of protein molecular evolution first calculated relative synonymous codon usage statistics (R) of eutherian FGF gene data set using 267 FGF complete coding sequences (Additional file 4), and described 20 amino acid codons including R ≤ 0.7 as not preferable amino acid codons (Fig. 3a). The tests used human FGF1A protein primary structure as reference protein amino acid sequence (Fig. 3b). Among 243 human FGF1A protein amino acid residues, the tests of protein molecular evolution described 19 invariant amino acid sites, viz.: M1, C41, C55, P68, Q69, L70, K71, G72, I73, V74, T75, L77, G112, M129, G133, C145, Y159, G181 and C206, as well as 3 forward amino acid sites S101, E149 and Y208. First, the human FGF1A amino acid sites M1, L77, G133, C145 and Y159 were invariant among 267 eutherian FGF protein primary structures (except that M1 was invariant among 266 FGF protein primary structures). For example, the human FGF1A invariant amino acid sites L77, G133 and C145 were described by Goetz et al. [12, 24], Smallwood et al. [13], Coulier et al. [17], Venkataraman et al. [18], Plotnikov et al. [19] and Olsen et al. [22]. Furthermore, the human FGF1A amino acid sites G112 and M129 respectively were invariant among 21 eutherian FGF major protein clusters. For example, the human FGF1A amino acid site G112 was homologous to human FGF2B amino acid site G67 that was implicated in interactions between FGF2B ligand and FGFR2 receptor [19, 20]. In addition, the human FGF1A amino acid site G181 that was invariant among 7 eutherian FGF1–7 protein superclusters was described as first glycine amino acid residue in paracrine FGF glycine box protein amino acid sequence motif G-x(4)-G-x(2)-S/T [2]. The human FGF1A amino acid sites P68, Q69, L70, K71, G72, I73, V74 and T75 were invariant among 4 eutherian FGF1A-D major protein clusters.
For example, the human FGF1A amino acid sites K71 and I73 were described as residues engaged in voltage-gated sodium channel binding [24]. Finally, the human FGF1A forward amino acid sites S101 and E149 were described among 267 eutherian FGF protein primary structures, and forward amino acid site Y208 was described among 2 eutherian FGF1–2 protein superclusters. For example, the human FGF1A forward amino acid site E149 was homologous to human FGF2A amino acid site E105 that was implicated in hydrogen bonding between FGF2A ligand and D3 domain of FGFR2 receptor [19, 26].
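The relative synonymous codon usage statistics can be sketched as follows. The sketch uses the conventional definition, in which the observed count of a codon is divided by the count expected if all synonymous codons of that amino acid were used equally, together with the standard genetic code; whether the protocol computes R in exactly this way is an assumption. The threshold 0.7 comes from the text, whereas the names `rscu` and `not_preferable` are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def rscu(coding_sequences):
    """Conventional RSCU per sense codon, pooled over all input sequences."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1
    values = {}
    for aa in set(AMINO) - {"*"}:
        synonymous = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in synonymous)
        if total == 0:
            continue  # amino acid absent from the data set
        expected = total / len(synonymous)
        for c in synonymous:
            values[c] = counts[c] / expected
    return values


def not_preferable(values, threshold=0.7):
    """Codons with R at or below the threshold used in the text."""
    return sorted(c for c, r in values.items() if r <= threshold)


if __name__ == "__main__":
    # Toy coding sequence, not real FGF data.
    r = rscu(["CTGCTGCTGCTA"])
    print(r["CTG"], not_preferable(r))
```

Applied to the 267 complete coding sequences, a calculation of this kind would yield one R value per sense codon, from which the not preferable codons follow by thresholding.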

Therefore, the tests of protein molecular evolution using relative synonymous codon usage statistics described amino acid sites implicated as critical in FGF protein secondary, tertiary and quaternary structural features.

Discussion

The major disagreements in descriptions of comprehensive eutherian FGF gene data sets included classifications of either 18 FGF genes [2, 7] or 22 FGF genes [1, 3]. The present analysis attempted to address and resolve these discrepancies using eutherian comparative genomic analysis protocol and public eutherian reference genomic sequence data sets [29,30,31,32,33,34,35,36, 44,45,46]. The advantages of eutherian reference genomic sequence data sets were well established phylogeny [29, 30, 34] and calibrated taxon sampling including genomic sequence redundancies that were applicable in tests of reliability of eutherian public genomic sequences [31,32,33]. Therefore, the tests of reliability of eutherian public genomic sequences annotated most comprehensive curated eutherian third-party data gene data set of FGF genes that included 267 complete coding sequences among 577 potential coding sequences. Second, the present study first described 8 superclusters of eutherian FGF genes that included 22 major gene clusters, proposing their updated nomenclature. Third, the new genomics tests of contiguity of eutherian public genomic sequences included 903 coding exons, and annotated either 3 or 5 coding exons in eutherian FGF genes including no evidence of differential gene expansions. Fourth, the present phylogenetic analysis proposed updated classification of eutherian FGF genes. Finally, the tests of protein molecular evolution using relative synonymous codon usage statistics described 19 invariant amino acid sites and 3 forward amino acid sites in reference human FGF1A protein primary structure, including amino acid residues described as critical in FGF protein secondary, tertiary and quaternary structural features. 
In conclusion, the present comparative genomic analysis integrating gene annotations, phylogenetic analysis and protein molecular evolution analysis argued that 22 FGF genes [1, 3], rather than 18 FGF genes [2, 7], were included in comprehensive eutherian FGF gene data set classifications.

Methods

Eutherian comparative genomic analysis protocol

The eutherian comparative genomic analysis protocol RRID:SCR_014401 integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis with tests of reliability of eutherian public genomic sequences, tests of contiguity of eutherian public genomic sequences and tests of protein molecular evolution into one framework of eutherian gene descriptions (Fig. 4) [44,45,46].