Mitochondrial genome structure and composition in 70 fishes: a key resource for fisheries management in the South Atlantic

Background Phylogenetic gaps of public databases of reference sequences are a major obstacle for comparative genomics and management of marine resources, particularly in the Global South, where economically important fisheries and conservation flagship species often lack closely-related references. We applied target-enrichment to obtain complete mitochondrial genomes of marine ichthyofauna from the Brazilian coast selected based on economic significance, conservation status and lack of phylogenetically-close references. These included sardines (Dorosomatidae, Alosidae), mackerels (Scombridae) croakers (Sciaenidae), groupers (Epinephelidae) and snappers (Lutjanidae). Results Custom baits were designed to enrich mitochondrial DNA across a broad phylogenetic range of fishes. Sequencing generated approximately 100k reads per sample, which were assembled in a total of 70 complete mitochondrial genomes and include fifty-two new additions to GenBank, including five species with no previous mitochondrial data. Departures from the typical gene content and order occurred in only three taxa and mostly involved tRNA gene duplications. Start-codons for all genes, except Cytochrome C Oxidase subunit I (COI), were consistently ATG, whilst a wide range of stop-codons deviated from the prevailing TAA. Phylogenetic analysis confirmed assembly accuracy and revealed signs of cryptic diversification within the Mullus genus. Lineage delimitation methods using Sardinella aurita and S. brasiliensis mitochondrial genomes support a single Operational Taxonomic Unit. Conclusions Target enrichment was highly efficient, providing complete novel mitochondrial genomes with little sequencing effort. These sequences are deposited in public databases to enable subsequent studies in population genetics and adaptation of Latin American fish species and serve as a vital resource for conservation and management programs that rely on molecular data for species and genus-level identification. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10035-5.

. Variation in starting positions of 37 genes, including 13 protein-coding genes, 22 tRNA genes, 2 rRNA genes, and the D-loop region, across the 69 successfully assembled mitochondrial genomes.Although two genomes presented minor deviations in gene order (Oligoplites saurus with two tRNA-Met and Pseudopercis numida with three tRNA-Leu), the boxplot analysis revealed a pattern of variation across the different genes.This variation can be attributed to the differences in the evolutionary rate of different regions of the genome, which is expected due to differences in the selective pressures and functional constraints acting on these regions. .

SAMPLES
Table S1.Species sampled for this study with its locality and date.

SUPPLEMENTARY METHODS
Even though fishes are the group of animals with the major quantity of mitochondrial genomes in RefSeq GenBank database, the total amount represents only 9.7% of fish valid species [1,2].This information was checked by applying the filters "((mitochondrion [Filter]) AND fish[Organism]) AND srcdb_refseq[Properties]" in GenBank database, which returned 3552 species.There are 36.422valid fish species according to [1].
Supplementary Tables Table S2.Methodology summary containing DNA quality (concentration and integrity), library concentration and capture assessment (sequencing pools with respective i5/i7 index and sequence, and the resulting number reads).Table S3.Detailed assembling approach, showing the assembler utilized, the reference and seed sequences (when assembled with NOVOPlasty) and whether created a circularized contig or not.

Methodology
The length distributions of the raw assemblies were analyzed.Assemblies with extremely low or high lengths were screened for indications of misassembly by comparison to the lengths of the closest-related available mitogenome (CRM) sequences.The CRMs were identified by blasting the whole mtDNA sequences and retrieving the top hit sorted by percentage identity and restricted to query cover > 80%.To check the annotation process, the annotated mitochondrial genomes were manually inspected.The size of each tRNA, rRNA and CDS were compared with their CRM.Also, the stop and the start codon of each CDS were first manually inspected in AliView v1.27 to verify other parameters of misassemble, such as truncations and duplications of protein-coding genes and the presence of stop-codons in the middle of sequences.We investigated possible deviations in gene order by analyzing the start position of all genes in each genome.To verify the quality and coverage of the target-capture sequencing, the raw reads were mapped to the assembled genome using Burrows-Wheeler Aligner (BWA) [3] and SAMTools [4].To correct potential mismatches and other issues, we added the Pilon software in the mapping pipeline, a tool specially designed for assembling improvement [5].
To check the species identity of the mitochondrial genomes assembled here, a BLAST search was performed in the NCBI database using only the Cytochrome Oxidase subunit 1 (COI) sequences.The first hit was considered as the identity of the sample when with similarity >99%.A phylogenetic species identification was also performed by maximum likelihood for COI in RAxML-NG v.1.0.3-master [6] to confirm the phylogenetic placement of the samples using the monophyly as the criteria for species identification.The final sequence matrix consisted of 293 sequences and was aligned in MAFFT v7.407 (70 COI sequences from teleosts obtained here, 202 BLAST species hits, 17 COI sequences manually retrieved from GenBank and BOLD databases and 5 mammal sequences used as outgroups).COI sequences were manually retrieved from the GenBank and BOLD databases in cases where BLAST top hits did not match our query sequence species, even though sequences of the corresponding species were available in GenBank.The GTR+G+I was the nucleotide substitution model selected using MEGA X [7] according to the Bayesian information criterion [BIC;8].In cases where the COI tree did not have sufficient phylogenetic resolution, we developed phylogenies of other mitochondrial markers (Cytochrome b, Cytochrome Oxidase subunit II, 12s rRNA and 16s rRNA) for datasets containing only sequences of the genus initially assigned to the unidentified sample.
To inspect phylogenetic position, unusual groupings, and branch lengths a maximum likelihood phylogenetic analysis was performed for the whole mitochondrial genome in RAxML-NG v. 1.0.3-master [6].The final sequences matrix consisted of 267 complete mitochondrial sequences and was aligned in MAFFT v7.407 (69 sequences from teleost species obtained here that did not present any critical problems in other stages of the quality control and their top 3 BLAST species hits, the 25 complete mitochondrial genomes used to design the baits and their top 2 BLAST hits, 5 mammal sequences extracted from GenBank, one lungfish and one coelacanth).In order to remove alignment gaps, the automatic pipeline of trimAl v1.4.rev15 was applied.The GTR+G+I was the nucleotide substitution model selected according to the Bayesian information criterion (BIC; Schwarz, 1978) applied in the MEGA X software [7].
To investigate the relationship between Sardinella aurita and S. brasiliensis, two phylogenetic inferences by maximum likelihood were performed in RAxML-NG v.1.0.3-master [6]: 1) COI phylogeny including the species sequenced here and all Sardinella species available in GenBank and 2) mitogenome phylogeny including all mitochondrial genomes from Sardinella sp.available in GenBank and the ones produced here.An analysis of lineage delimitation was also performed within the Sardinella species group using the same dataset applied for the COI phylogenetic analysis and a second dataset including 12S sequences by applying two methods: 1) the Automatic Barcode Gap Discovery (ABGD; Puillandre et al. 2012).The ABGD method creates a genetic distance matrix (intra-and interspecific pair-wise distances) based on a pre-defined barcode gap that can split lineages into possible species.The analysis was performed in the web server (https://bioinfo.mnhn.fr/abi/public/abgd/abgdweb.html), using the Jukes-Cantor (JC69) as the nucleotide substitution model, with a relative gap width of 1.5 and intraspecific prior ranges from 0.001 to 0.1 in 10 steps.
2) Single-threshold generalized mixed Yule coalescent [GMYC;9].The GMYC method computes the maximum likelihood of models incorporating interspecific diversification and intraspecific clade occurrence on ultrametric time-tree.The analysis was performed with the R package splits v. 1.0-20 [SPecies' LImits by Threshold Statistics; 10] using the ultrametric tree produced in TreeAnnotator as the input file (see details below).

Results
All the 69 genomes had all 13 protein code genes in integrity.The main start-codon was the ATG for all genes except for COI, that has a GTG start-codon in 82,6% of the genomes obtained here (Figure 5a).A GTG start-codon was also found in other genes (except for CytB, COII, COIII, ND1 and ND5), but in minor quantity.ATPase6 was the second gene with less predominancy of ATG start-codon: 11 genomes had an ATPase6 starting with CTG, 2 with GTG and 1 with TTG.CytB also showed slight variation, with 5 genomes with a CTA start-codon, 3 TTA and 2 GTG.However, more variation was observed in the stop-codons (Figure 5b), with four complete stop-codons recorded (TAA, TAG, AGA, AGGhigher to lower quantity) and two incompletes (T--and TA-).The presence of a complete stop-codon was more common in the genes COI, ATPase 8, ND1, ND4L, ND5 and ND6, with the stop-codon TAA at the end of all ND4L obtained here.The complete summary of gene annotation, size and stop-and start-codons can be found in tables S4, S7 and S8 in Supporting Information 1.
The BLAST search of the entire mtDNAs sequenced here retrieved the right congeneric species (Table S6).The complete reference mitochondrial genomes were 16,921 bp in average length (σ = 193; min: 16,617max: 17,337), the expected size for fish mtDNA (Figure 6).The average length of the CDSs and tRNAs separated is consistent with the fish average length as well.In addition, the quantity and size of protein coding (n = 13) genes, rRNA (n = 2) genes and tRNAs (n = 22) is similar to the typical teleost mitogenome (Tables S4  and S5, Supporting Information 1).
Similarity analysis and phylogenetic inference of the COI gene confirmed the identity and phylogenetic placement of all species that had their mtDNAs successfully sequenced but Umbrina canosai, which was confirmed with 16S (Figure S4), since the COI sequences from both GenBank and BOLD did not group with other Umbrina sp.sequences.We could not verify the species that do not have a sequence in GenBank or BOLD.

Outline
We collected samples of Sardinella aurita and Sardinella brasiliensis to contribute with the current discussion regarding the reclassification of the Sardinella Sardinella subgenus (Sardinella aurita, S. brasiliensis, S. lemuru, S. longiceps and S. neglecta).The analysis performed in the last main reclassification of the species N Stern, J Douek, M Goren and B Rinkevich [11] was made according to one nuclear gene (ribosomal 1 st intro gene S7) and three fragments of mitochondrial genes: 671 bp of the Cytochrome Oxidase subunit 1 (COI), 458 bp of the Cytochrome B (Cytb), and 587 bp of the ribosomal 16S.

Methodology
One specimen of each species (S. aurita and S. brasiliensis) had their mitochondrial genome successfully sequenced in the target-enrichment capture.To proceed with the lineage delimitation analyses, we first performed a phylogenetic analysis of whole mitochondrial genome of Sardinella species in RAxML-NG v. 1.0.3-master (Kozlov et al., 2019).Our dataset comprised the two mitochondrial genomes obtained here for S. aurita and S. brasiliensis, 17 Sardinella mitogenomes from GenBank database and the Engraulis ringens (MH732975.1)as the outgroup.The final matrix of 20 mitochondrial genomes was aligned in MAFFT v7.407.The GTR+G+I was the nucleotide substitution model selected, according to the Bayesian information criterion [BIC;8] applied in the MEGA X [7].
We also extracted two mitochondrial genes to perform delimitation analysis: 1) COI, that has the largest database available for analysis, and 2) 12S, which is the gene not employed in Stern et al. 2018 that has more Sardinella species available at GenBank.We added here the Sardinella brasiliensis 12S sequence to the database to enable the lineage delimitation analysis in this gene.
The datasets comprised the two sequences extracted from our mitogenomes, species from GenBank (48 for COI and 38 for 12S) and the Chirocentrus dorab (AHY80612.1 for the COI tree and AP006229.1 for the 12S tree) as the outgroup comprising a total of 51 sequences for COI and 41 for 12S.A phylogenetic tree for the COI gene was first built in IQ-TREE [12].For that, sequences were aligned in MAFFT v7.407.The HKY+G+I was selected as the nucleotide substitution model, according to the same methodology described above.

ABGD
The ABGD model calculates a pairwise distance matrix, from which initial partitions are made from a maximum intraspecific distance P given by the barcode gap detected at each partition.From the initial partitions, the same barcode gap detection method is applied recursively, creating recursive partitions that tend to comprise a larger number of lineages.The initial partitions are usually more conservative (possibly overlumpers), and the recursive ones more generalist (possibly oversplitters).The analysis was performed in the web server (https://bioinfo.mnhn.fr/abi/public/abgd/abgdweb.html), using the Jukes-Cantor (JC69) as the nucleotide substitution model, with the standard parameters of relative gap (1.5) and intraspecific prior values (Pmin = 0.001, Pmax = 0.1, P steps = 10).

GMYC
To apply the GMYC method, we first built a phylogenetic tree by Bayesian inference using BEAST 2.6.7 [14].To perform the phylogenetic inferences, we created NEXUS files corresponding to the alignment of the sequences in MEGA X [7] and then the XML format files in BEAUti 2.6.7 [14].We applied the standard parameters for analysis, including the JC69 as nucleotide substitution model, the Strict Clock as molecular clock model and the Yule model as tree prior model.The output files provided by BEAST included files with the extension ".trees", which we used as the input file in TreeAnnotator 2.6.7 [14] to generate a single tree (the maximum credibility tree -MCT) from the summary of information contained in trees sample set produced by BEAST.The Burnin' percentage parameter was set to 10%, the minimum posterior probability threshold for each node was kept at 0, and the node heights were set to the median heights option.The NEXUS-formatted MCT was applied for GMYC delimitation analysis.We conducted the GMYC analyses in the R package splits v. 10].The delimitation model was implemented using the gmyc function, which has the single threshold method as its default setting.

Results & Discussion
The phylogenetic inference of the COI gene (Figure S9) of Sardinella species separated the subgenus Sardinella Sardinella into 5 groups according to their genetic distance, and the two specimens of S. aurita and S. brasiliensis sequenced here diverge in small branch sizes, possibly indicating little genetic distance.The phylogenetic tree of the complete mitochondrial genomes (Figure S10) did not show enough taxonomic resolution for species delimitation, since the number of available sequences was too small.Based on the genetic distances calculated in the ABGD analysis of Sardinella sequences (Figure S5), the species of Sardinella analyzed with COI (n = 18) and 12S (n = 11) were split into 9 to 12 (COI) and 4 to 9 (12S) groups (Figure S6).The species Sardinella aurita, S. brasiliensis, S. longiceps and S. lemuru are grouped in all ABGD partitions with COI gene, even in the seventh recursive partition (probably an oversplitter).These results contribute to the hypothesis that the subgenus Sardinella Sardinella is one single cosmopolitan species (S. aurita).However, when we performed the analysis based on a matrix of sequences from a marker yet not evaluated before, the 12S gene, the subgenus Sardinella Sardinella has its species separated into two groups in all ABGD partitions: 1) Sardinella aurita, S. brasiliensis and S. longiceps and 2) S. lemuru and S. zunasi.The GMYC method has separated the Sardinella species into 10 lineages using COI sequences as input (Table S7) and 6 lineages using 12S rRNA sequences as input (Table S8), creating a single group for Sardinella Sardinella subgenus with COI gene and two groups with 12S gene, showing a consistent result between the two genes analyzed.Even though S. lemuru and S. zunasi share the same habitat in the Pacific Ocean, there is no record that these two species form a single clade.S. brasiliensis and S. aurita remained together in all delimitation analyses, which corroborates with the ongoing classification of S. aurita species (aurita + brasiliensis).However, S. longiceps also remained grouped with S. aurita and S. brasiliensis, which does not support the ongoing classification of S. longiceps as a single species (longiceps + neglecta).There were no S. neglecta sequences available for analysis.The overall results of the delimitation analysis point out a cosmopolitan Sardinella aurita.However, there are some inconsistencies within the analyses.We recommend new analyses applying a large set of specimens and molecular markers to clarify such incongruencies.
The remaining Sardinella species included in the analyses also demand an improved delimitation analysis with more samples and markers, as its results also highlighted some inconsistencies between genes, the current literature and the two delimitation models applied here [1,15].

GMYC Results
The Sardinella species analyzed here comprise 10 maximum likelihood entities applying the COI gene, i.e. 10 hypothetical lineages, and 6 entities applying the 12S gene (Figure S7 and Figure S8).The number of maximum likelihood clusters was 9 for COI and 5 for 12S (no. of clusters = no. of entities, except for those composed of only 1 individual, which in this case was one).The analysis using CO1 sequences was significant (p-value = 0.001, with P < 0.05 indicating that the null hypothesis of that is only one species is rejected).However, the analysis using the 12S sequences was not significant (p-value = 0.158), indicating that the null hypothesis of that is only one species is accepted.The lack of significance of the 12S analysis can be due to the low samples number available, which show the need of more work increasing the current database of genetic resources for fish species.

PHYLOGENETIC IMPLICATIONS
Our phylogenetic inference of the complete mitochondrial genomes grouped the species correctly in their taxonomic and evolutionary positions.The phylogenetic trees could support the monophyly of 11 orders and 36 families assembled here.Perciformes, Acanthuriformes and Tetraodontiformes were split as polyphyletic orders.Two families of the order Beloniformes (Hemiramphidae and Belonidae) were identified as polyphyletic and included sequences obtained in this study.The Lotidae family was also recovered as polyphiletic, but only included sequences retrieved from GenBank.
We did not consider the polyphlyletic structure of Perciformes, Tetraodontiformes and Acanthuriformes as a methodological error in our analysis, in light of previous research that have likewise highlighted their inability to recover monophyly or their controversial taxonomic status.The two orders belong to the series Eupercaria, the largest series of percomorphs, being among the most diverse of the series [16].Perciformes is the largest order in Eupercaria and is polyphyletic, according to the Eschmeyer's Catalog of Fishes updated classification, with 19 families classified as sedis mutabilis, which means that the classification may change with further studies and accurate data (Fricke et al., 2022).However, studies applying a large multilocus nuclear dataset is recovering the order as monophyletic [17,18] and the current status of the order remain controversial.The interrelationships in Tetraodontiformes lineages are also known to be controversial, but the order is classified as monophyletic in strongly supported works (Betancur-R et al., 2017;Near et al., 2013).However, the polyphyly seen in this study for mitochondrial genome is also supported in previous research, since not only mitochondrial genes has been failing in recover monophyly in the order, but the whole mitochondrial genome itself [19,20].The monophyly of Acanthuriformes has also been questioned before, both by studies applying multilocus datasets associated with morphological synapomorphies (Betancur-R et al., 2017) or analyzing sequences from only five nuclear genes [21].We therefore found evidence based on whole mitochondrial genomes that corroborate with previous research.
The recovery of the Hemiramphidae and Belonidae families as non-monophyletic groups was also not considered as an error in our analysis.The needlefishes (Belonidae) and the sauries (Scomberesocidae) form the Scomberesocoidea superfamily [22,23], while the superfamily Exocoetoidea includes the flyingfishes (Exocoetidae), halfbeaks (Hemiramphidae) and viviparous halfbeaks (Zenarchopteridae) [24].However, the monophyly of needlefishes (Belonidae) and halfbeaks (Hemiramphidae) has been previously refuted [25], based on a molecular phylogeny of 54 beloniform species recovered using fragments of two nuclear and two mitochondrial genes.In this topology, the observed polyphyly of Belonidae occurred due to the inclusion of sauries, which belong to the family Scomberesocidae, and the Zenarchopteridae was recovered as sister species of needlefishes/sauries. On the other hand, the Hemiramphidae family was even more problematic, including three main clades: Zenarchopteridae, the Hemiramphus clade and the Hyporhamphus/Arrhamphus clade.Although our results also recovered the inclusion of sauries in the Belonidae family, its polyphyly was also due to the inclusion of the Zenarchopteridae clade.As for the Hemiramphidae family, it was considered nonmonophyletic in our analysis due to the clustering of species belonging to the Exocoetidae family.Despite some incongruences found between our results and the ones found by Lovejoy et al. (2004), it is likely that some beloniform families still lack a well-established taxonomic framework to support its monophyly.Our phylogenetic inference also sheds light into the taxonomic classification of Sardinella species, as discussed on the next section.
The phylogenetic inference of the COI gene also raises an important taxonomic discussion in the Mullidae family.The Argentine goatfish (Mullus argentinae) is a demersal fish with a narrow distribution from Argentine to northeastern Brazil (Fricke et al., 2022).Here, our M. argentinae collected in Southeastern Brazil (São Paulo state) clustered with the individuals collected in Argentina, but not with other individuals from the Southern Brazil (Santa Catarina state) (Figure S11), which may be an indication that the Argentine goatfish comprises cryptic species.

Figure S3 .
Figure S3.Phylogenetic inference based on the Cytochrome oxidase I (COI) sequences extracted from the mitochondrial genomes captured here, performed to confirm the identity of our mitochondrial genomes.Our sequences are highlighted in red and bootstrap values are displayed on each branch.

Figure S4 .
Figure S4.Phylogenetic inference of Umbrina and Argyrosomus 16S ribosomal RNA gene sequences obtained here (in red) and retrieved from GenBank database.The branch that supports the Umbrina canosai 16S rRNA sequence captured and the sequence retrieved form GenBank has zero-length, indicating nearly identical sequences.This analysis confirmed the identity of our Umbrina canosai sample (A1), which was not possible in our COI phylogeny.

Figure S6 .
Figure S6.Representation of the recursive and initial partitions for Sardinella (a) CO1 and (b) 12S analyses.The x-axis represents the maximum intraspecific divergence P calculated for the molecular operational taxonomic units delimited at each recursive partition (red) and initial partition (yellow), and the y-axis represents the number of groups (i.e.hypothetical species) delimited at each partition.

Figure
Figure S8 Lineages delimited in the GMYC analyses of the (a) COI and (b) 12S genes.

Figure S9 .
Figure S9.Phylogenetic inference of the Sardinella Cytochrome oxidase I (COI) sequences available in GenBank database and the ones captured here, in red.The Sardinella Sardinella subgenus is highlighted in purple color.

Figure S10 .
Figure S10.Phylogenetic inference of the Sardinella mitochondrial genomes available in GenBank database and the ones captured here, in red.The Sardinella Sardinella subgenus is highlighted with purple color.

Figure S11 .
Figure S11.Phylogenetic inference of the Mullus Cytochrome oxidase I (COI) gene sequences available in GenBank and BOLD databases and the one captured here, in red.From the 7 sequences of the species Mullus argentinae available in BOLD database, 5 of them were obtained from samples collected in Argentina (FARG547-08, FARG545-08, FARG546-08, FARG548-08 and FARG549-08), and are clustered with the sequence captured here, obtained from a sample collected in southern Brazil (state of Santa Catarina).The other 2 BOLD sequences of M. argentinae (MFSP1895-11 and MFSP1984-11) were obtained from samples collected in southeast Brazil (state of São Paulo).

Table S4 .
Summary of corrections from Pilon pipeline.

Table S5 .
Similarity analysis on Basic Local Alignment Search Tool -nucleotide (BLASTn).

Table S6 .
Overall summary about the completeness of the current database, explaining our contribution to it.

Table S8 .
Lineages delimited by GMYC analyses of Sardinella 12S sequences, including the individuals that compose each one.