Siberian larch (Larix sibirica Ledeb.) mitochondrial genome assembled using both short and long nucleotide sequence reads is currently the largest known mitogenome
BMC Genomics volume 21, Article number: 654 (2020)
Plant mitochondrial genomes (mitogenomes) can be structurally complex while their size can vary from ~ 222 Kbp in Brassica napus to 11.3 Mbp in Silene conica. To date, in comparison with the number of plant species, only a few plant mitogenomes have been sequenced and released, particularly for conifers (the Pinaceae family). Conifers cover an ancient group of land plants that includes about 600 species, and which are of great ecological and economical value. Among them, Siberian larch (Larix sibirica Ledeb.) represents one of the keystone species in Siberian boreal forests. Yet, despite its importance for evolutionary and population studies, the mitogenome of Siberian larch has not yet been assembled and studied.
Two sources of DNA sequences were used to search for mitochondrial DNA (mtDNA) sequences: mtDNA enriched samples and nucleotide reads generated in the de novo whole genome sequencing project, respectively. The assembly of the Siberian larch mitogenome contained nine contigs, with the shortest and the largest contigs being 24,767 bp and 4,008,762 bp, respectively. The total size of the genome was estimated at 11.7 Mbp. In total, 40 protein-coding, 34 tRNA, and 3 rRNA genes and numerous repetitive elements (REs) were annotated in this mitogenome. In total, 864 C-to-U RNA editing sites were found for 38 out of 40 protein-coding genes. The immense size of this genome, currently the largest reported, can be partly explained by variable numbers of mobile genetic elements, and introns, but unlikely by plasmid-related sequences. We found few plasmid-like insertions representing only 0.11% of the entire Siberian larch mitogenome.
Our study showed that the size of the Siberian larch mitogenome is much larger than in other so far studied Gymnosperms, and in the same range as for the annual flowering plant Silene conica (11.3 Mbp). Similar to other species, the Siberian larch mitogenome contains relatively few genes, and despite its huge size, the repeated and low complexity regions cover only 14.46% of the mitogenome sequence.
Since the first plant mitogenome of the common liverwort has been sequenced , not as many mitogenomes were sequenced and assembled in plants as in animals. Although more than 9000 complete mitogenomes have been deposited in NCBI Genbank (July 2020), only 235 of them belong to plants (https://www.ncbi.nlm.nih.gov/genome), and the vast majority of these mitogenomes are from angiosperms. To date, the complete assembly and accurate annotation of plant mitogenomes remain challenging due to their complex and often perplexing structure.
Mitogenome size in seed plants can vary by at least one order of magnitude ranging from ~ 222 Kbp in Brassica napus  and ~ 316 Kbp in Allium cepa  to ~ 3.9 Mbp in Amborella trichopoda  and a striking ~ 11.3 Mbp in Silene conica . Such dispersion may be explained by the abundance of noncoding and repeated elements . In contrast to the relatively compact and tightly packed genomes of most animal mitochondria (~ 15–20 Kbp) , plant mitogenomes are enriched with introns, intergenic sequences, repetitive and mobile elements , and show a wide diversity in gene content and genomic architecture [9, 10], although coding sequences are relatively conserved in the core mitochondrial genes [11,12,13].
Both plant and animal mitogenomes were used to be considered as a single circular chromosome, and until recently, this model predominated in mitogenome assembly strategies. Today, however, substantial evidence has been accumulating in support of the notion that these genomes possess more complex and multifarious structure. For several plant species, the multi-chromosomal structure of mitogenome has been shown . For instance, the mitogenome of cucumber Cucumis sativus can be assembled only into three circular chromosomes of 1.6 Mbp, 84 Kbp, and 45 Kbp in size, respectively , and the mitogenome of Silene noctiflora (6.7 Mbp) is fragmented into more than 50 circular chromosomes . Interestingly, mitogenomes of numerous vascular plants exhibit presence of alien sequences received either by means of intracellular gene transfer from plastid [16, 17] or horizontal gene transfer from mitochondria of other species . Most of these foreign sequences are noncoding or pseudogenes, though sometimes they occur in coding regions . Conversely, gene transfer from mitogenome to plastid is considered to be rare and has been reported only in a few cases [19,20,21,22]. Moreover, genes of plant mitochondria are commonly considered to evolve more slowly than plastidial or nuclear ones, while the mutation rate in coding regions of plant mitogenomes is about two orders of magnitude lower than in genes of animal mitochondria [23,24,25]. With that being said, it has been reported that the rate at which plant mitogenomes accumulate substitutions can substantially vary both between genes and species [26, 27].
The utility of mitogenome sequences as a source of genetic markers has been extensively documented and is thus unquestionable [28, 29]. Many mitochondrial genes such as atp1, cob, cox1, cox2, and cox3 are widely used to resolve phylogenetic relationships between lineages, conduct biodiversity analyses and construct phylogeographic and evolutionary history of species. Furthermore, the maternal inheritance of mitochondria in plants  as well as the large number of copies per cell strengthened their use for various applications .
The number of evolutionary and systematic studies based on comparative analysis of complete plant mitogenome sequences has significantly increased in the last decade, notably due to the advent of better sequencing methods. However, there are still very little published comparative mitogenome studies in gymnosperms, one of the oldest among the major plant clades comprising 14 families with more than 1000 species. Among all plant mitochondria genomes available, few gymnosperm mitogenomes have been fully assembled and properly annotated: Cycas taitungensis (415 Kbp) , a ginkgo tree Ginkgo biloba (347 Kbp) , a gnetophyte Welwitschia mirabilis (979 Kbp) , the Japanese yew Taxus cuspidata (469 Kbp)  and six conifer species (the Pinaceae family), such as white spruce – Picea glauca (5.9 Mbp) , Norway spruce – Picea abies (4.3 Mbp  or 4.9 Mbp ), loblolly pine – Pinus taeda (1.25 Mbp  or 1.19 Mbp [https://www.ncbi.nlm.nih.gov/nuccore/NC_039746.1/]), sugar pine – Pinus lambertiana (3.9 Mbp) , Scots pine – Pinus sylvestris (986 Kbp [https://www.ncbi.nlm.nih.gov/assembly/GCA_900143225.1/]), and Siberian larch – Larix sibirica (11.7 Mbp) presented here.
In this study, we report on the sequencing, assembly, and annotation of the mitogenome of Siberian larch, a key species for the Siberian boreal forest and ecosystem. Furthermore, several methods were employed to verify and evaluate the correctness of the assemblies.
Sequencing and preliminary assembly using short paired-end (PE) and mate-pair (MP) Illumina reads
Based on ~ 19.7 million of paired-end (PE) nucleotide reads produced by Illumina HiSeq 2000 using mitochondrial DNA (mtDNA) enriched samples and ~ 625.5 million of mate-pair (MP) nucleotide reads produced by Illumina HiSeq 2000 for the whole nuclear genome assembly  a preliminary draft assembly with total length of 545 Mbp was generated using the CLC Assembly Cell and the BESST software. After mapping the draft assembly contigs to mitogenomes of seed plants and filtering out contamination with nuclear and plastid contigs, the Siberian larch mitogenome was assembled into 53 scaffolds with total length of 11.09 Mbp.
Final hybrid re-assembly using both long MinION and short paired-end (PE) Illumina reads
To improve the assembly of the Siberian larch mitogenome, we used the same PE reads as in the preliminary CLC assembly generated by Illumina HiSeq 2000 and long reads generated by Oxford Nanopore MinION and MaSuRCa v.3.2.8 pipeline. MP reads were used only for the preliminary CLC assembly, but not for the MaSuRCa assembly. After mapping the MaSuRCa assembly to plant mitochondrial database, we finally assembled nine mitochondrial contigs resulting in a total length of 11.7 Mbp. The largest contig was 4,008,762 bp long.
To further evaluate the accuracy of the final assembly, we ran REAPR v1.0.18 , which reported 92.13% of error free bases in the Siberian larch mitogenome assembly (Additional file 1). It is comparable to 86% for the GRCh37 human reference genome or 90.3% for C. elegans genome.
We used RepeatModeler and TEclass to discover and classify highly repetitive elements (REs) in the assembled contigs, respectively. RepeatModeler found 122 mobile elements, among which TEclass identified 27 DNA transposons, 38 long terminal repeats (LTRs), 12 long interspersed nuclear elements (LINEs), and 9 short interspersed nuclear elements (SINEs). In total, 7691 REs were identified using a combined repeat library, which represent 11.14% of the 11.7 Mbp mitogenome assembly. LTRs, LINEs, simple repeats, and class II DNA transposons were the most abundant among classified repeats (4.05, 2.05, 0.80, and 0.66%, respectively; Fig. 1, Table 1). In total, interspersed repeats composed 7.9% of Siberian larch mitogenome.
LTRs were represented mainly by Gypsy, Copia, DIRS, and Gymny families. The latter was previously found in pine genome and is related to Athila elements in Arabidopsis . Among non-LTR retrotransposons SINE, I, and Penelope together comprise about 20.56% of the non-LTRs, which is 0.53% of the entire mitogenome. RepeatMasker found three Penelope elements, all with relatively high score and length between 57 and 813 bp. The Penelope-like elements are considered to be common in animals but were also found recently in conifers, particularly in loblolly pine genome . It was proposed that these elements were transferred to a conifers’ ancestor approximately 340 mya .
Class II DNA transposons composed 0.66% of the mitogenome assembly and were represented by EnSpm, Harbinger, hAT, Helitron, MuDR, TcMar, and Maverick (Polinton) repeat families (Table 1). Only one 18 bp long Polinton element was found in the mitogenome assembly. Although Polintons are not typical for plant genomes, some Maverick-related elements were found in the cytoplasm and mitochondria of some plants and fungi [44, 45].
Search for plasmids
Only a small 42 bp long insert matching plasmid sequence identified in mitochondria of Picea abies (L.) Karst (NCBI GenBank accession number AJ225562) was found in contig 1 in the 608,296–608,337 position, yet with 100% identity (Table 2). Among 891 broad host range plasmids (NCBI GenBank taxid:36549), no plasmid-like inserts were identified in the contigs using megablast search for highly similar sequences, and only a few rather short plasmid-like inserts were identified by discontiguous megablast used to search for more dissimilar sequences and by blastn used to search for somewhat similar sequences counting in total for 730 bp and 12,940 bp, respectively (Table 2). In total, considering somewhat similar sequences, plasmid-like insertions represented only 0.11% of the entire mitogenome.
Annotation of the Siberian larch mitogenome identified a set of rRNA, tRNA, and protein-coding genes that are typical for gymnosperms: atp1, atp4, atp6, atp8, atp9, ccmB, ccmC, ccmFc, ccmFn, cob, cox1, cox2, cox3, matR, mttB, nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad7, nad9, rpl2, rpl5, rpl16, rps1, rps2, rps3, rps4, rps7, rps10, rps11, rps12, rps13, rps14, rps19, sdh3, and sdh4. Two copies of atp8 were found.
Plastid-derived DNA sequences (MTPTs) were searched by comparing the mtDNA against the chloroplast DNA (cpDNA) of Siberian larch . Twenty MTPTs with total length of 43,951 bp were found in five mitochondrial contigs (Table 3).
RNA sequencing and C-to-U RNA editing
The transcriptome data used to verify RNA editing sites are provided in Table 4.
Among 438 RNA editing sites predicted by the PREP-Mt program almost 90% were also verified by aligning RNA-seq reads to the annotated mitochondrial genes and then by calling variants using the CLC Genomic Workbench. Many additional sites were found only by variant-calling. They were likely specific for Siberian larch and included in the final annotation. In total, 864 C-to-U RNA editing sites were found for 38 among 40 protein-coding genes. The RNA editing was not found only for two genes, nad1 and nad4. Maximum number of the editing sites (80) was observed for the nad5 gene. Editing the RNA helped us identify start codons in atp6, rps3, rps10, rps19, and mttB genes and stop codons in nad4L, sdh3, and atp9 genes.
The whole genome sequencing data were used in the previous studies of conifer mitogenomes, while sequencing and assembly of Cycas, Ginkgo, and Welwitschia mitogenomes were carried out specifically using enriched mtDNA. Our study benefits from both approaches by simply having more mitochondrial reads, but mtDNA enrichment was not as critical as having long reads.
Almost all studies that used whole genome sequencing data followed similar strategies to assemble and verify putative mitochondrial sequences. They separated mitochondrial sequences from nuclear and chloroplast by their length, depth of coverage, and GC content [35, 36, 38], additionally checking for contamination by aligning obtained sequences either to the NCBI GenBank nucleotide (nt) database  or to custom database of plant mitogenomes . Annotation process for abovementioned species included BLAST search against database of homologous genes from available plant mt genomes [32, 36] or use of MAKER and Prokka software . For identification of tRNAs, tRNAscan-SE, and ARAGORN were commonly used. To analyze repeats such programs as REPuter, RepeatMasker, RepeatModeler, and Tandem Repeat Finder were among most used. The combined machine learning and in silico enrichment of mitochondrial-like long reads were also used to assemble the mitogenome of Norway spruce .
Gene content reported for available annotated mitogenomes varies noticeably as well as the amount of repeats, showing no correlation between genome size, number of genes and repeats (Table 5).
Mitochondrial genomes often contain large amounts of plastid-derived DNA sequences (MTPTs). Analysis of the first gymnosperm mitochondrial genome of Cycas taitungensis and 10 other plants revealed that first DNA transfer from cpDNA to mtDNA occurred at least as far back as to the common ancestor of extant gymnosperms and angiosperms, about 300 mya . Later, an extended analysis of MTPTs among 73 plant species was carried out, however, only one gymnosperm mitogenome was used there . Mitogenome of Siberian larch also contained MTPTs. The longest MTPT is 10 Kbp long and contains the plastid genes trnL, trnT, rps4, trnS, ycf3, psaA, and psaB (Table 3). The discovered MTPTs would facilitate further studies of conifer mitogenomes and the understanding of organelle genome evolution.
The reasons for extremely large size of mitogenomes in plants are still not fully understood, but at least can be partly explained by a variable number of mobile genetic elements, introns and plasmid-related sequences  and could be affected by different factors, such as proliferation of retrotransposons, generation of repetitive DNA by recombination, and transferring of foreign sequences from plastid or nuclear genomes or via horizontal exchange of mitochondrial DNA (see  for discussion). However, it is unlikely that plasmids contributed much into the Siberian larch mitogenome size as we found relatively few plasmid-like insertions, representing only 0.11% of the entire mitogenome. Therefore, it seems that identified plasmids cannot explain much of that “dark matter” mtDNA.
As part of the genome annotation process, we inferred RNA editing sites for protein-coding genes. The number of RNA editing sites predicted by PREP-Mt for the protein-coding genes of the Siberian larch mitogenome was within the range predicted with the same cut-off 0.2 in other gymnosperms, but varied greatly between different species from 225 in Welwitschia to 1102 in Taxus, 1179 in Pinus, 1206 in Cycas, and 1306 in Ginkgo (see Table 1 in . The number of RNA editing sites correlated neither with mitogenome size nor with GC content, but more data are needed to make stronger conclusions.
In the present study, we aimed to produce a high-quality assembly of the Siberian larch mitogenome. The combination of mtDNA enrichment with short and long reads allowed us to obtain a genome assembly, which consists of 9 contigs with a total length of 11.7 Mbp. Our assembly does not consist of one sequence, which is called the master ring, but many recent publications showed that presentation of plant mitogenome as a single circular molecule is not accurate due to its complexity and dynamic mixture of mtDNA forms within a single plant .
To facilitate mitogenome assembly, DNA was collected from a fraction enriched with mitochondria, however, such method did not allow the obtention of highly purified high molecular weight (HMW) mtDNA needed for long read sequencing. Moreover, the obtained nucleotide sequences contained a significant contamination from nuclear and chloroplast genomes. It appears that the protocol we used to isolate mtDNA from conifer needles would require further adjustments. Such improvements would likely focus on the damaging effect of larch phenolic compounds on mitochondrial membranes during the homogenization step, which would thus increase the amount of intact organelles. In turn, this should result in a larger yield of intact mtDNA molecules. To achieve this, it seems necessary to (i) shorten the mitochondrial isolation procedure as much as possible and (ii) find a better buffer composition for both extraction and washing media. Note that the use of fresh material instead of cryo-preserved one would also likely contribute to a better isolation procedure.
However, it is very important to notice that even with the best possible protocol for mtDNA enrichment and intact mtDNA molecules, it would still be very hard (if not currently impossible) to verify whether a particular large plant mitogenome assembly represents a single circular molecule or not. All subgenomic circles could theoretically be detected, if (and only if) extremely long reads comparable with the sizes of these subgenomic circles would be generated, which is still not feasible considering the huge length of conifer mitogenome. For unambiguous alternative assemblies reads should be as long as several Mbp. However, even though mitogenome would be assembled as a single circular structure, this would not guarantee its circular nature. Additionally, there could be alternative physical structures, which can all produce circular genome maps - such as head-to-tail concatamers and circularly permuted linear molecules (see Fig. 1 in ).
Mitogenomes of conifer species are still poorly studied - only five genomes have been published to date. The sizes of published mitogenomes vary widely: from 986 Kbp in Scots pine (https://www.ncbi.nlm.nih.gov/assembly/GCA_900143225.1) to 11.7 Mbp in Siberian larch (this study). We succeeded in producing a high-quality assembly of the Siberian larch mitogenome using both short and long nucleotide sequencing reads generated by the Illumina HiSeq 2000 and MinION. The final assembly of the Siberian larch mitogenome consists of 9 contigs with a total length of 11,662,539 bp (N50 = 3,031,766 bp). The longest contig is 4,008,762 bp, the shortest - 24,767 bp. Finally, 40 protein-coding, 34 tRNA, and 3 rRNA genes were annotated. This mitogenome is currently the largest one among publicly known. The assembled genome is of sufficient quality for further detailed studies and comparative analyses with other plant mitogenomes.
Mitogenome of Siberian larch was assembled from DNA sequences obtained from needles collected from a reference Siberian larch tree used in the de novo whole genome sequencing project  using two approaches. First, the total genomic DNA was isolated following mtDNA enrichment through isolation and purification of mitochondria. This DNA was used for sequencing on HiSeq 2000 platform (Illumina, Inc., San Diego, California, USA). Second, we isolated the HMW total genomic DNA to construct MP libraries for Illumina sequencing and to obtain long reads using MinION (Oxford Nanopore Technologies, Inc., Oxford, United Kingdom).
MtDNA enrichment through isolation and purification of mitochondria
To enrich total DNA with mtDNA the following mitochondria isolation protocol was used. The Siberian larch needles were stored either at 4 °C for 4–13 days or at − 80 °C for 10–12 months after they were collected from a reference Siberian larch tree . For long term storage, the needles were placed in a cryopreservation medium (10 mM MOPS, 5% DMSO, 5% glycerol, pH 7.4). The isolation procedure was based on the protocol reported in Sullivan et al. , with a few amendments as hereafter described. Before the mitochondria isolation procedure, the cryopreserved needles were thawed for 40–50 min by immersing them in warm water followed by the treatment for 1 min with 96% ethanol. Then, they were washed twice with distilled water. The weight of the plant material used was about 20 g. Further, the needles were cut into 1 cm pieces. Needles were homogenized in portions of 7 g each in 50 ml of chilled extraction medium (EM) containing 0.35 M mannitol, 30 mM MOPS, 1.25 mM EGTA, 2 mM sodium metabisulfite, 2.5 mM MgCl2, 0.2% (w/v) BSA, 0.3% (w/v) PVP-40, 0.3% (w/v) PVPP, 3 mM DTT, 3 mM cysteine, pH 7.4. Homogenization was carried out in a blender at 4 °C in series of 9 times for 3 s each and 3 times for 10 s each, followed by the grinding in a cooled mortar using pestle for 50–60 s. The homogenate was filtered through 4 layers of gauze and 2 layers of 50 μm nylon gauze. Then, 50 ml of EM were added to the homogenate while filtering (the total weight ratio of the needles to EM volume at the end of the homogenization was 1:16). The homogenate was centrifuged at 2500 g in F0685 Beckman Rotor for 5 min. The supernatant was centrifuged at 5000 g for 10 min. The pellet was discarded. The supernatant was centrifuged again at 13,000 g for 15 min. The pellet was resuspended in 100 ml of the washing medium (WM): 0.3 M mannitol, 30 mM MOPS, 0.2% (w/v) BSA, pH 7.4. The resulting suspension was centrifuged at 13,000 g for 15 min. The pellet was resuspended in 25 ml of the WM. The suspension was then centrifuged at 13,000 g for 15 min. The pellet was resuspended in 1 ml of the WM. The resulting suspension was centrifuged at 11,000 g for 10 min (HL 081 F45–24-11 Rotor) using the 5415R centrifuge (Eppendorf AG, Hamburg, Germany). The final mitochondrial pellet was resuspended in the WM without BSA in a 1: 1 weight/volume ratio.
MtDNA extraction from isolated mitochondria
To isolate mtDNA, 3 μl of DNase from Sigma-Aldrich (St. Louis, Missouri, USA) (1 mg/1 ml) and 3 μl of 1 M MgCl2 were added to 200 μl of the final mitochondrial suspension. The mixture was incubated for 20 min at 25 °C. Then, 1 ml of 0.3 M mannitol, 30 mM MOPS, 10 mM EDTA, 10 mM EGTA, pH 7.4 was added. The resulting mixture was centrifuged at 11,000 g for 5 min. The pellet was resuspended in 1 ml of 0.3 M mannitol, 30 mM MOPS, pH 7.4 and centrifuged again at 11,000 g for 5 min. Then, DNA was isolated from the resulting pellet using DNeasy Plant Mini Kit from QIAGEN (Hilden, Germany) according to the manufacturer’s protocol.
DNA isolation for long reads
The isolation of high-quality HMW DNA for further sequencing on MinION was carried out using the modified CTAB-method . Fresh needles (0.05–0.07 g per tube) were placed in 700 μL pre-warmed (65 °C) 2% CTAB isolation buffer (2% CTAB 1.4 M NaCl, 100 mM Tris pH 8.0, 20 mM EDTA) with the addition of 120 μl Proteinase K from Bioron GmbH (Römerberg, Germany). The mixture was transferred to a 1.5-mL microcentrifuge tube and incubated at 65 °C for 2 h. Then, 4 μl RNase was added to the mixture and incubated for 30 min at 65 °C. DNA was purified once with phenol-chloroform (1:1) and twice with one volume chloroform and then DNA was precipitated with one volume isopropanol. The obtained pellet was washed with 70% EtOH, dried, and dissolved in 50 μL TE buffer. Then, DNA solution from 12 tubes was combined in a single tube and concentrated in 100 μL TE buffer using AMPure XP beads (Beckman Coulter Inc., Brea, CA, USA) at 2:1 ratio. Resulting DNA solution was run on a 1% low-melting temperature agarose gel in 0.1% TAE buffer, and a HMW fraction was excised with a sterile razor blade, placed in a microcentrifuge tube, frozen at − 70 °C, and then melted. The frosting-melting cycle was repeated three times, and after the third thawing, the tube was centrifuged, and the aqueous layer was transferred to a new tube. Then, phenol-chloroform purification and precipitation with isopropanol was performed one again. DNA purity was examined with absorbance ratios using NanoPhotometer P300 Spectrophotometer (Implen GmbH, München, Germany), and concentration was measured using Qubit Fluorimeter (Thermo Fisher Scientific Inc., Waltham, Massachusetts, USA). Finally, the Ligation Sequencing kit 1D (SQK-LSK108) was used to prepare library for the Nanopore sequencing. Elution of DNA from agarose gel was necessary to reduce contaminants co-purification together with DNA from larch needle samples. Skipping this step always led to very fast (in 15–20 min) dying of the population of nanopores on a flow cell.
All sequencing was performed in the Laboratory of Forest Genomics (Genome Research and Education Center, Siberian Federal University, Krasnoyarsk, Russia). The PE library with the mean insertion size of 700 bp was prepared from the enriched mtDNA using the Illumina TruSeq DNA LT Sample Prep Kit according to the Illumina TruSeq DNA Sample Preparation Guide (Illumina Inc., San Diego, CA). Three MP libraries were prepared from the total non-enriched DNA using Nextera Mate Pair Library Prep Kit (Illumina Inc., San Diego, CA). These PE and 3 MP libraries were sequenced with 2 × 100 cycles on the HiSeq 2000 platform using the Illumina TruSeq SBS Kit v3 (Illumina Inc., San Diego, CA). More detailed data on Illumina libraries and obtained reads are provided in Table 6.
The total non-enriched DNA library was sequenced on the MinION (Oxford Nanopore Technologies Inc., Oxford, United Kingdom) with use of R9 FlowCells (FLO-MIN106).
RNA isolation, sequencing and assembly
RNA was isolated from autumn buds of a reference Siberian larch tree  using the Qiagen RNeasy Mini Kit (Qiagen, Hilden, Germany). The RNA-seq library was prepared using the TruSeq RNA Sample Preparation Kit v2 (Illumina Inc., San Diego, CA). In brief, mRNA was purified and fragmented followed by cDNA synthesis. Then, end repair, adapter ligation, and size selection using AMPure XP beads were done. The cDNA was PCR amplified. The sequencing of the obtained library was carried out on the MiSeq platform with the Illumina MiSeq Reagent Kit v2 (2 × 150-cycles) (Illumina Inc., San Diego, CA). The RNA reads were trimmed using Phred quality score 30 and base call accuracy 99.9% using FastQC v. 0.11.5 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). The de novo transcriptome assembly was generated using Trinity v. 2.8.4 (https://github.com/trinityrnaseq/trinityrnaseq/wiki).
Preliminary assembly using Illumina PE and MP short reads
The quality of Illumina reads was assessed using FastQC v. 0.11.5 . Adapter sequences were trimmed and short reads were filtered using Trimmomatic v. 0.36  with minimum quality of 19 and minimum length of 35 bp. Data on the PE and MP libraries and sequencing reads are presented in Table 6.
The obtained PE and MP sequence reads were assembled de novo into contigs using the CLC Assembly Cell v. 5.0.0 . Scaffolding was performed using BESST . Gap closing was done using Sealer (https://github.com/bcgsc/abyss/tree/master/Sealer). The resulting assembly of Illumina reads consisted of 1,216,421 contigs with a total length of 516,8 Mbp, N50 = 436 bp and the longest contig 1,449,395 bp.
Hybrid assembly using both Illumina short PE and MinION long reads
The basecalling and the quality evaluation for MinION reads were performed using Albacore . After trimming and filtering, the average read length was 4523 bp, the average quality was 9.1, the total length - 6.2 Gbp, and the longest read - 77,840 bp.
The hybrid assembly using long MinION reads and short PE Illumina reads was carried out using MaSuRCA v. 3.2.8 . This hybrid assembly consisted of 10,909 contigs with a total length of 55,3 Mbp, N50 = 7516 bp and the longest contig 4,008,762 bp.
To mine mitochondrial contigs from this assembly, the BLAST search against all mitochondrial plant sequences available in the NCBI GenBank was used. After mapping this assembly to plant mitochondrial database, we finally assembled 9 mitochondrial contigs with total length of 11.7 Mbp. The largest contig was 4,008,762 bp long.
Hybrid assembly evaluation
To further evaluate the accuracy of the hybrid assembly we ran REAPR v1.0.18 . It uses paired reads mapping information to search for low or exceedingly high coverage regions, mis-orientation of read pairs, high local SNP densities, and correlated SNPs to indicate mis-assemblies and collapsed repeats. The program scans for the four types of error: a region with or without a gap that triggered an FCD error and a region with low fragment coverage that does or does not contain a gap.
Mitogenome of Siberian larch was checked for homology with other plant mitogenomes existing in the NCBI GenBank database using the NCBI BLAST algorithm. Mitofy  was also used for Siberian larch mitogenome annotation, but only 11 genes were found with this tool. The tRNA genes were discovered using ARAGORN  and tRNAscan-SE  tools. Ribosomal RNA (rRNA) were annotated using RNAmmer .
RNA editing sites were predicted using PREP-Mt  with a cutoff value of 0.2. They were verified by aligning RNA-seq reads to the annotated mitochondrial genes and calling variants with the CLC Genomic Workbench.
To search for MTPTs we used CLC Genomics Workbench whole genome alignment tool and Siberian larch chloroplast genome sequence .
Repetitive element (RE) analysis
RepeatModeler v.1.0.11  was used to search for REs in the mitogenome assembly. TEclass online service  was employed to classify unknown repeated elements from de novo RE library generated by RepeatModeler. In addition to this RepeatModeler derived library and RepBase library , MIPS Repeat Element Database library , Custom Plant Repeat Database , and Pine Interspersed Repeats Resource library PIER v1.0 [38, 70] were used to generate a combined repeat library to run RepeatMasker v. 4.0.6 . Custom R script was applied to parse RepeatMasker results according to RepBase classification .
Search for plasmids
To test whether the Siberian larch mitogenome contains plasmid-like sequences all nine contigs assembled in this study were blasted against 1) a linear plasmid sequence identified in mitochondria of Picea abies (L.) Karst and deposited at the NCBI GenBank under accession number AJ225562 and 2) 891 broad host range plasmids (NCBI GenBank taxid:36549). The blast was performed with different stringency searching for highly similar sequences (megablast), more dissimilar sequences (discontiguous megablast), and somewhat similar sequences (blastn).
Availability of data and materials
All sequences for nine Siberian larch mitochondrial contigs described in this study are available in the NCBI GenBank under the accession numbers MT797187-MT797195.
Basic local alignment search tool
Bovine serum albumin
- bp :
Carbonyl cyanide 3-chlorophenylhydrazone
Threo-1,4-Dimercapto-2,3-butanediol or DL-Dithiothreitol
Ethylene-bis (oxyethylenenitrilo) tetraacetic acid)
High molecular weight
Kilo base pair
- LINE :
Long interspersed nuclear element
Long terminal repeat
3-(N-Morpholino) propanesulfonic acid
Million years ago
Open reading frame
- SINE :
Interspersed nuclear element
Mitochondrial plastid DNA
- atp1, atp4, atp6, atp8 and atp9 :
Genes for ATP synthase subunits 1, 4, 6, 8 and 9
- ccmC, ccmC, ccmFc and ccmFn :
Genes for cytochrome c biogenesis proteins B, C, FC and FN
- cob :
Gene for cytochrome b
- cox1, cox2 and cox3 :
Genes for cytochrome c oxidase subunits 1, 2 and 3
- matR :
Gene for maturase
- mttB :
transport membrane protein B
- nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad7 and nad9 :
Mitochondrial genes for NADH dehydrogenase subunits 1–7, 9 and 4 L
- psaA :
gene for photosystem I P700 apoprotein A1
- psaB :
gene for photosystem I P700 apoprotein A2
- rpl2, rpl5 and rpl16 :
Genes for ribosomal proteins L2, L5 and L16
- rps1, rps2, rps3, rps4, rps7, rps10, rps11, rps12, rps13, rps14 and rps19 :
Genes for ribosomal proteins S1, S2, S3, S4, S7, S10, S11, S12, S13, S14, and S19
- sdh3 and sdh4 :
Genes for succinate dehydrogenase cytochrome subunits 3 and 4
- trnL :
tRNA gene for leucine
- trnT :
tRNA gene for threonine
- trnS :
tRNA gene for serine
- ycf3 :
gene for photosystem I assembly protein Ycf3
Oda K, Yamato K, Ohta E, Nakamura Y, Takemura M, Nozato N, et al. Gene organization deduced from the complete sequence of liverwort Marchantia polymorpha mitochondrial DNA: a primitive form of plant mitochondrial genome. J Mol Biol. 1992;223(1):1–7.
Handa H. The complete nucleotide sequence and RNA editing content of the mitochondrial genome of rapeseed (Brassica napus L.): comparative analysis of the mitochondrial genomes of rapeseed and Arabidopsis thaliana. Nucleic Acids Res. 2003;31(20):5907–16..
Kim B, Kim K, Yang TJ, Kim S. Completion of the mitochondrial genome sequence of onion (Allium cepa L.) containing the CMS-S male-sterile cytoplasm and identification of an independent event of the ccmFN gene split. Curr Genet. 2016;62(4):873–85.
Rice DW, Alverson AJ, Richardson AO, Young GJ, Sanchez-Puerta MV, Munzinger J, et al. Horizontal transfer of entire genomes via mitochondrial fusion in the angiosperm Amborella. Science. 2013;342(6165):1468–73.
Sloan DB, Alverson AJ, Chuckalovcak JP, Wu M, DE MC, Palmer JD, et al. Rapid evolution of enormous, multichromosomal genomes in flowering plant mitochondria with exceptionally high mutation rates. PLoS Biol. 2012;10(1):e100124.
Smith DR, Keeling PJ. Mitochondrial and plastid genome architecture: reoccurring themes, but significant differences at the extremes. Proc Natl Acad Sci U S A. 2015;112(33):10177–84.
Boore JL. Animal mitochondrial genomes. Nucleic Acids Res. 1999;27(8):1767–80.
Mower JP, Sloan DB, Alverson AJ. Plant mitochondrial genome diversity: the genomics revolution. In: Wendel J, Greilhuber J, Dolezel J, Leitch I, editors. Plant Genome Diversity. Volume 1. Vienna: Springer; 2012. p. 123–44.
Adams KL, Qiu Y-L, Stoutemyer M, Palmer JD. Punctuated evolution of mitochondrial gene content: high and variable rates of mitochondrial gene loss and transfer to the nucleus during angiosperm evolution. Proc Natl Acad Sci U S A. 2002;99(15):9905–12.
Adams KL, Palmer JD. Evolution of mitochondrial gene content: gene loss and transfer to the nucleus. Mol Phylogenet Evol. 2003;29(3):380–95.
Gualberto JM, Newton KJ. Plant mitochondrial genomes: Dynamics and mechanisms of mutation. Annu Rev Plant Biol. 2017;68:225–52.
Christensen AC. Mitochondrial DNA repair and genome evolution. Ann Plant Rev. 2018;50:11–32.
Chevigny N, Schatz-Daas D, Lotfi F, Gualberto JM. DNA repair and the stability of the plant mitochondrial genome. Int J Mol Sci. 2020;21(1):328.
Liberatore KL, Dukowic-Schulze S, Miller ME, Chen C, Kianian SF. The role of mitochondria in plant development and stress tolerance. Free Radic Biol Med. 2016;100:238–56.
Alverson AJ, Rice DW, Dickinson S, Barry K, Palmer JD. Origins and recombination of the bacterial-sized multichromosomal mitochondrial genome of cucumber. Plant Cell. 2011;23(7):2499–513.
Alverson AJ, Wei X, Rice DW, Stern DB, Barry K, Palmer JD. Insights into the evolution of mitochondrial genome size from complete sequences of Citrullus lanatus and Cucurbita pepo (Cucurbitaceae). Mol Biol Evol. 2010;27(6):1436–48.
Wang D, Rousseau-Gueutin M, Timmis JN. Plastid sequences contribute to some plant mitochondrial genes. Mol Biol Evol. 2012;29(7):1707–11.
Gandini CL, Sanchez-Puerta MV. Foreign plastid sequences in plant mitochondria are frequently acquired via mitochondrion-to-mitochondrion horizontal transfer. Sci Rep. 2017;7(1):43402.
Iorizzo M, Grzebelus D, Senalik D, Szklarczyk M, Spooner D, Simon P. Against the traffic: the first evidence for mitochondrial DNA transfer into the plastid genome. Mob Genet Elements. 2012;2(6):261–6.
Straub SCK, Cronn RC, Edwards C, Fishbein M, Liston A. Horizontal transfer of DNA from the mitochondrial to the plastid genome and its subsequent evolution in milkweeds (Apocynaceae). Genome Biol Evol. 2013;5(10):1872–85.
Ma P-F, Zhang Y-X, Guo Z-H, Li D-Z. Evidence for horizontal transfer of mitochondrial DNA to the plastid genome in a bamboo genus. Sci Rep. 2015, 5(1):11608.
Burke SV, Wysocki WP, Zuloaga FO, Craine JM, Pires JC, Edger PP, et al. Evolutionary relationships in Panicoid grasses based on plastome phylogenomics (Panicoideae; Poaceae). BMC Plant Biol. 2016;16(1):140.
Smith DR. Mutation rates in plastid genomes: they are lower than you might think. Genome Biol Evol. 2015;7(5):1227–34.
Mower JP, Touzet P, Gummow JS, Delph LF, Palmer JD, Extensive variation in synonymous substitution rates in mitochondrial genes of seed plants. BMC Evol Biol. 2007;7:135.
Wolfe KH, Li WH, Sharp PM. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc Natl Acad Sci U S A. 1987;84(24):9054–8.
Zhu A, Guo W, Jain K, Mower JP. Unprecedented heterogeneity in the synonymous substitution rate within a plant genome. Mol Biol Evol. 2014;31(5):1228–36.
Lynch M, Koskella B, Schaack S. Mutation pressure and the evolution of organelle genomic architecture. Science. 2006;311(5768):1727–30.
Van de Paer C, Bouchez O, Besnard G. Prospects on the evolutionary mitogenomics of plants: A case study on the olive family (Oleaceae). Mol Ecol Resour. 2018;18(3):407–23.
Preston MD, Campino S, Assefa SA, Echeverry DF, Ocholla H, Amambua-Ngwa A, et al. A barcode of organellar genome polymorphisms identifies the geographic origin of Plasmodium falciparum strains. Nat Commun. 2014;5(1):4052.
Whittle C-A, Johnston MO. Male-driven evolution of mitochondrial and chloroplastidial DNA sequences in plants. Mol Biol Evol. 2002;19(6):938–49.
Smith DR. The past, present and future of mitochondrial genomics: have we sequenced enough mtDNAs? Brief Funct Genomics. 2016;15(1):47–54.
Chaw SM, Shih AC, Wang D, Wu YW, Liu SM, Chou TY. The mitochondrial genome of the gymnosperm Cycas taitungensis contains a novel family of short interspersed elements, Bpu sequences, and abundant RNA editing sites. Mol Biol Evol. 2008;25(3):603–15.
Guo W, Grewe F, Fan W, Young GJ, Knoop V, Palmer JD, et al. Ginkgo and Welwitschia mitogenomes reveal extreme contrasts in gymnosperm mitochondrial evolution. Mol Biol Evol. 2016;33(6):1448–60.
Kan S-L, Shen T-T, Gong P, Ran J-H, Wang X-Q. The complete mitochondrial genome of Taxus cuspidata (Taxaceae): eight protein-coding genes have transferred to the nuclear genome. BMC Evol Biol. 2020;20:–10.
Jackman SD, Warren RL, Gibb EA, Vandervalk BP, Mohamadi H, Chu J, et al. Organellar genomes of white spruce (Picea glauca): assembly and annotation. Genome Biol Evol. 2016;8(1):29–41.
Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin Y-C, Scofield DG, et al. The Norway spruce genome sequence and conifer genome evolution. Nature. 2013;497(7451):579–84.
Sullivan AR, Eldfjell Y, Schiffthaler B, Delhomme N, Asp T, Hebelstrup KH, Keech O, Öberg L, Møller IM, Arvestad L, Street NR, Wang X-R. The mitogenome of Norway spruce and a reappraisal of mitochondrial recombination in plants. Genome Biol Evol. 2020;12(1):3586–98.
Neale DB, Wegrzyn JL, Stevens KA, Zimin AV, Puiu D, Crepeau MW, et al. Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies. Genome Biol. 2014;15(3):R59.
Stevens KA, Wegrzyn JL, Zimin A, Puiu D, Crepeau M, Cardeno C, et al. Sequence of the sugar pine megagenome. Genetics. 2016;204(4):1613–26.
Kuzmin DA, Feranchuk SI, Sharov VV, Cybin AN, Makolov SV, Putintseva YA, Oreshkova NV, Krutovsky KV. Stepwise large genome assembly approach: A case of Siberian larch (Larix sibirica Ledeb.). BMC Bioinformatics. 2019;20(Suppl. 1):37.
Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 2013;14(5):R47.
Morse AM, Peterson DG, Islam-Faridi MN, Smith KE, Magbanua Z, Garcia SA, et al. Evolution of genome size and complexity in Pinus. PLoS One. 2009;4(2):e4332.
Lin X, Faridi N, Casola C. An ancient trans-kingdom horizontal transfer of Penelope-like retroelements from arthropods to conifers. Genome Biol Evol. 2016;8(4):1252–66.
Pritham EJ, Putliwala T, Feschotte C. Mavericks, a novel class of giant transposable elements widespread in eukaryotes and related to DNA viruses. Gene. 2007;390(1–2):3–17.
Haapa-Paananen S, Wahlberg N, Savilahti H. Phylogenetic analysis of Maverick/Polinton giant transposons across organisms. Mol Phylogenet Evol. 2014;78:271–4.
Bondar EI, Putintseva YA, Oreshkova NV, Krutovsky KV, et al. BMC Bioinformatics. 2019;20(S1):38.
Unseld M, Marienfeld JR, Brandt P, Brennicke A. The mitochondrial genome of Arabidopsis thaliana contains 57 genes in 366,924 nucleotides. Nat Genet. 1997;15:57–61.
Goryunov DV, Goryunova SV, Kuznetsova OI, Logacheva MD, Milyutina IA, Fedorova AV, Ignatov MS, Troitsky AV. Complete mitochondrial genome sequence of the “copper moss” Mielichhoferia elongata reveals independent nad7 gene functionality loss. PeerJ. 2018;6:e4350.
Wang D, Wu Y-W, Shih AC-C, Wu C-S, Wang Y-N, Chaw S-M. Transfer of chloroplast genomic dna to mitochondrial genome occurred at least 300 mya. Mol Biol Evol. 2007;24(9):2040–8.
Wang X-C, Chen H, Yang D, Liu C. Diversity of mitochondrial plastid DNAs (MTPTs) in seed plants. Mitochondrial DNA Part A. 2018;29(4):635–42.
Warren JM, Simmons MP, Wu Z, Sloan DB. Linear plasmids and the rate of sequence evolution in plant mitochondrial genomes. Genome Biol Evol. 2016;8(2):364–74.
Kozik A, Rowan BA, Lavelle D, Berke L, Schranz ME, Michelmore RW, et al. The alternative reality of plant mitochondrial DNA: One ring does not rule them all. PLOS Genet. 2019;15(8):e1008373.
Sloan DB. One ring to rule them all? Genome sequencing provides new insights into the ‘master circle’ model of plant mitochondrial DNA structure. New Phytol. 2013;200(4):978–85.
Devey ME, Bell JC, Smith DN, Neale DB, Moran GF. A genetic linkage map for Pinus radiata based on RFLP, RAPD, and microsatellite markers. Theor Appl Genet. 1996;92(6):673–9.
. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc. Accessed 15 Dec 2019.
Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible read trimming tool for Illumina NGS data. Bioinformatics. 2014;30(15):2114–20.
White paper on de novo assembly in CLC Assembly Cell 4.0. QIAGEN, Aarhus, Denmark. 2016. https://digitalinsights.qiagen.com/files/whitepapers/whitepaper-denovo-assembly.pdf. Accessed 16 Aug 2016.
Sahlin K, Vezzi F, Nystedt B, Lundeberg J, Arvestad L. BESST - Efficient scaffolding of large fragmented assemblies. BMC Bioinformatics. 2014;15:281.
Pomerantz A, Peñafiel N, Arteaga A, Bustamante L, Pichardo F, Coloma LA, et al. Real-time DNA barcoding in a rainforest using nanopore sequencing: opportunities for rapid biodiversity assessments and local capacity building. Gigascience. 2018;7(4):giy033.
Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013;29(21):2669–77.
Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004;32(1):11–6.
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25(5):955–64.
Lagesen K, Hallin P, Rødland EA, Stærfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35(9):3100–8.
Mower JP. The PREP suite: predictive RNA editors for plant mitochondrial genes, chloroplast genes and user-defined alignments. Nucleic Acids Res. 2009;37:W253–9.
Smit AFA, Hubley R. RepeatModeler Open-1.0. 2008–2015 Available from: http://www.repeatmasker.org. Accessed 1 Feb 2015.
Abrusan G, Grundmann N, DeMester L, Makalowski W. TEclass--a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics. 2009;25(10):1329–30.
Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 2015;6(1):11.
Nussbaumer T, Martis MM, Roessner SK, Pfeifer M, Bader KC, Sharma S, et al. MIPS PlantsDB: a database framework for comparative plant genome research. Nucleic Acids Res. 2013;41(D1):D1144–51.
Wegrzyn JL, Lin BY, Zieve JJ, Dougherty WM, Martínez-García PJ, Koriabine M, et al. Insights into the loblolly pine genome: Characterization of BAC and fosmid sequences. PLoS One. 2013;8(9):e72439.
Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013–2015 Available from: http://www.repeatmasker.org. Accessed 1 Dec 2015.
We acknowledge support by the German Research Foundation (DFG) and the Open Access Publication Funds of the University of Göttingen.
This study was supported by research grants № 14.Y26.31.0004 from the Russian Federation Government for the “Genomics of the key boreal forest conifer species and their major phytopathogens in the Russian Federation” project and № 16–04-01400 from the Russian Foundation for Basic Research. OK was supported by TC4F and the KEMPE Foundations. Open Access funding enabled and organized by Projekt DEAL.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Putintseva, Y.A., Bondar, E.I., Simonov, E.P. et al. Siberian larch (Larix sibirica Ledeb.) mitochondrial genome assembled using both short and long nucleotide sequence reads is currently the largest known mitogenome. BMC Genomics 21, 654 (2020). https://doi.org/10.1186/s12864-020-07061-4
- Larix sibirica
- Long reads
- Mitochondrial genome
- Nucleotide sequence
- RNA editing