Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria
© Ekblom et al.; licensee BioMed Central Ltd. 2014
Received: 17 January 2014
Accepted: 9 June 2014
Published: 12 June 2014
Genome and transcriptome sequencing applications that rely on variation in sequence depth can be negatively affected if there are systematic biases in coverage. We have investigated patterns of local variation in sequencing coverage by utilising ultra-deep sequencing (>100,000X) of mtDNA obtained during sequencing of two vertebrate genomes, wolverine (Gulo gulo) and collared flycatcher (Ficedula albicollis). With such extreme depth, stochastic variation in coverage should be negligible, which allows us to provide a very detailed, fine-scale picture of sequence dependent coverage variation and sequencing error rates.
Sequencing coverage showed up to six-fold variation across the complete mtDNA and this variation was highly repeatable in sequencing of multiple individuals of the same species. Moreover, coverage in orthologous regions was correlated between the two species and was negatively correlated with GC content. We also found a negative correlation between the site-specific sequencing error rate and coverage, with certain sequence motifs “CCNGCC” being particularly prone to high rates of error and low coverage.
Our results demonstrate that inherent sequence characteristics govern variation in coverage and suggest that some of this variation, like GC content, should be controlled for in, for example, RNA-Seq and detection of copy number variation.
Many applications of high throughput sequencing, such as expression profiling , splicing inference , copy number variation (CNV) identification  and repeat element annotation  rely on observed variation in the depth of sequencing coverage within the genome or transcriptome. In addition, variance in coverage is of importance to de-novo assembly pipelines and read mapping strategies, since local regions of unusually high coverage may be interpreted as duplications or masked as repeat elements [5, 6]. As a consequence, applications such as these may be impaired by random fluctuation and systematic bias in sequencing coverage across the genome. However, the details and extent of these effects are currently not well understood.
It is known that the PCR step involved in sequencing-by-synthesis methods introduces coverage bias related to GC content [7–9], possibly due to the formation of secondary structures of single stranded DNA . Such GC dependent bias is seen on a wide variety of scales ranging from individual nucleotides to complete sequencing reads and even large (up to 100 kb) genomic regions . Systematic bias could also be introduced during the DNA fragmentation step or caused by DNA isolation efficacy , local DNA structure, variation in sequence quality and map-ability of sequence reads . Some efforts have been made to control for these biases during downstream computational analyses in various NGS (next generation sequencing) applications [13–16], and laboratory protocols have also been developed to reduce this problem .
In addition to variation in coverage, there may be sequence dependent variation in nucleotide specific error rates. Such systematic patterns of sequencing errors can also have consequences for downstream applications as errors may be taken for low frequency SNPs, even when sequencing coverage is high . GC rich regions and sites close to the ends of sequence reads typically show elevated errors rates  and it has also been shown that certain sequence patterns, especially inverted repeats and “GGC” motifs are associated with an elevated rate of Illumina sequencing errors . Such sequence specific miscalls probably arise due to specific inhibition of polymerase binding . Homopolymer runs cause problems for technologies utilising a terminator free chemistry (such as Roche 454 and Ion Torrent), and specific error profiles exist for other sequencing technologies as well . The effects of such technology specific error patterns on sequencing coverage and read assembly algorithms remains poorly described. As sequencing reads with high error rates are more likely to be removed during trimming stages, regions with high error rates may also get decreased sequencing coverage .
Mitochondrial DNA (mtDNA) is abundant in most cell types and whole genome, exome or transcriptome data from high throughput sequencing can therefore be efficiently mined for complete mitochondrial sequences, as shown in a number of recent studies [22–29]. Due to the haploid nature of the mitochondrial genome, such data is also comparatively easy to assemble and analyse. Here we take advantage of ultra-deep sequencing of mtDNA from two vertebrate species to evaluate fine scale variation in coverage and patterns of sequencing errors. We utilise data from several independently sequenced individuals in order to describe systematic coverage bias across the mtDNA sequence. Due to the extreme depth of our sequencing data (>100,000X), it is possible to provide very precise estimates of coverage bias and sequencing error rates. Furthermore, a comparison in sequence coverage between two distantly related species (one bird and one mammal) allows us to evaluate to what extent sequencing bias is conserved across evolutionary distant lineages.
Results and discussion
Sequencing the wolverine mitochondrial DNA
Information about sequencing coverage and read mapping for the sampled individuals
Total amount of trimmed sequencing data: no reads (no base pairs)
Number of reads mapping to mtDNA
Average mtDNA sequencing coverage
975 million reads (93.0 Gbp)
Additional ten individuals (mean)
152 million reads (16.4 Gbp)
952 million reads (93.8 Gbp)
Our consensus mtDNA sequence from the wolverine differed from the previously published wolverine mtDNA sequence  by 23 SNPs and 12 indels. Given the high sequencing coverage underlying the consensus sequence, we are fairly confident that it is error free. Moreover, no mitochondrial sequence variation in non-repetitive regions was detected in sequencing of 10 additional samples from different parts of the same Scandinavian population (see below). Since the previously sequenced individual was allegedly also sampled from the Scandinavian population , the observed discrepancies are most likely explained by sequencing errors in the NCBI entry. Complete mtDNA homogeneity has previously been found when analysing a small region of the mitochondrial genome in Scandinavian wolverines .
A k-mer count (k = 25) of mtDNA reads showed a peak at a multiplicity at 52,000 (the mean k-mer coverage of the sequencing data), representing a mean depth of sequencing coverage of 68,000X (see Materials and Methods for details of this calculation). Based on the shape of the bimodal distribution (Additional file 1: Figure S2), k-mers with multiplicity less than 8,000 were inferred to be the result of sequencing errors. Out of a total of 861 million k-mers in the data, 24 million were thus considered to be the result of sequencing errors. This gives an average per-base error rate of 0.11% across the whole mtDNA data.
Variation in coverage across the mtDNA sequence
Individual G and C nucleotides had lower average sequencing coverage (mean = 111,750) compared to A and T nucleotides (mean = 114,853; t = 10.51, df = 13,765, p < < 0.001, Additional file 1: Figure S4), also when controlling for GC content of the surrounding region (Additional file 1: Table S1). Under-representation of GC-rich regions has been suggested to be a result of bias in the initial PCR steps during library preparation [20, 36] and of miscalls originating from the sequencer . These local fluctuations in sequencing coverage may affect applications where coverage is used as a proxy for biological phenomena. For instance regions of high coverage (due to low GC content) could be more likely to be interpreted as genomic duplications in CNV analyses . Moreover estimates of expression levels of GC-poor genes may be inflated in RNA-Seq experiments . As GC-rich regions are common around promoters and transcription start sites, under-representation of sequencing there may hamper biological interpretation and annotation of low coverage genome data .
Reproducibility of sequencing coverage
Strong repeatability in the pattern of heterogeneity in coverage across a sequence would evidence that the variation is mainly due to inherent characteristics of particular sites or regions of that sequence, rather than representing stochastic variation in library preparation and/or amplification. One means to further test this is to investigate among-individual consistency in sequencing coverage. We therefore sequenced 10 additional wolverines to a mean trimmed read depth of 5.5X coverage for nuclear DNA and 4,200X coverage for mtDNA. The per-site variation in coverage of mtDNA was again considerable, ranging between three-fold and six-fold for the 10 individuals. Variation was highly correlated between the individual sequenced at high coverage and the additional 10 individuals (Spearman’s r ranging from 0.53 to 0.68, df = 15,786, p < < 0.001, Figure 1). The correlation was stronger when comparing among the 10 additional individuals (rS ranging from 0.70 to 0.91; paired t-test, t = −14.67, df = 9, p < 0.001). Several peaks and dips of coverage were concordant among the independently sequenced individuals (for example at around positions 2,000, 7,000 and 9,000; Additional file 1: Figure S5), even on a very fine scale. Other regions (around positions 4,600, 11,400 and 15,500) showed more variation among individuals (Additional file 1: Figure S6). It is thus evident that although there is some random fluctuations in coverage, a large proportion of the variation is due to systematic and sequence dependent biases.
Conservation of sequencing coverage bias across two distant vertebrate taxa
Site-specific variation in sequencing errors
For all positions (and in all sequenced individuals) in the wolverine mtDNA sequence we found that a small proportion of the reads had an aberrant nucleotide compared to the reference sequence. A large majority of sites have a non-reference nucleotide in ≈ 0.1% of the reads with a similar distribution across all eleven sequenced individuals. We interpret these alternative nucleotides as sequencing errors (rather than true mutations) based on their relatively low frequency and the fact that they were dispersed across all sequenced individuals and across the whole mtDNA sequence. An observed transition/transversion ratio close to one (Additional file 1: Table S2), distinctly different from what is normally observed in diversity or divergence data, further supports this interpretation. The estimated mean error rate (0.11%, see below) is very close to our expectation based on the k-mer count analysis (see above) and it is also in line with the specifications for HiSeq sequencing (http://www.illumina.com), representing a Qphred score of 30; and to that found in other studies after quality filtering . Six nucleotides had a very high number of one specific non-reference allele (1.3-67.9%) in a single individual only (Additional file 1: Table S3, Additional file 1: Figure S8), with all other individuals having a low (less than 0.1%) number of non-reference nucleotides at this site. We interpret this as evidence for heteroplasmy, i.e. mutations in the pool of mtDNA molecules transmitted between generations .
The mean site-specific error rate (measured as the count of a non-reference nucleotide divided by total sequencing depth at that position) of the wolverine individual sequenced at high depth was 0.11%, with a range between 0.028% and 2.45%. Sequencing errors thus occurred at all positions of the mitochondrial genome. G and C nucleotides had a slightly higher mean error rate compared with A and T nucleotides (t = −10.14, df = 15,550, p < < 0.001, Additional file 1: Table S2). Sequence specific errors (SSE) were previously found to be common in Illumina HiSeq reads with the highest rates seen at the motif “GGC”  and, in particular, “GGCNG” [36, 40]. Our data support these findings. All three sites with error rates above 1% included the sequence motif “CCNGCC” (a one-base pair extension of the reverse complement of the above) directly downstream of the high error position (Additional file 1: Figure S9). SSE are likely to bias estimates of sequence diversity as they will lead to falsely inferred SNPs in specific error prone sites. An efficient way of overcoming this problem would be to consider on which strand SNPs are detected (only calling SNPs if the variation is seen on both strands), as only reads from one direction should be affected by the increased error rate . It is however, important to note that this strategy requires sufficient sequencing coverage of both strands at the variable nucleotide position.
Alternatively, the link between error rates and sequencing coverage could also be an effect of imperfect mapping of reads with sequencing errors if mapping parameters are stringently set. Observed coverage is thus expected to decrease in such regions  and may be falsely interpreted as having copy number variation (for genomic sequencing) or reduced expression levels (for transcriptomic sequencing), if these factors are not properly controlled for. However, it should be noted that not all positions with an error prone sequence motif seems to be effected by sequencing coverage bias. For example, the sequence motif “CCNGCC” (and its reverse complement) occurs 22 times in the wolverine mtDNA sequence, and most of these show neither elevated error rates nor decreased sequencing coverage. In order to be able to apply proper controls for SSE bias it is thus important to elucidate why some, but not all, regions with certain sequence motifs show marked increases in sequencing errors. An important development for the future is to develop and utilise more precise methods to control for these biases in bioinformatics pipelines (including base calling, read trimming, error correction and mapping algorithms). Additionally, wet-lab based methods can be improved to produce a more even coverage across regions with different sequence characteristics . As different sequencing technologies show different signatures of bias, combining different types of sequencing data could be a suitable strategy to reduce these effects .
We have reported a striking variation in sequencing coverage across the mitochondrial genome. Reproducibility of coverage patterns within and across species provide evidence that this variation is largely due to intrinsic properties of the DNA sequence, with GC content as an important explanatory factor. Coverage is also related to local levels of sequencing error rates, with peaks in sequencing error also showing marked drops in coverage. Such error peaks are also often associated with certain error prone sequencing motifs. This highlights the importance of controlling for coverage bias when investigating sequencing data for applications such as RNA-Seq, CNV identification or whole genome sequencing.
Samples and sequencing
Sequencing was performed on a HiSeq2000 instrument (Illumina Inc) using TruSeq SBS v3 chemistry, according to the manufacturer’s protocols, with paired-end (PE) reads (length varying from 65 bp to 150 bp) and with insert sizes from 180 bp to 500 bp (Table 1). Base calling was done on the instrument by RTA 184.108.40.206 or 1.13.48 and the resulting .bcl files were converted to Qseq format with OLB-1.9.0 (Illumina Inc) and then demultiplexed, allowing for one mismatch base, and converted to fastq format with CASAVA-1.7.0 (Illumina Inc). Reads that did not pass the quality filtering were excluded.
We used whole genome sequencing data from one mammal (wolverine, G. gulo) and one bird (collared flycatcher, Ficedula albicollis). Wolverine data were obtained from one female sequenced to a nuclear genome depth of 45X (raw PE data; Ekblom et al., unpublished). DNA from this animal sampled in the province of Jämtland, Sweden in 2010 was extracted from muscle tissue using DNeasy Blood and Tissue kit (Qiagen). In addition, DNA samples were extracted from muscle tissue of 10 unrelated wolverine individuals (all males from different parts of the Scandinavian population) and were sequenced to approximately 10X nuclear coverage each. Flycatcher data were obtained from one focal individual (DNA sampled from muscle tissue with phenol-chloroform extraction) sequenced to 85X as described in . Raw sequence read data is available at NCBI SRA (accession number ERP001377).
Raw wolverine reads were quality trimmed using ConDeTri  and reads from the individual sequenced at high depth were then simultaneously mapped to a published mtDNA reference sequence for this species NCBI Accession number: NC_009685; , together with the draft genome assembly (Gugu1.0, R Ekblom unpublished), using BWA . A consensus mtDNA sequence was extracted with samtools and bcftools , and manually edited to include also insertions and deletions. The consensus was then used as reference when mapping reads from the additional individuals sequenced at lower coverage. The rationale of mapping reads to both the mitochondrial and the nuclear genome, despite our interest here only lies in mtDNA coverage, relates to the possible occurrence of nuclear copies of mtDNA sequences so-called numts; . Reads originating from such copies will preferentially map to the nuclear genome, when this is included in the reference, and will thus not contribute to the calculation of mtDNA coverage .
A collared flycatcher mtDNA reference sequence was first generated by mapping reads from the sequenced individual onto the mitochondrial genome of the closely related species Ficedula zanthopygia (NCBI Accession number: NC_015802). From this, a consensus sequence was extracted as above. We found several regions where the sequence differed notably between F. albicollis and F. zanthopygia. Manual editing of the sequence was performed iteratively, with mapping to the new consensus after every update until albicollis reads corresponded perfectly to the new consensus sequence. Finally, as for the wolverine, mapping was performed using both the mtDNA consensus sequence and nuclear genome assembly (fAlb15, GI:513788161) together as reference. Pileup files were extracted with samtools and merged between individuals and species. Since mtDNA is a circular molecule and given that BWA only map reads to a linear reference, we introduced a break in the mtDNA molecule. As a consequence, coverage in the end regions became progressively lower towards the start and end nucleotide of the reference. We therefore performed the mapping using a two hundred basepair overlap in the ends of the mtDNA reference sequence. There were also low coverage and high error rates in the region downstream of the cyt-b gene (i.e. the D-loop, or the control region), because of problems of correctly aligning reads to the long repeat region (bp 15,988-16,256). In order to avoid biases introduced by these mapping issues we used only data between position 29 and 15,816 for all downstream analyses of coverage and error rates (trimming away a total of 749 bp).
k-mer counting was done using the jellyfish (ver.1.1.6) software , with default settings and k-mer length set to 25 bp. Depth of sequencing (N) was calculated using the formula N = M*(L-k + 1)/L, where M represents the mean k-mer coverage, k is the k-mer length and L is the mean read length . Annotation of the mitochondrial consensus sequences was performed using the Mitos web interface  and visualised using CGView . For tests of correlation between GC content and coverage we used coverage from the individual sequenced at high depth and mean GC content in non-overlapping sliding windows of size 50 bp of the mitochondrial genome of the respective species. For the analysis of coverage at individual nucleotides we performed an ANCOVA analysis using nucleotide identity and gene type as categorical variables, and GC content of the surrounding region (50 bp window) as a continuous variable.
We identified regions of the mtDNA sequence that were homologous between wolverine and flycatcher using two-sequence megablast (all hit regions extracted using default settings; http://blast.ncbi.nlm.nih.gov), extracted these using in-house perl scripts and aligned them using MAFFT . Additional data handling and statistical analyses were performed using R 2.15 .
Availability of supporting data
The annotated consensus sequences for the complete mtDNA genomes sequenced in this study are available at NCBI [GenBank: KF415127.1 (wolverine), GenBank: KF293721.1 (collared flycatcher)].
We thank Malin Johansson and Jessica Magnusson for lab work and Pall I Olason for bioinformatic assistance. Two anonymous reviewers provided feedback on a previous version of this manuscript. Sequencing was performed by the SNP&SEQ Technology Platform (http://www.sequencing.se), Science for Life Laboratory at Uppsala University, a national infrastructure supported by the Swedish Research Council (VR-RFI) and the Knut and Alice Wallenberg Foundation. Computational infrastructure was provided by the Uppsala Multidiciplinary Centre for Advanced Computational Science (UPPMAX). This work was supported by the Swedish Environmental Protection Agency [grant number 235-12-11].
- Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Methods. 2009, 6 (11s): S22-S32. 10.1038/nmeth.1371.PubMed CentralPubMedView ArticleGoogle Scholar
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012, 7 (3): 562-578. 10.1038/nprot.2012.016.PubMed CentralPubMedView ArticleGoogle Scholar
- Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19 (9): 1586-1592. 10.1101/gr.092981.109.PubMed CentralPubMedView ArticleGoogle Scholar
- Yandell M, Ence D: A beginner's guide to eukaryotic genome annotation. Nat Rev Genet. 2012, 13 (5): 329-342. 10.1038/nrg3174.PubMedView ArticleGoogle Scholar
- Schatz MC, Delcher AL, Salzberg SL: Assembly of large genomes using second-generation sequencing. Genome Res. 2010, 20 (9): 1165-1173. 10.1101/gr.101360.109.PubMed CentralPubMedView ArticleGoogle Scholar
- Treangen TJ, Salzberg SL: Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet. 2011, 13 (1): 36-46.PubMed CentralPubMedGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36 (16): e105-10.1093/nar/gkn425.PubMed CentralPubMedView ArticleGoogle Scholar
- Aird D, Ross M, Chen W-S, Danielsson M, Fennell T, Russ C, Jaffe D, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12 (2): R18-10.1186/gb-2011-12-2-r18.PubMed CentralPubMedView ArticleGoogle Scholar
- Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP: Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014, 15 (2): 121-132. 10.1038/nrg3642.PubMedView ArticleGoogle Scholar
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, Altaf-Ul-Amin M, Ogasawara N, Kanaya S: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011, 39 (13): e90-10.1093/nar/gkr344.PubMed CentralPubMedView ArticleGoogle Scholar
- Benjamini Y, Speed TP: Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012, 40 (10): e72-10.1093/nar/gks001.PubMed CentralPubMedView ArticleGoogle Scholar
- van Heesch S, Mokry M, Boskova V, Junker W, Mehon R, Toonen P, de Bruijn E, Shull J, Aitman T, Cuppen E, Guryev V: Systematic biases in DNA copy number originate from isolation procedures. Genome Biol. 2013, 14 (4): R33-10.1186/gb-2013-14-4-r33.PubMed CentralPubMedView ArticleGoogle Scholar
- Cheung M-S, Down TA, Latorre I, Ahringer J: Systematic bias in high-throughput sequencing data and its correction by BEADS. Nucleic Acids Res. 2011, 39 (15): e103-10.1093/nar/gkr425.PubMed CentralPubMedView ArticleGoogle Scholar
- Davey JW, Cezard T, Fuentes-Utrilla P, Eland C, Gharbi K, Blaxter ML: Special features of RAD Sequencing data: implications for genotyping. Mol Ecol. 2012, 22 (11): 3151-3164.PubMed CentralPubMedView ArticleGoogle Scholar
- Wolf J, Bryk J: General lack of global dosage compensation in ZZ/ZW systems? Broadening the perspective with RNA-seq. BMC Genomics. 2011, 12 (1): 91-10.1186/1471-2164-12-91.PubMed CentralPubMedView ArticleGoogle Scholar
- Szatkiewicz JP, Wang W, Sullivan PF, Wang W, Sun W: Improving detection of copy-number variation by simultaneous bias correction and read-depth segmentation. Nucleic Acids Res. 2013, 41 (3): 1519-1532. 10.1093/nar/gks1363.PubMed CentralPubMedView ArticleGoogle Scholar
- Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G + C)-biased genomes. Nat Methods. 2009, 6 (4): 291-295. 10.1038/nmeth.1311.PubMed CentralPubMedView ArticleGoogle Scholar
- Meacham F, Boffelli D, Dhahbi J, Martin D, Singer M, Pachter L: Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics. 2011, 12 (1): 451-10.1186/1471-2105-12-451.PubMed CentralPubMedView ArticleGoogle Scholar
- Allhoff M, Schonhuth A, Martin M, Costa I, Rahmann S, Marschall T: Discovering motifs that induce sequencing errors. BMC Bioinformatics. 2013, 14 (Suppl 5): S1-10.1186/1471-2105-14-S5-S1.PubMed CentralPubMedView ArticleGoogle Scholar
- Ross M, Russ C, Costello M, Hollinger A, Lennon N, Hegarty R, Nusbaum C, Jaffe D: Characterizing and measuring bias in sequence data. Genome Biol. 2013, 14 (5): R51-10.1186/gb-2013-14-5-r51.PubMed CentralPubMedView ArticleGoogle Scholar
- Minoche A, Dohm J, Himmelbauer H: Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome Biol. 2011, 12 (11): R112-10.1186/gb-2011-12-11-r112.PubMed CentralPubMedView ArticleGoogle Scholar
- Rasmussen DA, Noor MAF: What can you do with 0.1 x genome coverage? A case study based on a genome survey of the scuttle fly Megaselia scalaris (Phoridae). BMC Genomics. 2009, 10 (1): 382-10.1186/1471-2164-10-382.PubMed CentralPubMedView ArticleGoogle Scholar
- Feldmeyer B, Hoffmeier K, Pfenninger M: The complete mitochondrial genome of Radix balthica (Pulmonata, Basommatophora), obtained by low coverage shot gun next generation sequencing. Mol Phylogenet Evol. 2010, 57 (3): 1329-1333. 10.1016/j.ympev.2010.09.012.PubMedView ArticleGoogle Scholar
- Nabholz B, Jarvis ED, Ellegren H: Obtaining mtDNA genomes from next-generation transcriptome sequencing: a case study on the basal Passerida (Aves: Passeriformes) phylogeny. Mol Phylogenet Evol. 2010, 57 (1): 466-470. 10.1016/j.ympev.2010.06.009.PubMedView ArticleGoogle Scholar
- Iorizzo M, Senalik D, Szklarczyk M, Grzebelus D, Spooner D, Simon P: De novo assembly of the carrot mitochondrial genome using next generation sequencing of whole genomic DNA provides first evidence of DNA transfer into an angiosperm plastid genome. BMC Plant Biol. 2012, 12 (1): 61-10.1186/1471-2229-12-61.PubMed CentralPubMedView ArticleGoogle Scholar
- Miller JM, Malenfant RM, Moore SS, Coltman DW: Short reads, circular genome: skimming SOLiD sequence to construct the bighorn sheep mitochondrial genome. J Hered. 2012, 103 (1): 140-146. 10.1093/jhered/esr104.PubMedView ArticleGoogle Scholar
- Blower DC, Hereward JP, Ovenden JR: The complete mitochondrial genome of the dusky shark Carcharhinus obscurus. Mitochondrial DNA. 2013, 24 (6): 619-621. 10.3109/19401736.2013.772154. Early OnlinePubMedView ArticleGoogle Scholar
- Hung C-M, Lin R-C, Chu J-H, Yeh C-F, Yao C-J, Li S-H: The de novo assembly of mitochondrial genomes of the extinct passenger pigeon (Ectopistes migratorius) with next generation sequencing. PLoS One. 2013, 8 (2): e56301-10.1371/journal.pone.0056301.PubMed CentralPubMedView ArticleGoogle Scholar
- Samuels DC, Han L, Li J, Quanghu S, Clark TA, Shyr Y, Guo Y: Finding the lost treasures in exome sequencing data. Trends Genet. 2013, 29 (10): 593-599. 10.1016/j.tig.2013.07.006.PubMed CentralPubMedView ArticleGoogle Scholar
- Miller FJ, Rosenfeldt FL, Zhang C, Linnane AW, Nagley P: Precise determination of mitochondrial DNA copy number in human skeletal and cardiac muscle by a PCR-based assay: lack of change of copy number with age. Nucleic Acids Res. 2003, 31 (11): e61-10.1093/nar/gng060.PubMed CentralPubMedView ArticleGoogle Scholar
- Robin ED, Wong R: Mitochondrial DNA molecules and virtual number of mitochondria per cell in mammalian cells. J Cell Physiol. 1988, 136 (3): 507-513. 10.1002/jcp.1041360316.PubMedView ArticleGoogle Scholar
- Arnason U, Gullberg A, Janke A, Kullberg M: Mitogenomic analyses of caniform relationships. Mol Phylogenet Evol. 2007, 45 (3): 863-874. 10.1016/j.ympev.2007.06.019.PubMedView ArticleGoogle Scholar
- Walker CW, Vilà C, Landa A, Lindén M, Ellegren H: Genetic variation and population structure in Scandinavian wolverine (Gulo gulo) populations. Mol Ecol. 2001, 10 (1): 53-63. 10.1046/j.1365-294X.2001.01184.x.PubMedView ArticleGoogle Scholar
- Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, Zhang Z, Zhang Y, Wang W, Li J, Wei F, Li H, Jian M, Li J, Zhang Z, Nielsen R, Li D, Gu W, Yang Z, Xuan Z, Ryder OA, Leung FC, Zhou Y, Cao J, Sun X, Fu Y, et al: The sequence and de novo assembly of the giant panda genome. Nature. 2010, 463 (7279): 311-317. 10.1038/nature08696.PubMed CentralPubMedView ArticleGoogle Scholar
- Ellegren H, Smeds L, Burri R, Olason PI, Backstrom N, Kawakami T, Kunstner A, Makinen H, Nadachowska-Brzyska K, Qvarnstrom A, Uebbing S, Wolf JBW: The genomic landscape of species divergence in Ficedula flycatchers. Nature. 2012, 491 (7426): 756-760.PubMedGoogle Scholar
- Sequence assembly with MIRA3, The Definitive Guide: [http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html]
- Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with next-generation sequencing. Nat Methods. 2009, 6 (11s): S13-S20. 10.1038/nmeth.1374.PubMedView ArticleGoogle Scholar
- Wilhelm BT, Landry J-R: RNA-Seq–quantitative measurement of expression through massively parallel RNA-sequencing. Methods. 2009, 48 (3): 249-257. 10.1016/j.ymeth.2009.03.016.PubMedView ArticleGoogle Scholar
- Chinnery PF, Thorburn DR, Samuels DC, White SL, Dahl H-HM, Turnbull DM, Lightowlers RN, Howell N: The inheritance of mitochondrial DNA heteroplasmy: random drift, selection or both?. Trends Genet. 2000, 16 (11): 500-505. 10.1016/S0168-9525(00)02120-X.PubMedView ArticleGoogle Scholar
- Isaacs FJ, Carr PA, Wang HH, Lajoie MJ, Sterling B, Kraal L, Tolonen AC, Gianoulis TA, Goodman DB, Reppas NB, Emig CJ, Bang D, Hwang SJ, Jewett MC, Jacobson JM, Church GM: Precise manipulation of chromosomes in vivo enables genome-wide codon replacement. Science. 2011, 333 (6040): 348-353. 10.1126/science.1205822.PubMedView ArticleGoogle Scholar
- Oyola S, Otto T, Gu Y, Maslen G, Manske M, Campino S, Turner D, MacInnis B, Kwiatkowski D, Swerdlow H, Quail M: Optimizing illumina next-generation sequencing library preparation for extremely at-biased genomes. BMC Genomics. 2012, 13 (1): 1-10.1186/1471-2164-13-1.PubMed CentralPubMedView ArticleGoogle Scholar
- Smeds L, Künstner A: ConDeTri - A content dependent read trimmer for Illumina data. PLoS One. 2011, 6 (10): e26314-10.1371/journal.pone.0026314.PubMed CentralPubMedView ArticleGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralPubMedView ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralPubMedView ArticleGoogle Scholar
- Richly E, Leister D: NUMTs in sequenced eukaryotic genomes. Mol Biol Evol. 2004, 21 (6): 1081-1084. 10.1093/molbev/msh110.PubMedView ArticleGoogle Scholar
- Marçais G, Kingsford C: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011, 27 (6): 764-770. 10.1093/bioinformatics/btr011.PubMed CentralPubMedView ArticleGoogle Scholar
- Bernt M, Donath A, Jühling F, Externbrink F, Florentz C, Fritzsch G, Pütz J, Middendorf M, Stadler PF: MITOS: improved de novo metazoan mitochondrial genome annotation. Mol Phylogenet Evol. 2012, In Press Available onlineGoogle Scholar
- Grant JR, Stothard P: The CGView server: a comparative genomics tool for circular genomes. Nucleic Acids Res. 2008, 36 (suppl 2): W181-W184.PubMed CentralPubMedView ArticleGoogle Scholar
- Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 2008, 9 (4): 286-298. 10.1093/bib/bbn013.PubMedView ArticleGoogle Scholar
- R Core Team: R: A Language and Environment for Statistical Computing. 2013, Vienna: R Foundation for Statistical ComputingGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.