Volume 15 Supplement 3
Extraction and annotation of human mitochondrial genomes from 1000 Genomes Whole Exome Sequencing data
© Diroma et al.; licensee BioMed Central Ltd. 2014
Published: 6 May 2014
Whole Exome Sequencing (WES) is one of the most used and cost-effective next generation technologies that allows sequencing of all nuclear exons. Off-target regions may be captured if they present high sequence similarity with baits. Bioinformatics tools have been optimized to retrieve a large amount of WES off-target mitochondrial DNA (mtDNA), by exploiting the aspecificity of probes, partially overlapping to Nuclear mitochondrial Sequences (NumtS). The 1000 Genomes project represents one of the widest resources to extract mtDNA sequences from WES data, considering the large effort the scientific community is undertaking to reconstruct human population history using mtDNA as marker, and the involvement of mtDNA in pathology.
A previously published pipeline aimed at assembling mitochondrial genomes from off-target WES reads and further improved to detect insertions and deletions (indels) and heteroplasmy in a dataset of 1242 samples from the 1000 Genomes project, enabled to obtain a nearly complete mitochondrial genome from 943 samples (76% analyzed exomes). The robustness of our computational strategy was highlighted by the reduction of reads amount recognized as mitochondrial in the original annotation produced by the Consortium, due to NumtS filtering.
An accurate survey was carried out on 1242 individuals. 215 indels, mostly heteroplasmic, and 3407 single base variants were mapped. A homogeneous mismatches distribution was observed along the whole mitochondrial genome, while a lower frequency of indels was found within protein-coding regions, where frameshift mutations may be deleterious. The majority of indels and mismatches found were not previously annotated in mitochondrial databases since conventional sequencing methods were limited to homoplasmy or quasi-homoplasmy detection. Intriguingly, upon filtering out non haplogroup-defining variants, we detected a widespread population occurrence of rare events predicted to be damaging. Eventually, samples were stratified into blood- and lymphoblastoid-derived to detect possibly different trends of mutability in the two datasets, an analysis which did not yield significant discordances.
To the best of our knowledge, this is likely the most extended population-scale mitochondrial genotyping in humans enriched with the estimation of heteroplasmies.
Mitochondrial DNA (mtDNA) polyploidy is a physiologic trait of human cells and implicates the possibility of the co-existence of different mtDNA genotypes within the same cell, tissue, or individual, a condition known as heteroplasmy. Up to date, quantification of heteroplasmy remains a challenging task in the characterization of mitochondrial variants and a limit for conventional sequencing methods . The advent of Next Generation Sequencing (NGS) technologies has revolutionized the field of genomics, providing the possibility of unprecedented large-scale and high-throughput analyses. Indeed, massive-parallel sequencing implies ultra-deep yields, allowing the quantification of mitochondrial heteroplasmic variants [2, 3]. One of the recent applications of NGS is Whole Exome Sequencing (WES), a powerful and quite cost-effective strategy to perform targeted deep sequencing of genomic protein coding regions . Even though the most recent WES protocols include the use of specific baits targeted to mtDNA, the majority of kits currently used is devoted to the enrichment of nuclear-coding DNA, while mtDNA targeting has mostly been neglected . Nonetheless, it was recently demonstrated that the precious information of mitochondrial genotype may be retrieved from off-target DNA in human WES studies, even when designed for nuclear DNA exclusively . It was indeed observed that the overlapping of nuclear probes onto nuclear mitochondrial sequences (NumtS ) determines a cross-hybridization of such baits with mtDNA, which in turn is brought along as a 'contaminant' . The natural abundance of the mitochondrial molecules in cells allows to achieve a high read depth, so that a recovery and assembly of the mtDNA genome from nuclear WES studies is indeed feasible  together with the quantification of heteroplasmy wherever the mitochondrial genome is sufficiently covered.
The relevance of estimating mitochondrial heteroplasmy is further highlighted by the fact that mtDNA mutations exert their phenotypic effect above a certain mutation load threshold [8, 9], which may vary depending on the type of change [8, 10] and tissue. Indeed, several studies demonstrated that mtDNA mutations are functionally recessive until the mutant load exceeds a specific threshold and leads to a biochemical dysfunction [8, 9, 11]. In fact mitochondrial mutations are largely involved in various diseases, aging and cancer . In addition, the finest quantification of heteroplasmy among familiar lineages is helpful for forensic studies  and to better understand mechanisms of intergenerational segregation, especially in the case of maternal transmission of mutations predisposing to mtDNA disorders .
Even though since 1995 it is known that heteroplasmy in normal individuals may not be a rare biological status , only recent surveys on mitochondrial genotyping and heteroplasmy annotation with deep sequencing have revealed that in normal human cells a widespread heterogeneity of mtDNA variants co-existence occurs in healthy subjects and varies among tissues . Moreover, a condition of 'universal heteroplasmy' was depicted by Chinnery and colleagues in their recent work  in which they observed, by using high-throughput technologies, the presence of very low-level heteroplasmic variants in related and unrelated individuals, likely due to inherited or somatic events, not predicted to be pathogenic.
So far, consistently with the limited sensitivity of strategies and with the restricted population sampling available for mitochondrial genotyping, commonly used mitochondrial portals and databases [17, 18] do not report heteroplasmy values for the mutations/variants whose fraction has been reported in literature. However, while such information is lacking in databases, nucleotide and amino acid sites variability is at least available within HmtDB  as a valuable information that contributes both to the definition of haplogroups and to the recognition of private variants or mutations with a potential pathogenic role.
In this work we used a pipeline previously provided , that we further implemented for insertions and deletions (indels) calling and applied to the in silico extraction and characterization of mitochondrial DNA from nuclear WES studies, comprising 1242 exome samples of the 1000 Genomes project  belonging to 19 different human populations. 1000 Genomes Project WES studies represent one of the widest nuclear genomic resources available to the scientific community, yet strikingly, up to now, mtDNA genotyping has been disregarded, despite the important contribution the small mitochondrial genome provides to the cell homeostasis. We were able to recover mtDNA from all WES studies and to assemble the complete mitochondrial genome for the majority of analyzed samples. High quality metrics were used to filter mismatches and indels and related heteroplasmy values of filtered variants were recorded. We enriched these data with a functional characterization, firstly detecting haplogroup and non-haplogroup defining sites. Variability estimation reported in the human mitochondrial database and predictions of pathogenicity [20–22] were also applied to non-haplogroup defining variants, revealing an intriguingly heterogeneous scenario among populations. We also performed a comparison between lymphoblastoid cell lines (LCLs) and blood-derived mitochondrial DNA isolated from the group of available samples, in order to assess the possible presence of genotype differences among the two different DNA sources, as previously tested for nuclear DNA [23–25]. Finally, in line with the recent finding of the widespread distribution of heteroplasmy among healthy individuals, we found a consistent enrichment, among all 19 human populations screened, of very low-levels heteroplasmies, mostly unshared between subjects and associated with very low variability values reported in HmtDB, which more likely represent a bulk of private variants carried by each individual.
Features of the 19 populations analyzed in this study.
N. of Samples
Mitochondrial Mean Coverage
Mitochondrial Mean Per Base Depth
African Caribbean in Barbados
African Ancestry in Southwest USA
Chinese Dai in Xishuangbanna, China
Utah residents with ancestry from northern and western Europe
Han Chinese in Beijing, China
Han Chinese South, China
Colombian in Medellin, Colombia
Finnish in Finland
British From England and Scotland, UK
Gujarati Indians in Houston, Texas, USA
Iberian populations in Spain
Japanese in Tokyo, Japan
Kinh in Ho Chi Minh City, Vietnam
Luhya in Webuye, Kenya
Mexican Ancestry in Los Angeles, California, USA
Peruvian in Lima, Peru
Puerto Rican in Puerto Rico
Toscani in Italia
Yoruba in Ibadan, Nigeria
Mitochondrial sequences recovery from 1000 Genomes WES studies, variant calling and heteroplasmy assessment
We adopted a pipeline previously published  in order to recover mtDNA from off-target sequences of WES studies, removing reads similar to NumtS (Nuclear Mitochondrial Sequences)  and maintaining only reads with a univocal mapping onto the mitochondrial genome. Original mitochondrial BAM (Binary SAM, Sequence Alignment/Map format) files from the 1000 Genomes Project were first converted in fastq files using the SamToFastq module of the Picard suite of tools (v.1.68)  to apply the abovementioned pipeline, which requires the usage of SAMtools (v.0.1.19) , GSNAP (version 2012-01-11)  and python 2.7.1 . After read mapping on the mitochondrial genome, using rCRS as reference sequence, the MarkDuplicates module of the Picard tools was applied on obtained SAM files in order to remove PCR duplicates. Duplicates filtering was just recommended in the adopted pipeline, but we considered this step crucial for a correct allelic quantification in our analysis. We extended variant call analysis also to indels, since they were not completely taken into account by the pipeline we used. In particular, we observed that it was targeted only for the detection of homoplasmic deletions, disregarding heteroplasmic indels. Moreover, the tool previously developed did not automatically quantify heteroplasmic fractions (HF) that we instead implemented in our system. We thus implemented a workflow in python programming language to parse the SAM file CIGAR string  associated to each pairwise aligned read, holding information on indel events. Mismatches were instead analyzed directly from the mtDNAassembly-table.txt output of the adopted pipeline. Filtering of insertion events was performed using quality score (QS≥25) and read depth (rd≥5) cut-off values per position previously adopted , whereas the goodness of a deletion event was evaluated on the basis of its 5bp-long upstream and downstream flanking region. Indeed, a median QS≥25 and rd≥5 were required for the flanking regions to consider such deletion in the genome assembly. HF was assessed as the fraction of the variant read depth onto the total mitochondrial read depth of the same position (for mismatches and deletions) or of the 5' flanking position (for insertions).
The adopted pipeline allowed also to produce a consensus mitochondrial sequence for each individual, including only mismatches equal or above a HF threshold of 0.75, otherwise the corresponding IUPAC character for single nucleotide variants was reported . This consensus was used for the haplogroup assignment and for the subsequent analysis of pathogenicity linked to non-synonymous mismatches considering also all those variants under the fixed threshold annotated with IUPAC code.
All mitochondrial unique variants detected in this study are reported in a BED (Browser Extensible Data) table (Additional file 2), generated from a VCF file which is the output of the python script we developed for the detection of indels and heteroplasmic fraction of mitochondrial variants. The python script and the VCF file (reporting all the detected mitochondrial variants integrated with the related heteroplasmic fractions) are available upon request.
Haplogroup assignment of 943 mitochondrial consensus sequences was performed using the stand-alone mt-classifier tool, implemented in HmtDB , which uses the Reconstructed Sapiens Reference Sequence (RSRS)  as mitochondrial reference.
The samples were chosen among the best assembled mitochondrial genomes within our dataset, harboring less than 500 gaps in their sequence. The mt-classifier version used refers to the Phylotree  mtDNA tree build 15.
Mitochondrial reference databases
Site variability values calculated on mitochondrial nucleotide multi-aligned sequences from nearly 10,000 healthy individuals, were downloaded from HmtDB web site . Variability scores contributed to the recognition of private variants or mutations with a potential pathogenic role: they ranged from 0 to 1, where low variability values may be suggestive of a novel haplogroup-defining variant or a rare disease-linked mutation, while high variability scores were owned by common alleles within genomes stored in HmtDB.
Mitochondrial coding and control regions point mutations with reports of disease-associations were available in Mitomap .
Pathogenicity predictions on variants identified within the analyzed dataset were estimated using MutPred , Polyphen-2  and SNPs&GO  online software, producing 6 different scores and related prediction classes for each variant. Pathogenicity values, ranging from 0 to 1, allowed discriminating between potentially pathogenic and neutral mutations. Each of the abovementioned software presents a specific threshold of pathogenicity. We considered 0.70 as threshold for high pathogenicity score by MutPred as suggested by Pereira et al.  and "disease" prediction class by the other ones.
Hierarchical clustering analysis
A hierarchical clustering based on Euclidean distance estimation was performed to evidence shared classes of heteroplasmy within individuals with the aim to identify a possible clusterization of samples belonging to the same population sharing a similar number of variants referred to the same heteroplasmic range. The heatmap representation was used to visualize hierarchical clustering by rows and columns, using a matrix of numbers of variants belonging to 11 different classes of heteroplasmy, as raw input data. The maPalette function from the marray R package version 3.0.1 was used to define heatmap colors, while the heatmap R function was used to draw the heatmap.
Lymphoblastoid cell lines and blood subsets
We identified 60 potential blood samples among our 943 best assembled mitochondrial genomes and chose 60 potential LCL samples on the basis of Epstein-Barr virus (EBV) coverage values provided by the Consortium . Indeed an EBV coverage equal to 0 might indicate blood derived DNA, while LCL samples were selected among those with the highest EBV coverage (in this case, higher than 400).
Spectra of variants (insertions, deletions, mismatches) within blood and LCL samples were compared with one-tailed and two-tailed Student t-tests. Fisher's test was applied to the distributions of blood and LCL heteroplasmic variants within mitochondrial loci and also to the distributions of variants within the same datasets per class of heteroplasmic fractions.
Coverage and quality of assembled mitochondrial genomes
We downloaded aligned exome data (as BAM files) related to 1242 individuals of the 1000 Genomes Project from the public repository . Sequence reads were extracted from the BAM files and re-aligned to the human reference genomes to assemble mitochondrial genomes for all the samples by applying Picardi's pipeline .
The first step of our analysis was the assessment of quality and coverage of the reconstructed mitochondrial genomes. We found that 76% of them showed a nearly complete assembly (mitochondrial genome coverage >97%). The robustness of the computational strategy we used was highlighted by the reduction of the read count (an average of 30% per sample) mapped to the reference mitochondrial genome with respect to the original annotation by the 1000 Genomes Consortium. Two different factors contributed to the reduction of the original number of reads, namely PCR duplicates removal and, particularly, NumtS filtering , disregarded in the mapping step performed by the Consortium.
The mitochondrial assembly showed different mean coverage values within the 19 analyzed populations (Table 1): they ranged from 99.91% to 47.30% with respect to the entire mitochondrial genome. Mean per base depth was also widely variable, ranging from 409.61 to 25.12. Detailed data about mitochondrial genome coverage of each sample are reported in Additional File 1.
Coverage and quality score (QS) statistics were performed for each mitochondrial locus by estimating median read depth and QS (Additional File 3 and Additional File 4). As already reported , a better efficiency in mitochondrial reads extraction was obtained with the Agilent enrichment kit, as suggested by comparing the highest mitochondrial depth values obtained within the same locus (MT-ND6) through the Agilent (274.57X) and the NimbleGen samples (55.67X). Furthermore, Agilent mitochondrial reads showed slightly higher mean QS than NimbleGen (34.62 vs 32.22), with the maximum and the minimum QS observed within the same loci, MT-TS2 (35.82 vs 33.13) and MT-TR (34.30 vs 31.47) respectively. Intergenic regions within Agilent and Nimblegen samples presented the same trend of QS and coverage values as of D-Loop and coding regions (data not shown).
To verify the robustness of our protocol, we further selected a subset of 28 samples with a mitochondrial genome coverage >99%, which were previously analyzed with both low coverage Whole Genome Sequencing (WGS)  and Exome Sequencing projects by the Consortium . The dataset included 2 samples for 14/19 populations. None of the analyzed exome samples was found within the low coverage dataset for the remaining 5 populations. We considered the subset of 28 samples (indicated as "Low Coverage Control" in Additional File 1) with the aim of identifying possible disagreements about variants detected by the Consortium and our pipeline. We were able to detect up to 93.53% of mitochondrial variants per sample reported within the low coverage Variant Call Format (VCF) file generated by the Consortium (Additional File 5), while 45/89 were missed with exome samples as they were not sufficiently covered. Moreover, on average 3 variants per sample were not identified within exome samples, although their positions were sufficiently covered, whereas we were able to find more than 10 variants per sample not previously annotated with low coverage analysis by the Consortium. The latter were all heteroplasmic and about the 70% presented a HF<0.10, proving the sensitivity of our pipeline.
Haplogroup assignment of reconstructed mitochondrial genomes
The haplogroup assignment procedure helped to test the quality of reconstructed mitochondrial sequences, since the geographical areas of sampled individuals were known. To this aim, we selected 943 out of 1242 samples (Methods section), which were analyzed by applying the mt-classifier implemented in HmtDB . Generally the predicted haplogroups seemed to converge with individual ethnicity.
About 93% of the 943 reconstructed sequences presented a prediction reliability P_Hg >90%, defined as the highest fraction of Nph (recognized SNPs in the sequence which define haplogroup (Hg) according to Phylotree (ph) classification ) over the total number of the haplogroup-defining expected sites (Nph_exp) . This high percentage of true-positive haplogroup-defining variant sites confirmed the high sensitivity of our method. Assigned haplogroups are reported in Additional file 1.
Indels and mismatches recognition, mapping and in silico validation
About 56% samples harbored at least one deletion and 64% presented an insertion (Additional File 7). The most common events were single-base insertions (in 69% cases) and deletions involving up to 3 bases (in 50% cases; data not shown). However their distribution normalized to the length of each mitochondrial locus showed high peaks of indels ratio within the D-loop, while a lower number was present within coding regions, where frameshift mutations may be deleterious (Figure 1).
Interestingly, a certain amount of deletions was detected also within genes for transfer RNA. Intergenic regions, not shown in Figure 1, included 6 heteroplasmic insertions, 3 heteroplasmic deletions and 2 homoplasmic deletions.
The majority of the indels identified (72.09%) occurred within homopolymeric stretches (Additional file 8), defined as regions with the same nucleotide in at least two adjacent positions. The shortest homopolymers harbored the highest number of indels, although 5-bases stretches presented high levels of both deletions and insertions too, with the exception of G-stretches. This result was quite expected considering the homopolymeric nucleotide compositional bias of the mitochondrial genome, and that almost all of the homopolymers (70%) within the mitochondrial reference genomes (rCRS and RSRS) are represented by two-bases stretches. Moreover, up to 25% of the identified indels mapped within mitochondrial tandem repeat regions, where the onset of mutational events is favored [43, 44].
Summary related to indels and mismatches detected within the 1242 analyzed mitochondrial genomes
% Annotated in Healthy
% Annotated in Patients
% Annotated in Healthy
% Annotated in Patients
Pathogenicity analysis of private mutations
After haplogroup assignment, we isolated more than 2500 variants not defining the individual haplogroup, then functionally annotated. Predictions of pathogenicity were carried out on the subset of 543 non-synonymous variant positions using MutPred , Polyphen-2  and SNPs&GO  tools with the aim of identifying potentially damaging alleles and their spread within the 19 populations (Additional File 11 and Additional File 12). All the 543 missense mismatches were classified according to the ranges of heteroplasmy, so that 339 were homoplasmic and 309 heteroplasmic. Above all, MT-ND5 appeared to harbor the highest frequency of non-synonymous changes. Occurrence of variants in the reconstructed mitochondrial genomes was not exclusive for a particular world area, but generally a variable number of samples belonging to different populations shared the same homoplasmic and quasi-homoplasmic/highly heteroplasmic variants, while low-level heteroplasmies were detected only in single individuals and may hence be likely considered personal mutations. For example, the m.13105A>G event was shared by 106 samples belonging to 14 populations, with the lowest HF equal to 0.87 (Additional File 11). We focused only on 289 variant sites (148 found in a homoplasmic state, 183 as heteroplasmic) predicted as probably pathogenic by at least one of the above mentioned software (Additional File 12) and interestingly they all showed very low variability (mean value = 0.02 for homoplasmic mutations, 0.03 for the heteroplasmic ones). Among potentially damaging events, the lowest HF value was 0.25 (Additional File 11). The most shared variants within this subset were all in a homoplasmic/highly heteroplasmic state. Only 14 non-shared mismatches were predicted as damaging mutations by the three software, and among them the m.3946G>A was also annotated in Mitomap as associated to MELAS (Mitochondrial Encephalomyopathy, Lactic Acidosis and Stroke-like episodes). They all presented a low variability value (<0.20) and a HF ranging from 0.29 to 1.00.
The complete list of mitochondrial unique variants identified is reported in Additional File 2.
Analysis of LCLs and blood samples
Number of variants within LCL and blood subsets, sorted by heteroplasmic fraction and variability score.
HF = 1
Number of heteroplasmic and homoplasmic variants within blood and LCL samples, sorted by heteroplasmic fraction
N. of Variant Positions in Blood Samples
N. of Variant Positions in LCL Samples
In this paper we report the results obtained by capturing human mtDNA sequences starting from 1000 Genomes Whole Exome Sequencing (WES) data specifically targeted to nuclear genes. WES has been commonly used with clinical aims since its first application , although this technology has been also implemented in evolutionary comparison of genomes . Even though the most recent WES protocols include the use of specific baits targeted to the whole mitochondrial genome as well as the MitoCarta set of nuclear genes , the probe sets used in the majority of previously published WES studies excluded such baits, hence mtDNA sequences have been considered by-products and thereby neglected.
For the first time, we here recovered the overlooked mitochondrial information and reconstructed a large number of human mitochondrial sequences from off-target exome data, thanks to the availability of population-scale sequence datasets offered by the 1000 Genomes Project . This was possible through the application of a bioinformatics pipeline aimed at assembling mitochondrial genomes from exome data and then at annotating insertions, deletions and mismatches, accompanied by their heteroplasmy fraction, a series of tasks that were not feasible so far and that are necessary when dealing with mtDNA data. Overall we managed to extract 1242 assembled mtDNA, testing the validity of the protocol on the basis of coverage and quality data of the mitochondrial genomes that were generated. Further evidences of the robustness of our method were provided by the accurate haplogroup assignment, which reflected individual ethnicity, and by the comparison with the set of variants within 28 samples selected from the mtDNA VCF of low-coverage whole genome sequencing data  generated by the Consortium. It has to be underlined that a fraction of mitochondrial variants (6.47%) annotated in the low coverage report was not identified by our protocol. This discrepancy may be likely due to variant artefacts as a consequence of WES probes overlapping NumtS, which were not filtered out in the 1000 Genomes project. On the other hand Whole Genome Sequencing protocols ensure the availability of probes mapping on the entire mitochondrial genome, therefore some missing variants derived from the exomes analyzed are linked to an incomplete covering of any mitochondrial genomes. Another possible source of discrepancy may be represented by sequencing errors ruled out by our protocol as associated to quality scores lower than the minimum threshold required for variant calling. The analysis of the reconstructed mitochondrial genomes highlighted a widespread distribution of polymorphisms in healthy samples. Noteworthy, a parallelism may be observed between the enrichment in damaging and probably damaging rare variants within nuclear low frequency alleles of the 1000 Genomes Low Coverage  and Exon Pilot Projects , and the numerous group of mitochondrial pathogenic predicted alleles [20–22] and mutations with a confirmed disease association [17, 18], detected in our dataset.
In fact, Exon Pilot rare variants  observed among continents highlighted a reduction in the degree of allele sharing, as well as a very low variability was associated to homoplasmic/heteroplasmic mitochondrial variants we predicted as damaging, that were mainly observed in sparse unrelated individuals. Although low variability may be affected by a bias in the world population sampling, this finding may suggest that, similarly to the nucleus, negative selection is acting on these mitochondrial sites.
Furthermore, similarly to the results on nuclear DNA reported by Xue et al. , a possible explanation of the presence of pathogenic mitochondrial variants in healthy samples in a heteroplasmic condition below the specific threshold for the onset of the disease , at least at the time of recruitment, may be linked to the age of disease onset or to an incomplete penetrance due to the absence in the subject of other factors contributing to the disease; a further explanation may be linked to the presence of erroneous annotations of disease-associated variants in databases. Indeed only a few variants are currently annotated as unequivocally disease-causative, i.e. mainly Leber's Hereditary Optic Neuropathy (LHON) and Mitochondrial Encephalomyopathy, Lactic Acidosis and Stroke-like episodes (MELAS) mutations , while there is a high degree of uncertainty about the correct disease association for most of the mitochondrial mutations reported in literature especially since most of the time functional studies are lacking. Moreover, no information about the level of heteroplasmy is reported in mitochondrial databases and this is becoming an urgent need, considering the relationship between heteroplasmy threshold and clinical manifestation of the disease [8, 11] and the ever increasing power of NGS technology in detecting also minor alleles in heterogeneous pools of variants [1, 16, 45, 46]. As recently shown and in line with our findings, a 'universal heteroplasmy' of low-level heteroplasmic variants is present in healthy humans , which may not be disregarded. In this population-scale scenario of variety, a common event recorded within our dataset is length heteroplasmy, that is the coexistence of different lengths of the same homopolymeric stretch within the same individual, mainly occurring within mtDNA control regions and present at a lower frequency in the coding regions. Although coding regions are highly populated with homopolymeric stretches, the lower occurence of length heteroplasmic variants may be due to a negative selection process acting to avoid function disruption.
As general concerns were expressed about the use of LCLs-derived DNA for sequencing studies , we further sought differences in LCL versus blood sample subsets. To the best of our knowledge, no previous study showed reports on LCLs and blood mtDNA comparison. Although in our analyses we lacked paired samples of blood and LCLs from the same subject, we disposed of numerous groups of blood and LCL samples from several unrelated individuals (except for one trio), however sufficient to identify qualitative and quantitative gaps between the two, considering that EBV transformation may result in low-level generation of de novo mutations [23–25]. Several works on whole exome/genome sequencing of LCLs/blood were all in agreement that the great majority of variants found in paired samples is shared (99% at least) [23–25], except for a minimal percentage of de novo variants, also found in low-passages LCLs . A comparable result was observed in our mitochondrial blood/LCL samples: the average number of heteroplasmic variants per type of sample was the same (about 7), even though we solely observed a difference in the numerical distribution of heteroplasmic variants along mitochondrial loci (Figure 4). However 80% variants shared by both datasets (25% of all the detected variants) were homoplasmic and the only numerical, albeit not statistically significant discrepancy was recorded for very low-level heteroplasmic alleles (<10%), mostly enriched in LCLs. Moreover, low variability sites abound among non-shared variants, which belong to the low-level heteroplasmy range of variants observed also in blood samples, likely a more reliable DNA source than LCLs. This finding is consistent with the general observation among all 19 human populations screened in this study of a specific enrichment in very low-levels heteroplasmies and, in a mirrored fashion, quasi-homoplasmies (see Figure 3 and Additional File 10). As indeed demonstrated for both pathogenic predicted and LCL/blood variants analyzed, the majority of low-levels heteroplasmies are unshared between individuals and also associated with very low variability, thus more likely represent a bulk of private variants carried by each individual, as previously suggested . Moreover, in this fashion, our analysis suggests reliability of 1000 Genomes LCL samples and more generally of early passage EBV-transformed lymphocytes for mitochondrial genotyping, although validation with specific molecular methods for the detection of low heteroplasmies  and mtDNA targeted resequencing of blood-LCL paired-samples would be recommended.
An interesting finding in our analyses concerned the detection of shared potentially pathogenic variants. These results raise several questions on the definition of pathogenicity for an mtDNA change, a debate that has recently sparked within the mtDNA community and that calls for an in-depth thinking. The pathogenicity parameters assigned to nuclear variants may not apply sic to mitochondrial ones, in the same fashion as the concept of polymorphism. Because of the peculiarities of mitochondrial polyplasmic genetics, the assignment of pathogenicity ought to take into account the degree of heteroplasmy, the haplogroup/haplotype context, and even environmental factors. For instance, a report describes a common European variant as predisposing to breast cancer in African women, which puts a red alert sign on the need for clarification in this field . Overall, unless supported by clear-cut functional studies, mtDNA variants ought not to be considered pathogenic independently on the abovementioned factors. These considerations also draw attention on the true existence of a variant. On one side, it appears striking that we detected a large number of previously unreported variants, in a database that contains nearly 10,000 mitochondrial genomes, a large number of which at very low heteroplasmy levels. This may not appear surprising since they may be detected exclusively if highly sensitive methods are applied, such as deep sequencing. Nonetheless, one ought to consider that the increasing sensitivity of sequencing strategies allows the detection of low heteroplasmy degrees, but a limit is reached across which it becomes impossible to distinguish between a true low heteroplasmy and errors, intuitively. From a biological standpoint, extremely low heteroplasmic variants, being these potentially pathogenic or not, may rarely have a physio-pathological meaning (unless they may become amplified during transmission to offspring). We may not assume that all variants here detected in blood-extracted samples were germ-line, as blood is indeed a somatic tissue. With this in mind, one has to ask whether the occurrence of very low variants in a somatic tissue may be more than a completely random transient event, and therefore whether annotation of such variants is a necessary task. Theoretically, every position in the mtDNA may vary in a single copy and be detected before evolution or random drift act to amplify it or to revert it, therefore a consensus must be reached for annotation of very low heteroplasmic data.
Overall, we believe that the accurate annotation of the heteroplasmy degree of mitochondrial variants, which is of paramount importance especially in clinical studies, should be better implemented in up-to-date mitochondrial databases. Indeed, the combined use of several information, such as the health status of individuals, the biological source of DNA, the heteroplasmic status of variants, the prediction of pathogenicity and the haplogroup assignment should be a valuable tool for the definition of general criteria of prioritization of mitochondrial disease-associated variants.
To the best of our knowledge, this is likely the most extended population-scale mitochondrial genotyping in humans, enriched with the estimation of heteroplasmies. We used a pipeline to extract and characterize mitochondrial genomes from 1000 Genomes Whole Exome Sequencing data, previously disregarded in these individuals. The application of our protocol offers also the relevant opportunity to further use this information, coupled with the nuclear genotype, for nuclear/mitochondrial coevolution studies both in health and disease.
List of abbreviations
Whole Exome Sequencing
Nuclear Mitochondrial Sequences
insertions and deletions
Next Generation Sequencing
lymphoblastoid cell line
Genome Reference Consortium Human Reference 37/human genome version 19
revised Cambridge Reference Sequence
Reconstructed Sapiens Reference Sequence
Binary Sequence Alignment/Map
Whole Genome Sequencing
Variant Call Format
origin of light-strand replication
Mitochondrial Encephalomyopathy, Lactic Acidosis and Stroke-like episodes
Leber's Hereditary Optic Neuropathy.
The authors would like to thank the 1000 Genomes Consortium for the availability of data, and in particular Dr. L. Clarke for useful explanations. The authors are thankful to Dr. E. Picardi, Prof. G. Pesole, Dr. V. Colonna for helpful discussions about the data and to undergraduate students C. Guttà, A. Lombardi and A. Sebastiani, for their contribution.
This work was partly supported by the Italian Ministry of University and Research (MIUR) grants FIRB Futuro in Ricerca 2008 to GG and University of Bari funds (code ATTPRIN2009) to MA.
Publication of this article was funded in part the corresponding author (MA).
This article has been published as part of BMC Genomics Volume 15 Supplement 5, 2014: Italian Society of Bioinformatics (BITS): Annual Meeting 2013: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S3
- He Y, Wu J, Dressman DC, Iacobuzio-Donahue C, Markowitz SD, Velculescu VE, Diaz LA, Kinzler KW, Vogelstein B, Papadopoulos N: Heteroplasmic mitochondrial DNA mutations in normal and tumour cells. Nature. 2010, 464 (7288): 610-614.PubMed CentralView ArticlePubMedGoogle Scholar
- Tang S, Huang T: Characterization of mitochondrial DNA heteroplasmy using a parallel sequencing system. Biotechniques. 2010, 48 (4): 287-296.View ArticlePubMedGoogle Scholar
- Zaragoza MV, Fass J, Diegoli M, Lin D, Arbustini E: Mitochondrial DNA variant discovery and evaluation in human Cardiomyopathies through next-generation sequencing. PLoS One. 2010, 5 (8): e12295-PubMed CentralView ArticlePubMedGoogle Scholar
- Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J: Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011, 12 (11): 745-755.View ArticlePubMedGoogle Scholar
- Pesole G, Allen JF, Lane N, Martin W, Rand DM, Schatz G, Saccone C: The neglected genome. EMBO Rep. 2012, 13 (6): 473-474.PubMed CentralView ArticlePubMedGoogle Scholar
- Picardi E, Pesole G: Mitochondrial genomes gleaned from human whole-exome sequencing. Nat Methods. 2012, 9 (6): 523-524.View ArticlePubMedGoogle Scholar
- Simone D, Calabrese FM, Lang M, Gasparre G, Attimonelli M: The reference human nuclear mitochondrial sequences compilation validated and implemented on the UCSC genome browser. BMC Genomics. 2011, 12: 517-PubMed CentralView ArticlePubMedGoogle Scholar
- Rossignol R, Faustin B, Rocher C, Malgat M, Mazat JP, Letellier T: Mitochondrial threshold effects. Biochem J. 2003, 370 (Pt 3): 751-762.PubMed CentralView ArticlePubMedGoogle Scholar
- Gasparre G, Kurelac I, Capristo M, Iommarini L, Ghelli A, Ceccarelli C, Nicoletti G, Nanni P, De Giovanni C, Scotlandi K, et al: A mutation threshold distinguishes the antitumorigenic effects of the mitochondrial gene MTND1, an oncojanus function. Cancer Res. 2011, 71 (19): 6220-6229.View ArticlePubMedGoogle Scholar
- Chinnery PF, Hudson G: Mitochondrial genetics. Br Med Bull. 2013, 106: 135-159.PubMed CentralView ArticlePubMedGoogle Scholar
- Taylor RW, Turnbull DM: Mitochondrial DNA mutations in human disease. Nat Rev Genet. 2005, 6 (5): 389-402.PubMed CentralView ArticlePubMedGoogle Scholar
- Wallace DC: Mitochondrial diseases in man and mouse. Science. 1999, 283 (5407): 1482-1488.View ArticlePubMedGoogle Scholar
- Sekiguchi K, Sato H, Kasai K: Mitochondrial DNA heteroplasmy among hairs from single individuals. J Forensic Sci. 2004, 49 (5): 986-991.View ArticlePubMedGoogle Scholar
- Lee HS, Ma H, Juanes RC, Tachibana M, Sparman M, Woodward J, Ramsey C, Xu J, Kang EJ, Amato P, et al: Rapid mitochondrial DNA segregation in primate preimplantation embryos precedes somatic and germline bottleneck. Cell Rep. 2012, 1 (5): 506-515.PubMed CentralView ArticlePubMedGoogle Scholar
- Comas D, Paabo S, Bertranpetit J: Heteroplasmy in the control region of human mitochondrial DNA. Genome Res. 1995, 5 (1): 89-90.View ArticlePubMedGoogle Scholar
- Payne BA, Wilson IJ, Yu-Wai-Man P, Coxhead J, Deehan D, Horvath R, Taylor RW, Samuels DC, Santibanez-Koref M, Chinnery PF: Universal heteroplasmy of human mitochondrial DNA. Hum Mol Genet. 2013, 22 (2): 384-390.PubMed CentralView ArticlePubMedGoogle Scholar
- Ruiz-Pesini E, Lott MT, Procaccio V, Poole JC, Brandon MC, Mishmar D, Yi C, Kreuziger J, Baldi P, Wallace DC: An enhanced MITOMAP with a global mtDNA mutational phylogeny. Nucleic Acids Res. 2007, 35 (Database): D823-828.PubMed CentralView ArticlePubMedGoogle Scholar
- Rubino F, Piredda R, Calabrese FM, Simone D, Lang M, Calabrese C, Petruzzella V, Tommaseo-Ponzetta M, Gasparre G, Attimonelli M: HmtDB, a genomic resource for mitochondrion-based human variability studies. Nucleic Acids Res. 2012, 40 (Database): D1150-1159.PubMed CentralView ArticlePubMedGoogle Scholar
- 1000 Genomes Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491 (7422): 56-65.View ArticleGoogle Scholar
- Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, Mooney SD, Radivojac P: Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009, 25 (21): 2744-2750.PubMed CentralView ArticlePubMedGoogle Scholar
- Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nat Methods. 2010, 7 (4): 248-249.PubMed CentralView ArticlePubMedGoogle Scholar
- Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R: Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat. 2009, 30 (8): 1237-1244.View ArticlePubMedGoogle Scholar
- Londin ER, Keller MA, D'Andrea MR, Delgrosso K, Ertel A, Surrey S, Fortina P: Whole-exome sequencing of DNA from peripheral blood mononuclear cells (PBMC) and EBV-transformed lymphocytes from the same donor. BMC Genomics. 2011, 12: 464-PubMed CentralView ArticlePubMedGoogle Scholar
- Schafer CM, Campbell NG, Cai G, Yu F, Makarov V, Yoon S, Daly MJ, Gibbs RA, Schellenberg GD, Devlin B, et al: Whole exome sequencing reveals minimal differences between cell line and whole blood derived DNA. Genomics. 2013, 102 (4): 270-277.View ArticlePubMedGoogle Scholar
- Nickles D, Madireddy L, Yang S, Khankhanian P, Lincoln S, Hauser SL, Oksenberg JR, Baranzini SE: In depth comparison of an individual's DNA and its lymphoblastoid cell line using whole genome sequencing. BMC Genomics. 2012, 13: 477-PubMed CentralView ArticlePubMedGoogle Scholar
- 1000 Genomes. [http://www.1000genomes.org/]
- 1000 Genomes FTP Site. [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment_indices/20120522.exome.alignment.index]
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760.PubMed CentralView ArticlePubMedGoogle Scholar
- Picard tools. [http://picard.sourceforge.net/index.shtml]
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079.PubMed CentralView ArticlePubMedGoogle Scholar
- GMAP/GSNAP. [http://research-pub.gene.com/gmap/]
- Python. [http://www.python.org/]
- HmtDB. [http://www.hmtdb.uniba.it:8080/hmdb/]
- Behar DM, van Oven M, Rosset S, Metspalu M, Loogvali EL, Silva NM, Kivisild T, Torroni A, Villems R: A "Copernican" reassessment of the human mitochondrial DNA tree from its root. Am J Hum Genet. 2012, 90 (4): 675-684.PubMed CentralView ArticlePubMedGoogle Scholar
- van Oven M, Kayser M: Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat. 2009, 30 (2): E386-394.View ArticlePubMedGoogle Scholar
- MITOMAP - MutationsCodingControl. [http://www.mitomap.org/bin/view.pl/MITOMAP/MutationsCodingControl]
- Pereira L, Soares P, Radivojac P, Li B, Samuels DC: Comparing phylogeny and the predicted pathogenicity of protein variations reveals equal purifying selection across the global human mtDNA diversity. Am J Hum Genet. 2011, 88 (4): 433-439.PubMed CentralView ArticlePubMedGoogle Scholar
- 1000 Genomes - Samples included in the Project. [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info]
- 1000 Genomes - Data. [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/]
- 1000 Genomes - Integrated Call Sets. [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets]
- Bandelt HJ, Parson W: Consistent treatment of length variants in the human mtDNA control region: a reappraisal. Int J Legal Med. 2008, 122 (1): 11-21.View ArticlePubMedGoogle Scholar
- Gasparre G, Iommarini L, Porcelli AM, Lang M, Ferri GG, Kurelac I, Zuntini R, Mariani E, Pennisi LF, Pasquini E, et al: An inherited mitochondrial DNA disruptive mutation shifts to homoplasmy in oncocytic tumor cells. Hum Mutat. 2009, 30 (3): 391-396.View ArticlePubMedGoogle Scholar
- Samuels DC, Schon EA, Chinnery PF: Two direct repeats cause most human mtDNA deletions. Trends Genet. 2004, 20 (9): 393-398.View ArticlePubMedGoogle Scholar
- Mita S, Rizzuto R, Moraes CT, Shanske S, Arnaudo E, Fabrizi GM, Koga Y, DiMauro S, Schon EA: Recombination via flanking direct repeats is a major cause of large-scale deletions of human mitochondrial DNA. Nucleic Acids Res. 1990, 18 (3): 561-567.PubMed CentralView ArticlePubMedGoogle Scholar
- Sosa MX, Sivakumar IK, Maragh S, Veeramachaneni V, Hariharan R, Parulekar M, Fredrikson KM, Harkins TT, Lin J, Feldman AB, et al: Next-generation sequencing of human mitochondrial reference genomes uncovers high heteroplasmy frequency. PLoS Comput Biol. 2012, 8 (10): e1002737-PubMed CentralView ArticlePubMedGoogle Scholar
- Li M, Schonberg A, Schaefer M, Schroeder R, Nasidze I, Stoneking M: Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes. Am J Hum Genet. 2010, 87 (2): 237-249.PubMed CentralView ArticlePubMedGoogle Scholar
- Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, et al: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461 (7261): 272-276.PubMed CentralView ArticlePubMedGoogle Scholar
- Yi X, Liang Y, Huerta-Sanchez E, Jin X, Cuo ZX, Pool JE, Xu X, Jiang H, Vinckenbosch N, Korneliussen TS, et al: Sequencing of 50 human exomes reveals adaptation to high altitude. Science. 2010, 329 (5987): 75-78.PubMed CentralView ArticlePubMedGoogle Scholar
- Falk MJ, Pierce EA, Consugar M, Xie MH, Guadalupe M, Hardy O, Rappaport EF, Wallace DC, LeProust E, Gai X: Mitochondrial disease genetic diagnostics: optimized whole-exome analysis for all MitoCarta nuclear genes and the mitochondrial genome. Discov Med. 2012, 14 (79): 389-399.PubMed CentralPubMedGoogle Scholar
- Xue Y, Chen Y, Ayub Q, Huang N, Ball EV, Mort M, Phillips AD, Shaw K, Stenson PD, Cooper DN, et al: Deleterious- and disease-allele prevalence in healthy individuals: insights from current predictions, mutation databases, and population-scale resequencing. Am J Hum Genet. 2012, 91 (6): 1022-1032.PubMed CentralView ArticlePubMedGoogle Scholar
- Marth GT, Yu F, Indap AR, Garimella K, Gravel S, Leong WF, Tyler-Smith C, Bainbridge M, Blackwell T, Zheng-Bradley X, et al: The functional spectrum of low-frequency coding variation. Genome Biol. 2011, 12 (9): R84.-PubMed CentralView ArticlePubMedGoogle Scholar
- Conrad DF, Keebler JE, DePristo MA, Lindsay SJ, Zhang Y, Casals F, Idaghdour Y, Hartl CL, Torroja C, Garimella KV, et al: Variation in genome-wide mutation rates within and between human families. Nat Genet. 2011, 43 (7): 712-714.PubMed CentralView ArticlePubMedGoogle Scholar
- Kurelac I, Lang M, Zuntini R, Calabrese C, Simone D, Vicario S, Santamaria M, Attimonelli M, Romeo G, Gasparre G: Searching for a needle in the haystack: comparing six methods to evaluate heteroplasmy in difficult sequence context. Biotechnol Adv. 2012, 30 (1): 363-371.View ArticlePubMedGoogle Scholar
- Canter JA, Kallianpur AR, Parl FF, Millikan RC: Mitochondrial DNA G10398A polymorphism and invasive breast cancer in African-American women. Cancer Res. 2005, 65 (17): 8028-8033.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.