Molecular identification and evolutionary relationships between the subspecies of Musa by DNA barcodes

Background: The diversity of the banana (Musa sp., AAA) genome is constantly increasing owing to high-frequency somaclonal variation. Because of this large diversity, conventional numerical and morphology-based taxonomic identification of banana cultivars is laborious and difficult, and is often a subject of disagreement. Results: In this study, we used the ITS2 region to identify banana cultivars and varieties and to determine the genetic relationships between them. A total of 16 banana cultivars were PCR amplified using the ITS2 region. In addition, 321 sequences were retrieved from GenBank, USA, and used for the study. The sequences were aligned using Clustal W and genetic distances were computed using MEGA V5.1. A significant divergence between the intra- and inter-specific genetic distances of the ITS2 region was observed, and the presence of a DNA barcoding gap was obvious. Based on the BLAST1 and distance methods, the results showed that the ITS2 region can successfully identify and distinguish the cultivars and varieties of banana. ITS2 also revealed the relationships between the cultivars and varieties of banana. Conclusion: This study shows that ITS2 is not only an efficient DNA barcode for identifying banana species but also a potential candidate for studying phylogenetic relationships between subspecies and cultivars. This is the first comprehensive study to categorically distinguish the economically important banana subspecies and varieties using DNA barcodes and to examine their evolutionary relationships.


Background
Banana and plantain belong to the family Musaceae and are cultivated throughout the tropical and subtropical regions of the world [1]. Banana is an important food crop, next to rice, wheat and corn [2]. The edible Musa species and their hybrids and polyploids originated from two main wild species, Musa acuminata Colla and M. balbisiana Colla, with A and B genomes respectively [3]. The major cultivars belong to the subgroups Cavendish (AAA), Lujugira (AAA), Figue Pomme (AAB), Plantain (AAB), Saba Bluggoe (ABB) and Sucrier (AA) [4]. It is a staple edible fruit crop and a good source of potassium and magnesium, which provide health benefits such as maintaining normal blood pressure and protecting the heart [5]. The genome is continuously expanding owing to a high frequency of somaclonal variation, which increases diversity and often makes classification a subject of disagreement [6]. Nine subspecies, banksii, burmannica, burmannicoides, errans, malaccensis, microcarpa, siamea, truncata and zebrina, are identified under M. acuminata. M. balbisiana exhibits a wide range of morphological characters. The species M. schizocarpa and M. textilis, which are endemic to Papua New Guinea, do not show apparent morphological diversification [7]. The species Musa nagensium remained unnoticed by botanists, and no collections were made from North-East India for more than a century. The authors of [8] misidentified M. cheesmanii N.W. Simmonds as M. nagensium and provided a photograph of the former. M. nagensium was later rediscovered and described in detail by both [9] and [10]. Similarly, an AAA genome that showed similarity to the ABB genome was misidentified [11]. Genetic diversity was observed in the pool of M. acuminata species by using primers for highly repetitive sequences and tandem repeats [12] [13]. The total number of banana and plantain cultivars has been estimated at around 300-1000, and their names are highly confusing, even within a country [14].
Cultivated bananas differ from their wild relatives: they are multiplied through vegetative propagation and exhibit a high level of morphological diversification. The classification of [7] has been found to be clear and coherent and is widely accepted. Although the cultivated banana has socioeconomic importance, genetic studies are limited by polyploidy, parthenocarpy and the complexity of sample collection. Moreover, correct identification of Musa cultivars is crucial for their utilization and, importantly, for the conservation of genetic resources. Traditional methods of identifying Musa cultivars rely mainly on morphological characters [15], but these are often affected by environmental and developmental factors. Phenotypic classification and the genetic relationships among genotypes are still under debate [16]. Even molecular markers such as RAPD [17] did not have sufficient discriminating power to classify the nine genotypes of Musa [18]. Genomic in situ hybridisation (GISH) has not been found suitable for high-throughput screening of large breeding populations [19]. DNA markers have, however, been used to identify dwarf Cavendish derived from micropropagation [20], and fingerprinting can detect the parental genotypes within progeny populations during hybridisation [21]. Plastid subtype identity (PS-ID) sequence analysis was carried out to show the possible maternal relationships among Musa sp. [22]. PCR-RFLP markers of the ribosomal internal transcribed spacers (ITS) were used to determine the genome of Musa accessions and hybrids at the nursery stage [23]. Therefore, a simple and accurate identification method is much needed for determining the genetic variation between the cultivars of Musa species.
DNA barcoding is a recent technique that uses a short, standardised DNA fragment to discriminate specimens at the species level [24] [25] [26]. Many disputed species have been correctly identified in this way [27]. Herbal products have also been authenticated through DNA barcodes [28] [29]. Some studies have also reported the use of DNA barcodes for identifying herbarium samples [30], intra-specific ecotypes [31] and ornamental species for the horticultural industry [32]. Hence, DNA barcoding has become an efficient identification tool with discriminating power at the species level [25]. The chloroplast DNA sequences matK, rbcL, psbA-trnH and atpF-atpH and the internal transcribed spacer (ITS) region of nuclear ribosomal DNA have been proposed as potential plant barcodes [33]. The internal transcribed spacer 2 (ITS2) is located between the ribosomal 5.8S and 28S genes; it is actively involved in regulating the transcription of active ribosomal subunits and is essential for pre-rRNA processing [34]. Because the flanking regions are conserved, it is easy to design universal primers, and PCR amplification followed by DNA sequencing of the amplicons reveals variability that can be used to distinguish closely related species. Owing to this universality, ITS2 is currently considered a standard barcode for authenticating medicinal plants [35] [36] [37]. Moreover, it has recently also clearly identified different varieties, imported teas [38] and small millet landraces [39].
In this context, DNA barcoding analysis was performed on banana cultivars and wild Musa accessions using the internal transcribed spacer region ITS2, to better understand the origin and domestication of cultivated banana and to resolve the confusion over varietal synonyms.

PCR success rate and DNA sequencing
The amplification and sequencing success rate of the ITS2 region from the sampled specimens of Musa sp. was 100%. The ITS2 sequences used for the analyses were 325-375 bp long, with an average of 345 bp. The mean GC content was 60.3%, with a range of 58.3-69%.
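As an illustration only, length and GC content statistics of this kind can be computed as in the following minimal Python sketch. The function names (unambiguous, gc_content, summarise) are hypothetical, the toy sequences are not data from this study, and it is assumed that gaps and ambiguous bases are excluded from the counts.

```python
from statistics import mean


def unambiguous(seq: str) -> str:
    """Return only the A, C, G and T characters of a sequence (upper-cased)."""
    return "".join(b for b in seq.upper() if b in "ACGT")


def gc_content(seq: str) -> float:
    """GC percentage over unambiguous bases (assumption: gaps and N are ignored)."""
    bases = unambiguous(seq)
    if not bases:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def summarise(seqs):
    """Length and GC summary for a collection of sequences."""
    lengths = [len(unambiguous(s)) for s in seqs]
    gcs = [gc_content(s) for s in seqs]
    return {
        "min_len": min(lengths),
        "max_len": max(lengths),
        "mean_len": mean(lengths),
        "min_gc": min(gcs),
        "max_gc": max(gcs),
        "mean_gc": mean(gcs),
    }


if __name__ == "__main__":
    # Toy sequences for demonstration, not ITS2 data from this study.
    print(summarise(["ATGC", "GGCCAT"]))
```

Applied to the trimmed ITS2 sequences, a summary of this form would give the length range, mean length and GC range reported above.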

Genetic diversity
Genetic divergences were estimated using six metrics: average inter-specific distance, minimum inter-specific distance, theta prime, average intra-specific distance, coalescent depth and theta. The ITS2 region exhibited significant divergence at the inter-specific level (Table 2), including at the level of cultivars and varieties. At the intra-specific level, relatively lower divergences were observed for all the corresponding metrics.
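The comparison of intra- and inter-specific distances can be sketched as follows, assuming a symmetric pairwise distance matrix and one species label per sequence. This is a simplified illustration, not the procedure implemented in MEGA; the function names and the toy matrix are hypothetical, and theta, theta prime and coalescent depth are deliberately left out because their exact definitions are given in the cited references rather than here.

```python
from itertools import combinations
from statistics import mean


def split_distances(labels, dist):
    """Split pairwise distances into intra-specific and inter-specific lists."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        target = intra if labels[i] == labels[j] else inter
        target.append(dist[i][j])
    return intra, inter


def divergence_summary(labels, dist):
    """Summarise both distance classes and their overlap."""
    intra, inter = split_distances(labels, dist)
    if not intra or not inter:
        raise ValueError("need at least one intra- and one inter-specific pair")
    max_intra = max(intra)
    return {
        "mean_intra": mean(intra),
        "max_intra": max_intra,
        "mean_inter": mean(inter),
        "min_inter": min(inter),
        # Share of inter-specific distances that fall below the largest
        # intra-specific distance; 0 means a clean barcoding gap.
        "inter_below_max_intra": sum(d < max_intra for d in inter) / len(inter),
    }


if __name__ == "__main__":
    # Toy example: two species with two sequences each (made-up distances).
    labels = ["sp_a", "sp_a", "sp_b", "sp_b"]
    dist = [
        [0.00, 0.01, 0.10, 0.12],
        [0.01, 0.00, 0.11, 0.13],
        [0.10, 0.11, 0.00, 0.02],
        [0.12, 0.13, 0.02, 0.00],
    ]
    print(divergence_summary(labels, dist))
```

In the toy matrix the smallest inter-specific distance (0.10) exceeds the largest intra-specific distance (0.02), so no inter-specific distance overlaps the intra-specific range.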

Assessment of barcoding gap
Inter-specific versus intra-specific divergence was analysed by examining the distribution of genetic distances at a scale of 0.008 distance units. Only a slight overlap between inter- and intra-specific variation was observed (Fig 1). The inter-specific distance ranged from 0.002 to 0.184; it equalled 0.002 in only 0.26% of cases, and the proportion of inter-specific genetic distances < 0.135 was about 8.33%. The intra-specific distance ranged from 0.000 to 0.135, and most Musa species with more than two samples in our study had a unique sequence (58.93%) in the ITS2 region. The results indicate an evident barcoding gap between inter- and intra-specific divergence; thus, ITS2 provides a useful region for authenticating different Musa species.

Efficacy of ITS2 for authentication
ITS2 showed identification success rates of 97.7% and 95.8% at the species level for 345 samples of Musa using BLAST1 and the nearest genetic distance method, respectively. Nearly 15 cultivars and wild species were identified, as shown in Table 3. Thus, the ITS2 region exhibited high identification efficiency.

Sequence analysis and species discrimination
ITS2 sequences were collected and evaluated using MEGA (Fig 2). Over 95.6% of species had larger inter- than intra-specific diversity; therefore, species boundaries were relatively clear for the ITS2 sequences. Only two taxa were exceptions: M. schizocarpa and M. acuminata x M. textilis showed very low variability, of about 0.035%. The ITS2 region showed a higher number of polymorphic sites, representing higher genetic diversity among the subspecies and cultivars of Musa. Unique haplotypes of Musa species and subspecies were identified using the restriction enzymes MseI, PstI and AvaII, as shown in Table 5.

Nucleotide polymorphism and neutrality tests
DNA polymorphism analyses showed rich genomic variation in the Musa accessions, with a total of 112 polymorphic sites in the A genome and 33 in the B genome of cultivated bananas. Nucleotide diversity (π and θ) was estimated for all cultivated and wild Musa accessions, for silent, nonsynonymous and total sites independently. Summaries of the nucleotide diversity data for the two ITS2 regions are given in Table 4. Reduced levels of polymorphism emerged as a general property of cultivated bananas relative to their wild progenitors. The subspecies have slightly higher nucleotide diversity than the wild and cultivated species. These findings suggest that the cultivars did not undergo any severe genetic bottleneck during the initial domestication process. The triploid AAA and AAB genome groups also hold high levels of nucleotide diversity, indicating that historical population sizes were large. The ABB genome of cultivated banana shows higher nucleotide diversity than that of M. balbisiana (Table 4). Nucleotide diversity at non-synonymous sites of the ITS2 region was reduced in the A genome of wild species (represented as ps in Table 4). No polymorphic sites were observed within the cultivars and subspecies. However, the genetic diversity of the AAA genome was 4-6 fold higher than that of the A genome cultivars. Additionally, the patterns of nucleotide variation in the ITS2 region were examined for deviation from neutral equilibrium evolution using Tajima's neutrality (D) test. No significant departure from the neutral model was observed.

Phylogenetic analysis
The morphological classification of Musa species is based on [11] and [41]. To analyse the phylogenetic relationship of Musa cultivars with wild species, nearly 103 species were studied using the Neighbour Joining (NJ) method (Sup Fig 1). The Musa species in this study include 60 cultivars, 5 wild species and 9 subspecies, as shown in Table 3. Among the 103 sequences, 31 species were taken as representatives for the comparative analysis of cultivar and wild samples from the laboratory source with subspecies and hybrids from GenBank (Fig 3). The phylogenetic tree (Fig 3) consists of three main clades: A, B and C.

Data analysis using restriction enzymes
The restriction enzymes MseI and AvaII provide the best discriminatory power for differentiating the haplotypes of Musa species using ITS2 sequences. MseI has a single restriction site in 11 genomes of wild and cultivated species and two restriction sites, at different positions, in 3 genomes, whereas AvaII has one, two or three restriction sites.
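An in silico prediction of restriction patterns of this kind can be sketched in Python as below. The recognition sequences used (TTAA for MseI, GGWCC with W = A or T for AvaII) are the standard ones for these enzymes; the function names and the example sequence are hypothetical, and the positions reported are the 0-based starts of recognition sites, not the exact cut positions.

```python
import re

# Recognition sequences written as regular expressions.
# AvaII recognises GGWCC, where W stands for A or T.
SITES = {
    "MseI": "TTAA",
    "AvaII": "GG[AT]CC",
}


def site_positions(seq: str, enzyme: str):
    """Return 0-based start positions of every recognition site.

    A lookahead is used so that overlapping sites are also reported.
    """
    pattern = re.compile(f"(?={SITES[enzyme]})")
    return [m.start() for m in pattern.finditer(seq.upper())]


def restriction_profile(seq: str):
    """Number of recognition sites per enzyme for one sequence."""
    return {enzyme: len(site_positions(seq, enzyme)) for enzyme in SITES}


if __name__ == "__main__":
    # Made-up sequence for demonstration, not an ITS2 sequence from this study.
    example = "ACGTTAACCGGACCTTGGTCCA"
    print(site_positions(example, "MseI"))
    print(site_positions(example, "AvaII"))
    print(restriction_profile(example))
```

Sequences with different site counts or site positions would give different predicted fragment patterns, which is the basis for distinguishing haplotypes.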

Discussion
A common problem for banana researchers and horticulturists in South-East Asia is the presence of numerous cultivar names and synonyms in the different languages of the region. Better knowledge of synonyms may promote banana trade and commerce. A rapid and reliable method for species and cultivar recognition is vital for certifying the fruits and plantlets of Musa sp. and for preserving banana germplasm resources. To our knowledge, this is the first time that DNA barcoding has been used to identify cultivars of Musa using a large sample size. An ideal DNA barcode should show high inter-specific but low intra-specific divergence in order to discriminate different species [26], as in previous studies of Colletotrichum isolates [61] and of cultivars of fig [43], grapevine [44] and pineapple [45]. In this study, ITS2 possessed a sufficiently variable region between the cultivars to determine genetic divergence with high discriminatory ability. Morphological characters resolve M. acuminata populations only in highland regions; in lowland regions they overlap and no distinguishing patterns are observed [47]. PCR-RFLP of the ITS region using the RsaI restriction endonuclease was applied to 68 banana accessions and showed consistent polymorphic DNA banding patterns between the wild and cultivated forms of M. acuminata [48]. Similarly, acceptable structural diversity and molecular phylogeny were observed when the ITS1-5.8S-ITS2 region was used for species of Musaceae [49]. On the basis of this study, we propose that ITS2 can be an ideal DNA barcode candidate for Musa sp.; however, it should be noted that the phylogeny of the Musaceae remains controversial. The taxonomic position of the species M. beccarii remains uncertain and that of M. ingensis is still undetermined [50]. Within the present study, the taxonomic assignment of Musa cultivars and the discriminating power of ITS2 were conclusive.
The phylogenetic tree shows three clades, A, B and C. Clade A consists of two clusters: the B genome specimen of M. balbisiana appears closer to M. acuminata subsp. siamea, whereas subsp. burmannica and burmannicoides grouped separately (Fig. 3). The cultivars Red Banana and Robusta, of the AAA genome, were closer to M. acuminata subsp. truncata; with BLAST1 and the distance-based identification method, the cultivar Red Banana matched Musa acuminata subsp. malaccensis at 99% and 99.1%, respectively.

Conclusion
In summary, our study clearly demonstrated that ITS2 is an ideal DNA barcode for identifying Musa subspecies and cultivars and for reconstructing the phylogeny of the genus Musa. However, more Musa species should be included in future to verify whether these findings hold when closely related taxa are newly added. In conclusion, DNA barcoding provided much useful genetic information about the very complex Musa species, which will be valuable for germplasm management and resource protection.

Plant Materials
From GenBank, 800 sequences were obtained, of which 707 sequences belong to 46 species and 65 sequences to wild isolates. We sequenced 28 Musa samples, of which 12 species were from the Western Ghats region of India; these were submitted to GenBank, their accession numbers are given in Table 1 and their images are shown in supplementary Figure 1. Nearly 46 annotated species and subspecies from GenBank were used for the study, as shown in supplementary Table 1.

DNA Extraction, Amplification and Sequencing
Fresh, young leaves of the sampled specimens were collected and genomic DNA was isolated following [51]. The ITS2 region was amplified using the following pair of universal primers: ITS-2F, 5'-ATGCGATACTTGGTGTGAAT-3', and ITS-3R, 5'-GACGCTCTCCAGACTACAAT-3'. Primers were synthesised by Integrated DNA Technologies, USA. PCR was carried out in a 25 µL volume containing 1X PCR buffer, 2.5 mM Mg2+, 0.4 mM dNTPs, 0.5 µM of each primer, 1 U Taq DNA polymerase (Genei, India) and 30 ng genomic DNA template. Amplification was performed in a Gradient Master Cycler (Eppendorf, Germany) with the following program: 94°C for 4 min, followed by 35 cycles of 94°C for 45 s, 56°C for 45 s and 72°C for 1.5 min, and a final extension at 72°C for 10 min. The PCR products were sequenced on an ABI-3130 Genetic Analyzer (Bioserve, India).

Sequence and genetic relationship analysis
The original sequences were analysed using MEGA [52] (Tamura et al. 2007). The ITS2 sequences were subjected to Hidden Markov Model analysis to remove the conserved 5.8S and 28S DNA sequences [26]. The ITS2 sequences were aligned using Clustal W [53] and the genetic distances were computed using MEGA 5.1 according to the Kimura 2-parameter (K2P) model [52]. The average inter-specific distance, the minimum inter-specific distance and theta prime were used to represent inter-specific divergence under the K2P model [25] [35]. The average intra-specific distance, coalescent depth and theta were calculated to evaluate intra-specific variation [26]. The distributions of inter- versus intra-specific variability were compared using DNA barcoding gaps [54]. Wilcoxon two-sample tests were performed as described previously [55]. Two methods of species identification, BLAST1 and the nearest distance method, were used to evaluate the species authentication efficacy [56]. The ITS2 sequences of the Musa species in this study were used as query sequences. The BLAST program (http://blast.ncbi.nlm.nih.gov/Blast.cgi) was used to search the reference database for each query sequence. In the nearest distance method, correct identification means that the hit in our database with the smallest genetic distance is from the same species as the query. Ambiguous identification means that several hits from our database have the same smallest genetic distance to the query sequence. Incorrect identification means that the hit with the smallest genetic distance is not from the expected species [35]. The discriminatory power of the ITS2 sequences was calculated using MEGA.
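The K2P distance and the nearest distance classification described above can be sketched in Python as follows. This is an independent illustration, not the MEGA implementation: the function names are hypothetical, sites with gaps or ambiguous bases are assumed to be skipped pair by pair, and a tie is assumed to be "ambiguous" only when the tied hits belong to more than one species. The distance uses the standard K2P formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a: str, b: str) -> float:
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: skip gaps and ambiguous bases for this pair
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def identify(query: str, expected_species: str, reference) -> str:
    """Classify a query as 'correct', 'ambiguous' or 'incorrect'.

    reference is a list of (species, aligned_sequence) pairs.
    """
    dists = [(k2p_distance(query, seq), species) for species, seq in reference]
    best = min(d for d, _ in dists)
    tied = {sp for d, sp in dists if math.isclose(d, best, abs_tol=1e-12)}
    if len(tied) > 1:
        return "ambiguous"
    return "correct" if tied == {expected_species} else "incorrect"


if __name__ == "__main__":
    # Made-up reference sequences, not data from this study.
    reference = [("sp_a", "AAAAAAAAAA"), ("sp_b", "AAAAAGGGGG")]
    print(identify("AAAAAAAAAG", "sp_a", reference))
```

The identification success rate for a method is then the share of query sequences classified as correct.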
To identify the wild parents of cultivated bananas, phylogenetic inference by the Neighbour Joining (NJ) method was carried out in MEGA version 5.1 [52], using Kimura's 2-parameter distances [57] (Kimura 1980). Gaps were treated as missing data, and bootstrap values for the NJ trees were obtained from 1000 replicates. We evaluated nucleotide diversity overall and separately for the AAA, AAB and AA groups, wild species, subspecies and cultivars. Genetic analysis of sequence polymorphism was performed using MEGA. The number of segregating sites (S), the number of haplotypes (H) and Tajima's D were determined [58]. In addition, we surveyed nucleotide diversity (π) [59] and theta (θ) [60] for total, silent and nonsynonymous sites independently; insertions/deletions (indels) were not included in this analysis.
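For readers who wish to reproduce the polymorphism statistics outside MEGA, the quantities S, π, Watterson's θ and Tajima's D can be computed as in the following sketch. It is an assumption-laden illustration rather than the analysis pipeline of this study: the function names are hypothetical, alignment columns containing a gap or ambiguous base in any sequence are dropped (in keeping with the exclusion of indels), π and θ are reported per site, and D is returned as None when it is undefined (no segregating sites, or fewer than four sequences).

```python
from itertools import combinations
from math import sqrt


def clean_columns(alignment):
    """Keep only columns in which every sequence has A, C, G or T."""
    columns = zip(*(s.upper() for s in alignment))
    return [col for col in columns if all(b in "ACGT" for b in col)]


def diversity_stats(alignment):
    """Return S, per-site pi, per-site Watterson theta and Tajima's D."""
    n = len(alignment)
    if n < 2:
        raise ValueError("need at least two sequences")
    cols = clean_columns(alignment)
    if not cols:
        raise ValueError("no usable alignment columns")
    length = len(cols)
    s_sites = sum(1 for col in cols if len(set(col)) > 1)
    pairs = list(combinations(range(n), 2))
    # k: mean number of pairwise differences over all sequence pairs
    k = sum(sum(col[i] != col[j] for col in cols) for i, j in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    tajima_d = None
    if s_sites > 0 and n >= 4:
        b1 = (n + 1) / (3 * (n - 1))
        b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
        e1 = c1 / a1
        e2 = c2 / (a1 ** 2 + a2)
        variance = e1 * s_sites + e2 * s_sites * (s_sites - 1)
        if variance > 0:
            tajima_d = (k - s_sites / a1) / sqrt(variance)
    return {
        "S": s_sites,
        "pi": k / length,
        "theta_w": s_sites / (a1 * length),
        "tajima_d": tajima_d,
    }


if __name__ == "__main__":
    # Made-up alignment for demonstration; the gap column is excluded.
    toy = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "AAAA-AAATT"]
    print(diversity_stats(toy))
```

A value of D close to zero, as reported in the Results, is what is expected under neutral equilibrium; assessing significance requires the critical values from the cited reference, which this sketch does not compute.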

Data analysis using restriction enzymes
ITS2 sequence data of 15 specimens were aligned and restriction patterns were predicted.

Figure 1 The relative distribution of inter-specific divergence between congeneric Musa species and intra-specific variation (p < 0.001)

Figure 3 Neighbour Joining (NJ) tree for Musa accessions using the ITS2 region. Numbers are bootstrap percentages above 50%. Capital letters following each accession name indicate the previously recognized genome composition of the cultivar. The appearance of an accession more than once represents distinct sequences cloned from the same cultivar. Red indicates wild species, green indicates subspecies and blue indicates cultivars.