Development of simple sequence repeat (SSR) markers from a genome survey of Chinese bayberry (Myrica rubra)

Background Chinese bayberry (Myrica rubra Sieb. and Zucc.) is a subtropical evergreen tree originating in China. It has been cultivated in southern China for several thousand years, and annual production has reached 1.1 million tons. The taste and high level of health promoting characters identified in the fruit in recent years has stimulated its extension in China and introduction to Australia. A limited number of co-dominant markers have been developed and applied in genetic diversity and identity studies. Here we report, for the first time, a survey of whole genome shotgun data to develop a large number of simple sequence repeat (SSR) markers to analyse the genetic diversity of the common cultivated Chinese bayberry and the relationship with three other Myrica species. Results The whole genome shotgun survey of Chinese bayberry produced 9.01Gb of sequence data, about 26x coverage of the estimated genome size of 323 Mb. The genome sequences were highly heterozygous, but with little duplication. From the initial assembled scaffold covering 255 Mb sequence data, 28,602 SSRs (≥5 repeats) were identified. Dinucleotide was the most common repeat motif with a frequency of 84.73%, followed by 13.78% trinucleotide, 1.34% tetranucleotide, 0.12% pentanucleotide and 0.04% hexanucleotide. From 600 primer pairs, 186 polymorphic SSRs were developed. Of these, 158 were used to screen 29 Chinese bayberry accessions and three other Myrica species: 91.14%, 89.87% and 46.84% SSRs could be used in Myrica adenophora, Myrica nana and Myrica cerifera, respectively. The UPGMA dendrogram tree showed that cultivated Myrica rubra is closely related to Myrica adenophora and Myrica nana, originating in southwest China, and very distantly related to Myrica cerifera, originating in America. These markers can be used in the construction of a linkage map and for genetic diversity studies in Myrica species. Conclusion Myrica rubra has a small genome of about 323 Mb with a high level of heterozygosity. A large number of SSRs were identified, and 158 polymorphic SSR markers developed, 91% of which can be transferred to other Myrica species.

Dongkui, both from the Zhejiang province. Although there are abundant germplasm resources, studies on genetics and breeding of the species are still in their infancy. Molecular marker technology is a popular tool for breeding and genetic research, and with the construction of a genomic library, 13 polymorphic microsatellite loci have been developed in M. rubra [3] and 11 from an expressed sequence tag library [4]. Recently, 12 primer pairs have been temporarily developed by ISSRsuppression PCR [5] with GSG (GT) 6 as the primer for enriching microsatellite sequences. Reports on the genetic diversity in Chinese bayberry using SSR markers have also recently been published [6,7], but the number of markers for Chinese bayberry is limited.
The reproducibility, multiallelicism, co-dominance, relative abundance and good genome coverage of SSR markers have made them one of the most useful tools for genetic diversity and linkage mapping. Genomic SSRs and EST-SSRs, considered complementary to plant genome mapping, have been reported in many fruit crops, such as walnut [8], cherry [9], apricot [10] and coconut [11]. EST-SSRs are useful for genetic analysis, but their relatively low polymorphism and the high possibility of no gene-rich regions in the genome are limitations to their use. In contrast, genomic SSRs are highly polymorphic and tend to be widely distributed throughout the genome, resulting in better map coverage [12].
With genetic maps serving as the basis for future positional gene cloning, making map-based cloning and marker-assisted selection possible, the development of more SSRs is essential. As sequencing technologies advance, whole-genome shotgun (WGS) sequences are becoming increasingly available. These DNA sequences are valuable resources for SSR development in many plant species, such as rice [13] and papaya [14]. In addition, they can be used to evaluate the frequency and distribution of different types of SSRs in the genome, and even help to estimate genome size and characters such as heterozygosis and repeats.
As a way of reducing the cost of genotyping research, Schuelke [15] proposed a method for fluorescent dye labelling of PCR fragments with a sequence-specific forward primer: the universal fluorescent-labelled M13 (-21) primer, at the 5 ' end, acts as the forward primer in a 'one-tube' reaction. As this method allows for highthroughput genetic analyses, with a high number of microsatellite markers widely used, we considered the possibility of using this approach for multiplex PCR, to improve the efficiency and save costs.
In this study, we mined and validated 158 SSR markers and describe their application for understanding the genetic relationship among 29 Chinese bayberry accessions and other Myrica species. These markers are useful for genotyping and genetic diversity analysis and linkage mapping of Myrica rubra and other Myrica species.

Results
Genome survey using whole genome shotgun data in Chinese bayberry WGS generated 273,161 (>100 bp) high quality sequence reads from two DNA libraries (250 bp and 500 bp) of the androphyte individual 'C2010-55'. We used 9.01 G raw data for K-mer analysis and heterozygous simulation. For the 17-mer frequency distribution (Figure 1), the peak of the depth distribution was about 22. The estimated genome size was 323 Mb, using the formula: genome size = k-mer count/peak of the kmer distribution. The minor peak at 1/2 altitude of the main peak indicates the high level of heterozygosity in this genome ( Figure 1). A total of 739,969 contigs were assembled with a total sequence length of 255.7 Mb. The length of N50 was 295 bp in our assembly, and the longest contig and scaffold 7,593 and 127,008 bp, respectively.

Frequency distribution of different types of SSR markers
A total of 17,172 out of 273,161 scaffolds (6%) retrieved from the genome survey sequence contained 28,602 SSRs (Table 1), of which 5,401 contained more than one SSR, and 1,444 SSRs were present in compound format. Among the derived SSR repeats, the dinucleotide was the most abundant repeat, accounting for 84.72% of total SSRs, followed by tri-(13.78%), tetra-(1.34%), penta-(0.12%), and hexa-(0.04%) nucleotides (Table 1). There was a large proportion of both dinucleotides and trinucleotides while the rest amounted to less 2%. The average frequency of occurrence was about 10.47% (Table 1).
The SSR frequency of each motif is presented in Additional file 1. The SSR motif consists of 69 types. Among the repeat motifs of the dinucleotide, the AG/CT repeat was the most common, representing 53.72%, followed by 39.20% AT repeats (Figure 2), and the predominant motifs of trinucleotide (AAG/ CTT and AAT/ATT) repeats accounted for 37.15% and 32.56%, respectively ( Figure 3).

Polymorphism of SSR markers
We first designed and synthesised 600 SSR primer pairs from those scaffolds more than 2Kb long. The majority of SSR loci were dinucleotide repeats (597, 99.5%), and the remainder trinucleotide. Initially, the effectiveness of these primer pairs was detected in two cultivars (Biqi and Dongkui) and M. cerifera through denaturing PAGE (Polyacrylamide gel electrophoresis), and 581 (96.8%) of these were amplified successfully in Biqi and Dongkui, and 400 (66.7%) in M. cerifera. The SSR loci (186, 31%) were identified as heterozygous loci either in Biqi or in Dongkui.
Subsequently, they were used to screen 32 accessions, and detected an average of 8.25 alleles and from 3 to 15 alleles per locus ( Table 2).
The PIC at each locus ranged from 0.256 to 0.883 with an average of 0.67 loci. The PCR product size ranged from 110 to 274 bp. All the primers produced amplicons in agreement with the expected sizes. Most of the SSR primers (139 primer pairs) showed significant deviation from HW equilibrium (P < 0.05). Partial correlation analysis showed that significant positive correlations existed between the repeat unit length and PIC (P < 0.01, r = 0.2747). This showed that these SSRs had high rates of transferability for M. adenophora (91.14%) and M. nana (89.87%) and low rates for M. cerifera (46.84%), indicating that these markers are suitable for genetic diversity analyses in other Myrica species.
One of the objectives of this study was to develop potential suitable SSR markers for genetic mapping using Biqi and Dongkui as crossing parents. We selected 99 heterozygous loci in Biqi and 105 in Dongkui (Table 3): 135 primer pairs can be used together in linkage mapping of the planned F 1 populations between Biqi and Dongkui.

Genetic relationship analysis
The 32 accessions were divided into three groups (A, B and C, Figure 4), based on Nei's genetic distance coefficient [16] using UPGMA cluster analysis. The similarity among all the accessions varied from 0.54 to 0.84. At the species level, the UPGMA dendrogram produced clusters separating M. nana and M. cerifera into two distinct groups. The genetic Figure 1 The distribution of 17-mer depth analysis based on whole genome shotgun data in Chinese bayberry.  similarity between M. cerifera and M. rubra was 0.54, lower than the 0.74 previously reported by Xie [6]. The main cluster ' A' included the subgroups A-1 and A-2. Subgroup A-1 includes 16 accessions, mainly from the cities of Ningbo (12) and Hangzhou (3), where the popular and dominant cultivar is Biqi. This demonstrates that these natural elite seedling selections are truly distinct from the local cultivars. For their genetic relationships (Figure 4), the rare monoecious individual (C2010-4) is closely related to Biqi, while the accessions 'Shuijing' and 'Y2010-72' (both white fruit type) are clearly separated in the cluster, with low genetic distance.
Subgroup A-2 includes 14 accessions, with four from Wenzhou, two from Taizhou, and one each from the cities of Hangzhou and Ningbo, and the Hunan, Guangxi, Guizhou and Jiangsu provinces. This group includes the popular cultivar Dongkui. The four accessions from Wenzhou distributed in this cluster have genetic similarity. The accession 'Tongzimei' from the Hunan province is on an independent branch, showing that it is genetically distinct. 'Xiaolejiangchonghei' and 'M. adenophora' grouped together, and are possibly in the same population. Six androphyte accessions, distributed in group A, are close to cultivars of the same geographic origin.
The accessions 'Myrica nana' from Yunnan and 'Myrica cerifera' from the USA were independently classified as the 'B' and 'C' group, indicating a distant relationship with cultivated Myrica rubra.

Discussion
Our major aims were to find a large set of SSR markers for Myrica rubra and understand the genetic diversity and relationship among representative cultivars, androphyate and related species. This research paves the way for constructing an SSR-based linkage map in Myrica.

The genome characteristics of genus Myrica
Shotgun sequencing is suitable for simple genomes, with no or few repeat sequences, such as Chinese bayberry. For such genomes, the genome can largely be assembled simply by merging together reads containing overlapping sequence [17]. We report the genome survey of Chinese bayberry using whole genome shotgun sequencing. The 17-nucleotide depth distribution suggests a genome size of 323 Mb, larger than peach (220 Mb, http://www.rosaceae.org/peach/genome), but close to our estimate of 250 Mb from flow cytometry using rice as the reference (date not shown). Although the highly homozygous material was selected on a limited number of SSR loci assays, the actual heterozygous rate, as revealed by 185 new SSR markers, was very high (63%). To overcome the key issue of heterozygosity and allow us to generate a high-quality genome sequence, we can use a unique homozygous form such as monoploid, derived using tissue culture or from nature and worth further study.

Marker development for under-utilised fruit crops
SSRs have been widely used for high-throughput genotyping and map construction as they have the advantage of high abundance, random distribution within the genome, high polymorphism information content and stable co-dominance [18][19][20]. They can be developed from either genomic or expressed sequence tag (EST) libraries. Although EST-SSRs are useful for genetic analysis, their disadvantages of relatively low polymorphism and high concentration in gene-rich regions of the genome may limit their usage, especially for the construction of linkage maps [21]. In this study, a total of 600 SSR primer pairs were designed from 28,602 SSR sites, and 581 (96.8%) primer pairs were effective. Due to the self-complementary nature to form dimers, AT/TA is not usually used to develop markers [12]. Our findings are in agreement with that published for peach, where the dinucleotide repeat motifs were also found to be the most common, and (CT) n as the most common repeat unit [22].
In the present study, the mean value of PIC was higher than the previously reported 0.62 [7], but the consistent relationship was observed between SSR polymorphism and repeat unit length. There are some reports of a positive relationship between degree of polymorphism and repeat unit length [23,24]. However, those polymorphic SSRs that are homozygous in both parents cannot be mapped in F 1 populations, although they are useful for mapping in F 2 or backcross populations [25], while heterozygous SSRs can be used for mapping in F 1 populations ( Table 2). The estimated number of SSRs that can be mapped in the F 1 populations between Biqi and Dongkui was about 85%.         Recently, based on mass sequence data of Chinese bayberry obtained by RNA-Seq, 41,239 UniGenes were identified and approximately 80% of the UniGenes (32,805) were annotated, which provides an excellent platform for future EST-SSR development and functional genomic research [26].

High efficient test methods
Normally, a universal M13 primer is labelled with one of a number of fluorescent dyes. The tailed primer provides a complementary sequence to the fluorescent labelled M13 primer, leading to the amplification of fluorescent PCR products, and then the PCR products of different sizes and/or labelled with different fluorescent dyes are mixed and tested [27]. In this research, a multiplex PCR strategy was designed using the universal M13-tailed primer and three additional tail primers, designed arbitrarily, in presumed four-plex amplifications in single PCR, for a major reduction in cost and time. However, as only a few primer combinations were successful, most resulting in weak bands, we did the PCR individually and mixed the PCR products. Further optimisation of multiplex PCR is needed to evaluate its general applicability.

Evolution of Myrica species
In this study, both cultivated species and wild species were analysed and their genetic diversity could easily be differentiated. M. nana and M. cerifera were clearly genetically distant to M. rubra. M. nana, also known as the dwarf or Yunnan arbutus, is indigenous to the Yunnan and Guizhou provinces, and has a plant height of < 2 m. The juvenile period of fruit tree can be shortened for breeding purposes. Studies on embryo culture in vitro of the F 1 seeds of crosses between M. rubra and M. nana, [28], has shown good cross compatibility between M. rubra and M. nana, resulting in 70.5% normal seeds with intact embryo. M. adenophora and M. nana grow as wild trees, with the fruit of M. adenophora only suitable for medical purposes and not edible.
Our findings on the genetic similarity between M. adenophora and M. rubra, which are considered a progenitor-derivative species pair, are consistent with a previously published figure of 0.897 [29]. An earlier study observed little change in allelic diversity along the chronosequence and no evidence for heterosis, although there was a moderate change in genotypic diversity [30]. The markers developed in this study can be very useful in future population structure analysis.

Conclusions
In summary, the genome size of Myrica genus is small, about 320 Mb. A large set of SSRs were developed from a genome survey of Myrica rubra. The results suggest that they have high rates of transferability, making them suitable for use in other Myrica species.

Plant materials and genome survey
We selected an androphyte 'C2010-55' for the genome survey because it was the most homozygous ( (Table 4), were used to evaluate the suitability of the SSRs for genetic distance analysis. Young leaves were collected and frozen in liquid nitrogen prior to genomic DNA extraction using CTAB methods [4]. DNA concentrations were measured spectrophotometrically at 260 nm, and the extracts electrophoresed on 1% agarose to confirm the quality. The purified DNAs were standardised at 40 ng/μl and stored at -40°C.

SSR identification and primer design
We used MISA scripting language (http://pgrc.ipk-gatersleben.de/misa/misa.html) to identify microsatellite repeats in our sequence database. The SSR loci containing perfect repeat units of 2-6 nucleotides only were considered. The minimum SSR length criteria were defined as six reiterations for dinucleotide, and five reiterations for other repeat units. Mononucleotide repeats and complex SSR types were excluded from the study.
The SSR primers were designed using BatchPrimer3 interface modules (http://pgrc.ipk-gatersleben.de/misa/ primer3.html). We selected 600 primers that met the following parameters: 110-230 final product length, primer size from 18 to 22 bp with an optimum size of 20 bp, and the annealing temperature was set at 60°C. The repeat units over eight were used.

Polymerase chain reaction and gel electrophoresis
Each 20 μl reaction mixture contained 10 × PCR buffer (plus Mg2+), 0.2 mM of each dNTP, 5 pmol of each reverse, 4 pmol of the tail primer, 1 pmol of the forward primer, 0.5 units of rTaq polymerase (TaKaRa, China) and 40 ng genomic DNA template. Each primer pair had an interval of 20 bp according to the expected size of amplicons. DNA amplification was in an Eppendorf Mastercycler (Germany) programmed at 94°C for 5 min for initial denaturation, then 32 cycles at 94°C (30 s)/58°C (30 s)/72°C (30 s), followed by 8 cycles of 94°C (30 s)/53°C (30 s)/72°C (30 s). The final extension step was 10 min at 72°C. Each PCR product was run on 1% agarose gel at 110 V for a quality check.
Subsequently, PCR products were electrophoresed on 8% denaturing PAGE, according to Myers et al. [31], at 60 W in a Sequi-Gen GT Nucleic Acid electrophoresis cell (BioRad) for 4 h, depending on the fragment sizes to be separated, and visualised by silver staining [32]. Genotypes showing one and two bands were scored as homozygous and heterozygous, respectively, and the results recorded and photographed.
Multiplex PCR was designed and tested with products of different sizes and labelled with different fluorescent dyes. Each 20 μl reaction mixture contained 10 × PCR buffer (plus Mg 2+ ), 0.8 mM of each dNTP, 1 unit of rTaq polymerase, 40 ng genomic DNA template and a total of four primer pairs with 5 pmol of each reverse primer, 4 pmol of each tail primer, and 1 pmol of each forward primer. The PCR products were diluted, mixed with the internal size standard LIZ500 (Applied Biosystems) and loaded on an ABI 3130 Genetic Analyzer. Alleles were scored using GeneMapper version 4.0 software (Applied Biosystems, Foster City, CA, USA).

Data analysis
The raw genome sequence data was first filtered to obtain high quality reads, then assembled using SOAP (http://soap.genomics.org.cn/soapdenovo.html) denovo software to contig, scaffold and fill in gaps. In addition, we used SSPACE software to build the scaffold. K-mer analysis was to help estimate the genome size and characters, such as heterozygosis and repeats.

Additional material
Additional file 1: Occurrence of different SSRs in the genome survey of Chinese bayberry.