Skip to main content

iMSAT: a novel approach to the development of microsatellite loci using barcoded Illumina libraries



Illumina sequencing with its high number of reads and low per base pair cost is an attractive technology for development of molecular resources for non-model organisms. While many software packages have been developed to identify short tandem repeats (STRs) from next-generation sequencing data, these methods do not inform the investigator as to whether or not candidate loci are polymorphic in their target populations.


We provide a python program iMSAT that uses the polymorphism data obtained from mapping individual Illumina sequence reads onto a reference genome to identify polymorphic STRs. Using this approach, we identified 9,119 candidate polymorphic STRs for use with the parasitoid wasp Trioxys pallidus and 2,378 candidate polymorphic STRs for use with the aphid Chromaphis juglandicola. For both organisms we selected 20 candidate tri-nucleotide STRs for validation. Using fluorescent-labeled oligonucleotide primers, we genotyped 91 female T. pallidus collected in nine localities and 46 female C. juglandicola collected in 4 localities and found 15 of the examined markers to be polymorphic for T. pallidus and 12 of the examined markers to be polymorphic for C. juglandicola.


We present a novel approach that uses standard Illumina barcoding primers and a single Illumina HiSeq run to target polymorphic STR fragments to develop and test STR markers. We validate this approach using the parasitoid wasp T. pallidus and its aphid host C. juglandicola. This approach, which would also be compatible with 454 Sequencing, allowed us to quickly identify markers with known variability. Accordingly, our method constitutes a significant improvement over existing STR identification software packages.


Next-Generation Sequencing (NGS) technologies have recently revolutionized the ease and the rate at which genetic resources can be developed [13]. This revolution has made it possible to now use the genetic tools in nearly all organisms that were previously only available for model taxa [4]. For example, the development of short tandem repeat (STR or microsatellite) markers in non-model organisms is currently undergoing a complete paradigm shift in regards to the techniques and methods used to isolate potential markers, particularly for insects [58]. These new techniques have replaced the laborious steps of DNA cloning with the speed and ease of NGS technologies [9] allowing researchers to quickly develop candidate markers for their study organisms.

Perhaps as a result of the increased accessibility to NGS technologies for STR marker development, there has also been an increased level of activity in the development of associated software for identifying candidate markers. Many highly cited packages exist [1022], but see [23] for a more thorough review. Current software packages work by searching through assembled sequence data for tandem-repeat regions, and then apply filters to optimize the list of candidate sites based on user specified criteria. Newer programs directly allow for the use of whole genome data [21] or raw sequence data from paired-end Illumina sequencing [22]. The most recent software program, SSR_pipeline, represents a particularly important improvement in the identification of STR data by directly using quality scores from the sequence reads to aid in the identification of STR markers. Yet, two major obstacles remain for the identification of STR markers for genetic analyses based on NGS sequence results. First, most existing software packages provide an overwhelming number of candidate loci. Second, they do not inform the investigator as to which loci are polymorphic for the populations under study. For example, a recent study that integrated NGS technologies with existing software packages to develop markers for a species of aphid [24] found that only 0.76% of their 342 candidate markers were suitable for use, though whether this was due to failure to amplify target loci with standard PCR protocols or because amplified loci were not polymorphic is unknown to us.

To improve the rate at which polymorphic STR markers can be identified and developed for use in genetic analyses, we present the use of a novel technique that uses barcoded Illumina sequencing libraries to identify polymorphic STR markers. We test this technique using two insect species from phylogentically distinct orders: the braconid wasp Trioxys pallidus and its aphid host Chromaphis juglandicola. Both insects occur in walnut orchards in California where C. juglandicola is an important invasive pest that was brought under effective biological control by the deliberate introduction of T. pallidus from Iran in 1969 [2527]. We then compare the patterns of STR motifs found for each species to other results published from their respective orders to examine the value of this approach for phylogenetically diverse organisms.


Next-generation sequencing results

Our Illumina sequencing run for T. pallidus resulted in over 99 million 100 base pair reads and our Illumina sequencing run for C. juglandicola resulted in over 170 million 100 base pair reads. Using the de novo genome assembly program Velvet [28], we constructed 65,535 contigs with an average length of 834.2 base pairs and an average coverage of 8.0X for T. pallidus. For C. juglandicola, we developed 474,388 contigs with an average length of 2,573 base pairs and an average coverage of 11.2X. Raw sequence reads were uploaded to BioSample (Accession Numbers SAMN03020618 - SAMN03020621).

Comparison of iMSAT to other methods for identifying STRs

Using MSATCOMMANDER [12] and Phobos [20], we identified 18,525 and 21,860 STRs for T. pallidus (Table 1) and 187,270 and 100,290 STRs for C. juglandicola (Table 2), respectively. Using our novel python program iMSAT (, we found 9,119 candidate polymorphic STRs for T. pallidus (Table 1) and 2,378 candidate polymorphic STRs for C. juglandicola (Table 2). For T. pallidus di-nucleotide STRs were the most abundant type identified by all three methods in being between 55% and 65%, tri-nucleotide STRs represented between 31% and 38% and tetra- and penta-nucleotide STRs constituted between 4% and 6% combined. For C. juglandicola, di-nucleotide STRs were again the most abundant type identified by all three methods (82-93%). However, for this species, tri-nucleotide STRs were rare (4.6-17%) while tetra- and penta-nucleotide STRs were extremely rare (0-3%).

Table 1 STR results from Trioxys pallidus
Table 2 STR results from Chromaphis juglandicola

Amplification of tri-nucleotide STRs in T. pallidus

Of the selected 20 STRs from our output of candidate polymorphic tri-nucleotide STRs, we consistently amplified 17 of them with standard PCR protocols. Two of these markers, TpMSAT3 and TpMSAT6 were excluded from the analysis because they displayed repeat patterns not consistent with tri-nucleotide STRs. DNA sequences for the repeat region of each STR marker used in this study were uploaded to GenBank (Accession #’s KC477413 - KC477427) and their characteristics were summarized in Table 3.

Table 3 Characteristics of the 15 and 12 polymorphic STRs isolated from T. pallidus and C. juglandicola

Characteristics of STR markers in T. pallidus

Allelic diversity ranged from three alleles per locus for TpMSAT05 to nine for TpMSAT11 and TpMSAT14 (Table 3). Measures of averaged heterozygosity ranged from 0.21 to 0.54 for Ho and 0.33 to 0.54 for He (Table 4). One locus, TpMSAT05, exhibited a marginally significant deviation from Hardy-Weinberg Equilibrium (HWE) (χ2 = 16.44, DF = 8, P = 0.04), though this deviation was not significant after Bonferroni correction for multiple comparisons (corrected α = 0.013). Another locus, TpMSAT13, exhibited a highly significant deviation from HWE (χ2 = 40.23, DF = 8, P = 0.002), which was still significant after Bonferroni correction (corrected α = 0.006). Linkage disequilibrium was not observed between any of the STR markers.

Table 4 Source populations of T. pallidus and C. juglandicola

Amplification of tri-nucleotide STRs in C. juglandicola

We selected 20 STRs from our output of candidate polymorphic tri-nucleotide STRs and we were able to consistently amplify 16 of them with standard PCR protocols. Of the 16 markers all but three were found to be polymorphic in our sample populations. One of these markers, CjMSAT12, was excluded from the analysis because it displayed fragment length polymorphisms outside of its expected range. DNA sequences for the repeat region of each STR marker used in this study were uploaded to GenBank (Accession #’s KJ939575 - KJ939587), and their characteristics were summarized in Table 4.

Characteristics of STR markers in C. juglandicola

Allelic diversity for polymorphic loci ranged from two alleles per locus for CjMSAT01, CjMSAT03, CjMSAT08, CjMSAT09, CjMSAT16, and CjMSAT19 to seven for CjMSAT13. Measures of averaged heterozygosity ranged from 0.08 to 0.15 for Ho and 0.08 to 0.17 for He (Table 4). No locus displayed deviations from HWE, and there was no evidence of linkage disequilibrium observed between any of the STR markers.


The genomic revolution sparked by the advent of NGS is well underway, and its low per base pair cost and high number of sequence reads yields many benefits and tools [29], including the rapid development of polymorphic markers for population genetic studies. Our pipeline involving iMSAT identifies polymorphic STRs from two simultaneously obtained sequencing reads. The output of iMSAT facilitates the design of primers for population-level studies, reducing the time and expense associated with the production of STRs.

Potential benefits and limitations

iMSAT represents a significant improvement over existing techniques. NGS technologies are today’s standard for developing STR markers e.g. [5, 30]. They have eliminated the laborious steps associated with plasmid cloning [reviewed by [31]]. However, candidate markers still require testing to identify polymorphic regions. Given the large numbers of candidate markers identified by existing software packages (Tables 1 and 2), the selection of candidate polymorphic loci and their validation with PCR is both expensive and time consuming. Two recent studies using NGS technologies and existing STR software programs employed PCR screening to examine 48 [32] and 342 [24] candidate markers, for which only 11 (23%) and 26 (0.76%), respectively, were used in the subsequent studies. Whether these relatively low rates of success in the development of effective markers were due to problems with PCR amplification or to fixation of the markers once amplified is unclear. Regardless, our approach identifies 17 of the 20 candidate polymorphic STRs, among which 75% of the original 20 are polymorphic for T. pallidus. Thus, iMSAT is a useful tool for population genetic research. Similarly, 16 out of 20 candidates amplified consistently and 12 of the 16 markers (60% of the original 20) are polymorphic for C. juglandicola.

Our program adds virtually no costs to the overall production of STR markers, as an additional Illumina Sequencing Library, for example, can be produced by a third party for as little as $200 USD (quote from the Functional Genomics Laboratory at the University of California Berkeley, June 2014). Freely available, iMSAT generates a list of polymorphic STR markers in a fast and cost-effective manner. Although iMSAT recovers far fewer potential STR markers than other programs, its ability to identify candidate polymorphic greatly outweighs the reduction in total numbers of potential markers. Most studies based on STRs have used relatively few markers (10–50) and, statistically, there is no need to develop upwards of 12,000 STR markers that is possible using NGS technologies. The ‘novelty’ of our approach is to use the polymorphism data provided by the raw sequence reads themselves to identify candidate STR markers, and our program takes advantage of the output from existing software tools [28, 33, 34].

A similar approach to screening NGS sequence results for polymorphic regions before STR development has previously been presented by Hoffman and Nichols [35]. These authors also pooled DNA extracts to create a single sequencing library for 454 sequencing, re-mapped the individual sequence reads to their de novo assembly, and targeted STR repeats that appeared polymorphic. While similar in that both approaches perform in silico polymorphism detection, ours has the advantages that by using barcoded libraries we were able to assign sequence reads to both of our populations of T. pallidus and C. juglandicola with only a single run each. This advantage is particularly valuable, as it allows the identification of markers that not only are likely to be polymorphic, but whose polymorphism can also be characterized as either within and/or between populations. This greatly increases the utility of the data in generating useful STR markers, and may in part explain the greater rate of success we observed in isolating polymorphic markers.

Comparison of results with other species of insects

The availability of published genomes from several insect species allows for comparative genomic analyses, including examinations into the diversity and distribution of STR motifs. Behura and Severson [36] compared coding sequences from 25 species of insects representing five different orders. In contrast to our findings for T. pallidus and C. juglandicola, they found that tri-nucleotide repeats were the most common repeat type across insects. Their results may be inherently biased towards recovery of tri-nucleotide repeats, however, because they focused on coding regions of DNA where single or double base pair insertions/deletions are unlikely [37]. Another recent study [38] examined both coding and non-coding regions, compared published whole genome sequence data from 12 species of insects representing six orders. Although most species had predominantly di- or tri-nucleotide repeats, no one type was dominant; even congeners differed in which type of repeats were most abundant. The most dramatic example occured between Drosophila simulans and D. melanogaster. While D. simulans had relatively equal proportions of di-, and tri-nucleotides as the most abundant repeat types, penta-nucleotide repeats were most abundant in D. melanogaster and twice more than any other type of repeat. Interestingly, they found that STRs were more common among the Hymenoptera and represented a higher percentage of the genome than in any of the other orders of insects examined. The Hymenoptera also differed from other orders in that di-nucleotide repeats were the most abundant type of repeat – between 2 and 5 times more frequent than tri-nucleotide repeats. For the aphid Acyrthosiphon pisum, di-nucleotide repeats were about half as abundant as in their examined hymenopteras. Contrary to our results for C. juglandicola, they also found that tri-nucleotide repeats were the most abundant type of repeat unit.


We announce a novel approach for using NGS technologies in conjunction with several popular software packages to identify polymorphic STRs. This approach allows the rapid and cost-effective development of 15 polymorphic STRs for T. pallidus and 12 for C. juglandicola.


To identify and test STR markers, we used an NGS approach. Sequencing libraries for T. pallidus were created by pooling twenty individuals of T. pallidus reared from filbert aphids collected in Bethel, Oregon, United States into a sample labeled “Hazelnut”, and twenty individuals of T. pallidus reared from walnut aphids collected in Tehran, Iran into a sample labeled “Walnut”. DNA was then extracted from each pooled sample using a Qiagen DNeasy Blood & Tissue Kit (Qiagen) with the following modification. To reduce the amount of residual salt in the extract, critical for NGS applications, we performed the AW1 and AW2 washes twice each, followed by an additional spin step to remove any residual AW2 buffer. This was followed by standard elution with the AE buffer. Sequencing libraries for C. juglandicola were created by pooling 20 individuals of C. juglandicola collected in Upper Lake, California, United States, into a sample labeled “US”, and 20 individuals of C. juglandicola collected in Parnac, France into a sample labeled “France”. DNA was then extracted from each pooled sample using the Qiagen Gentra-PureGene DNA Extraction Kit (Qiagen). Concentrations of nucleic acids for all extracts were then quantified with a ND-1000 NanoDrop® (NanoDrop Technologies, Inc.) and concentrations of double stranded DNA were measured using the Qubit® dsDNA HS Assay kit (Life Technologies Corp.). Sequencing libraries for each T. pallidus extract were created using the Nextera™ DNA Sample Prep Kit (Illumina, Inc.) as per instructions, and each library was constructed using a different Illumina barcoding primer. Sequencing libraries for each C. juglandicola extract were created using the PrepX™ ILM DNA Library Kit (Wafergen Biosystems, Inc.) at The Functional Genomics Laboratory of the University of California Berkeley and each library was constructed using a different Illumina barcoding primer. Sequencing libraries were examined for fragment length distribution and concentrations using a 2100 Expert Bioanalyzer (Agiliant Technologies), and a KAPA Biosystems Library Quantification Kit (KAPA Biosystems). Each species’ libraries were then pooled together, and sequenced independently each using a single run of an Illumina HiSeq2000 (Illumina, Inc.) sequencer at the Vincent J. Coates Genomics Sequencing Laboratory, University of California Berkeley.

Summary statistics representing the sequence results from the Illumina HiSeq2000 run were calculated using the FASTX-Toolkit [39] and this program was then used to filter low quality reads. Individual Illumina sequencing reads were then assembled into contigs using the de novo assembly program Velvet 1.1.06 [28] with a kmer length of 65 for T. pallidus and 67 for C. juglandicola. We then used MSATCOMMANDER [12] and Phobos [20] to identify di-, tri-, tetra-, and penta-nucleotide repeat patterns with their default settings. We then compared these results to those identified with iMSAT, our novel python program ( iMSAT uses a “.vcf” report file of polymorphic sites generated from mapping NGS sequencing reads to a genome assembly using BWA [34] and SAMtools [33]. Both BWA and SAMtools are widely used for the identification and analysis of single nucleotide polymorphisms [4042].

iMSAT uses an interactive command-line interface (see Figure 1 for a graphical representation). The first user prompt asks for the locations of the alignment and “.vcf” files as well as the formatting of the alignment file. iMSAT can process alignment files with both traditional “.FASTA preprocessing” formatted sequence data (i.e. one line beginning with a “>” followed by the sequence name, and a second line with the sequence data) or a tab-delimited format (i.e. one line with both sequence name and sequence data separated by a tab). Our program subsequently filters the “indel” data from the “.vcf” report, and searches for all polymorphic sites that represented di-, tri-, tetra-, and penta-nucleotide STRs that were greater than five repeat units in length. The user is prompted as to whether or not they would like a separate list of polymorphic STR markers that are “fixed” in one of their target populations. The program then produces a “.FASTA preprocessing” formatted file identifying the location of the polymorphic STR in the sequence title and 300 bp of both the leading and trailing sequence strands to allow for the production of primers. Sequence information for all primers used, including fluorescent label are available in Additional file 1.

Figure 1

iMSAT workflow diagram. Before using iMSAT, barcoded NGS sequencing libraries are produced (A) and sequenced (B) and either used to create a de novo assembly or to align an available reference genome (C). Then using SAMtools and BWA the individual sequence reads are used to create a polymorphism report (D) that includs the location of the polymorphic loci, type (SNP or INDEL), and other quality statistics. iMSAT then uses the output and the alignment file to filter the polymorphism data based on a user specified number of base pairs (E), identifies the STR motifs and the number of repeats (F), and outputs separate .FASTA preprocessing files for each candidate locus that can be used with primer design software (G).

For T. pallidus, we tested the program as if whole genome assembly was being used. To do so, we combined the contig sequences generated by Velvet [28] into one continuous DNA sequence strand with the union of two contig sequences being differentiated by the addition of 100 “N” base pairs. The addition of these “N” base pairs ensured that when we could exclude any potential STR markers that would be artificially created when we joined the separate consensus sequences. For C. juglandicola we tested the program using the raw output from Velvet [28] where all 474,388 contigs were represented in traditional FASTA preprocessing formatting. For both species we then used the “vcf” report generated using BWA [34] and SAMtools [33] to target polymorphic STRs.

To validate this approach, we filtered the data to only include those repeat regions that were; a) tri-nucleotide repeats, b) were composed of high-quality reads based on the “vcf” file, and c) had no “N” base calls within 300 base pairs of the repeat region to allow for primer construction. Though the majority of candidate STRs were di-nucleotide repeats, we selected tri-nucleotide repeats because of the known problems associated with scoring di-nucleotide STRs caused by “stutters” [43]. For each species, we then selected the 20 tri-nucleotide candidate markers with the greatest number of repeat units. Primer pairs for all markers were generated using Primer3 [44] as implemented in Geneious 5.6.2 [45]. To ease multiplexing, primers were designed to be at least 20 base pairs in length and to have an optimal annealing temperature of 57°C.

To test the candidate markers, DNA was extracted from 91 female T. pallidus reared from three species of aphid (C. juglandicola and Panaphis juglandis on walnut, and M. coryli on filbert) from nine different localities (Table 4), and from 46 female C. juglandicola from four different localities (Table 4) using the Qiagen DNeasy Blood & Tissue Kit (Qiagen). Non-labeled oligonucelotide primers were used to test and optimize the conditions of each of the 20 candidate regions for each species through standard PCR protocols and the amplified fragments were sequenced at the DNA Sequencing Facility of the University of California Berkeley. For candidate markers that were consistently amplified, fluorescent-labeled primers compatible with the GeneScan™ 600 LIZ size standard (Life Technologies) were used. PCR conditions were then re-optimized for the fluorescent-labeled primers. For both species markers were amplified using one of two PCR protocols signified by their primary annealing temperature (Ta 57 or Ta 50). For Ta 57 an initial denaturation for 5 min at 95°C was followed by 35 cycles of 95°C for 1 min, 57°C for 1.5 min, 72°C for 1 min, followed by a 10 min extension period at 72°C. For Ta 50, a touchdown protocol was used with the following profile: an initial denaturation for 5 min at 95°C, followed by 14 cycles of 95°C for 1 min, 57°C for 1.5 min, and 72°C for 1 min where the annealing temperature decreased 0.5°C every cycle, followed by 30 cycles with an annealing temperature of 50°C, and a 10 min extension period at 72°C.

Fragment lengths were measured in comparison to the GeneScan™ LIZ® 600 Size Standard v. 2.0 (Life technologies) using an Applied Biosystems 3730XL (Life Technologies) at the DNA Sequencing Facility at the University of California Berkeley, and scored using the Microsatellite Plug-in for Geneious 5.6.2 [45]. The number of alleles per locus (k), averaged observed (Ho) and expected (He) heterozygosity, deviations from Hardy-Weinberg equilibrium (HWE), and presence of linkage disequilibrium (LD) between loci were examined using GenePop 4.2 [46, 47].

Availability of supporting data

The data sets supporting the results of this article are available in the NCBI data repository. Raw sequence reads from the Illumina HiSeq runs for Trioxys pallidus and Chromphis juglandicola have the following Accession Numbers: SAMN03020618 - SAMN03020621 and can be found at

Sequences of SSR loci for Trioxys pallidus have the following Accession Numbers: KC477413 - KC477427 and can be found at

Sequences of SSR loci for Chrompahis juglandicola have the following Accession Numbers: KJ939575 - KJ939587 and can be found at

The python script and supporting information are available in the SourceForge source code repository at


  1. 1.

    Mardis ER: The impact of next-generation sequencing technology on genetics. Trends Genet. 2008, 24 (3): 133-141. 10.1016/j.tig.2007.12.007.

    CAS  PubMed  Article  Google Scholar 

  2. 2.

    Metzker ML: Applications of next-generation sequencing; sequencing technologies - the next generation. Nat Rev Genet. 2010, 11 (1): 31-46. 10.1038/nrg2626.

    CAS  PubMed  Article  Google Scholar 

  3. 3.

    Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.

    CAS  PubMed  Article  Google Scholar 

  4. 4.

    Ekblom R, Galindo J: Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity. 2011, 107 (1): 1-15. 10.1038/hdy.2010.152.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  5. 5.

    Gardner MG, Fitch AJ, Bertozzi T, Lowe AJ: Rise of the machines - recommendations for ecologists when using next generation sequencing for microsatellite development. Mol Ecol Resour. 2011, 11 (6): 1093-1101. 10.1111/j.1755-0998.2011.03037.x.

    PubMed  Article  Google Scholar 

  6. 6.

    Keller D, Jung E, Holderegger R: Development of microsatellite markers for the wetland grasshopper Stethophyma grossum. Conserv Genet Resour. 2012, 4 (2): 507-509. 10.1007/s12686-011-9586-1.

    Article  Google Scholar 

  7. 7.

    Bai X, Zhang W, Orantes L, Jun T-H, Mittapalli O, Mian MAR, Michel AP: Combining Next-Generation Sequencing strategies for rapid molecular resource development from an invasive aphid species, Aphis glycines. PLoS One. 2010, 5 (6): e11370-10.1371/journal.pone.0011370.

    PubMed Central  PubMed  Article  Google Scholar 

  8. 8.

    Cerna K, Straka J: Identification of 37 microsatellite loci for Anthophora plumipes (Hymenoptera: Apidae) using next generation sequencing and their utility in related species. Eur J Entomol. 2012, 109 (2): 155-160. 10.14411/eje.2012.020.

    CAS  Article  Google Scholar 

  9. 9.

    López-Uribe MM, Santiago CK, Bogdanowicz SM, Danforth BN: Discovery and characterization of microsatellites for the solitary bee Colletes inaequalis using Sanger and 454 pyrosequencing. Apidologie. 2013, 44 (2): 163-172. 10.1007/s13592-012-0168-3.

    Article  Google Scholar 

  10. 10.

    Castoe TA, Poole AW, Koning APJ, Jones KL, Tomback DF, Oyler-McCance SJ, Fike J, Lance SL, Streicher JW, Smith EN, Pollock DD: Rapid microsatellite identification from Illumina paired-end genomic sequencing in two birds and a snake. PLoS ONE. 2012, 7: e30953-10.1371/journal.pone.0030953.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  11. 11.

    Castoe TA, Poole AW, Gu W, de Koning APJ, Daza M, Smith EN, Pollock DD: Rapid identification of thousands of microsatellite loci for the copperhead snake (Agkistrodon contortrix) from modest amounts of 454 shotgun genome sequence. Mol Ecol Resour. 2010, 10: 341-347. 10.1111/j.1755-0998.2009.02750.x.

    CAS  PubMed  Article  Google Scholar 

  12. 12.

    Faircloth BC: MSATCOMMANDER: detection of microsatellite repeat arrays and automated, locus-specific primer design. Mol Ecol Resour. 2008, 8 (1): 92-94. 10.1111/j.1471-8286.2007.01884.x.

    CAS  PubMed  Article  Google Scholar 

  13. 13.

    Dereeper A, Argout X, Billot C, Rami J-F, Ruiz M: SAT, a flexible and optimized Web application for SSR marker development. BMC Bioinformatics. 2007, 8: 465-10.1186/1471-2105-8-465.

    PubMed Central  PubMed  Article  Google Scholar 

  14. 14.

    Edgar RC, Myers EW: PILER: identification and classification of genomic repeats. Bioinformatics. 2005, 21: I152-I158. 10.1093/bioinformatics/bti1003.

    CAS  PubMed  Article  Google Scholar 

  15. 15.

    Jewell E, Robinson A, Savage D, Erwin T, Love CG, Lim GAC, Li X, Batley J, Spangenberg GC, Edwards D: SSRPrimer and SSR Taxonomy Tree: Biome SSR discovery. Nucleic Acids Res. 2006, 34: W656-W659. 10.1093/nar/gkl083.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  16. 16.

    Kofler R, Schloetterer C, Lelley T: SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics. 2007, 23 (13): 1683-1685. 10.1093/bioinformatics/btm157.

    CAS  PubMed  Article  Google Scholar 

  17. 17.

    Kraemer L, Beszteri B, Gaebler-Schwarz S, Held C, Leese F, Mayer C, Poehlmann K, Frickenhaus S: STAMP: Extensions to the STADEN sequence analysis package for high throughput interactive microsatellite marker design. BMC Bioinformatics. 2009, 10: 465-

    Article  Google Scholar 

  18. 18.

    Li Q, Wan J-M: SSRHunter: Development of a local searching software for SSR sites. Yichuan. 2005, 27 (5): 808-810.

    Google Scholar 

  19. 19.

    Thurston MI, Field D: Msatfinder: detection and characterisation of microsatellites. 2005, Distributed by the authors at CEH Oxford, Mansfield Road, Oxford OX1 3SR

    Google Scholar 

  20. 20.

    Mayer C: Phobos 3.3.11.>. In.; 2006–2010

  21. 21.

    Du L, Li Y, Zhang X, Yue B: MSDB: A user-friendly program for reporting distribution and building databases of microsatellites from genome sequences. J Hered. 2013, 104 (1): 154-157. 10.1093/jhered/ess082.

    PubMed  Article  Google Scholar 

  22. 22.

    Miller MP, Knaus BJ, Mullins TD, Haig SM: SSR_pipeline: A bioinformatic infrastructure for identifying microsatellites from paired-end Illumina high-throughput DNA sequencing data. J Hered. 2013, 104 (6): 881-885. 10.1093/jhered/est056.

    CAS  PubMed  Article  Google Scholar 

  23. 23.

    Lim KG, Kwoh CK, Hsu LY, Wirawan A: Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance. Brief Bioinform. 2013, 14 (1): 67-81. 10.1093/bib/bbs023.

    PubMed  Article  Google Scholar 

  24. 24.

    Jun TH, Mian MAR, Freewalt K, Mittapalli O, Michel AP: Development of genic-SSRs markers from soybean aphid sequences generated by high-throughput sequencing of cDNA library. J Appl Entomol. 2012, 136 (8): 614-625. 10.1111/j.1439-0418.2011.01697.x.

    CAS  Article  Google Scholar 

  25. 25.

    van den Bosch R, Frazer BD, Davis CS, Messenger PS, Hom R: Trioxys pallidus: an effective new walnut aphid parasite from Iran. Calif Agric. 1970, 24 (11): 8-10.

    Google Scholar 

  26. 26.

    van den Bosch R, Hom R, Matteson P, Frazer BD, Messenger PS, Davis CS: Biological control of the walnut aphid in California: impact of the parasite. Trioxys pallidus Hilgardia. 1979, 47 (1): 1-13. 10.3733/hilg.v47n01p001.

    Article  Google Scholar 

  27. 27.

    van den Bosch R, Schilinger EI, Hagen KS: Initial field observations in California on Trioxys pallidus (Haliday) a recently introduced parasite of the walnut aphid. J Econ Entomol. 1962, 55 (6): 857-862.

    Article  Google Scholar 

  28. 28.

    Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  29. 29.

    Silva PI, Martins AM, Gouvea EG, Pessoa-Filho M, Ferreira ME: Develoment and validation of microsatellite markers for Brachiaria ruziziensis obtained by partial genome assembly of Illumina single-end reads. BMC Genomics. 2013, 14 (17): 9-

    Google Scholar 

  30. 30.

    Zalapa JE, Cuevas H, Zhu H, Steffan S, Senalik D, Zeldin E, McCown B, Harbut R, Simon P: Using Next-Generation Sequencing approaches to isolate simple sequence repeat (SSR) loci in the plant sciences. Am J Bot. 2012, 99 (2): 193-208. 10.3732/ajb.1100394.

    CAS  PubMed  Article  Google Scholar 

  31. 31.

    Andrés JA, Bogdanowicz SM: Isolating microsatellite loci: looking back, looking ahead. Molecular Methods for Evolutionary Genetics. Edited by: Orgogozo V, Rockman MV. 2011, New York: Springer, 211-232.

    Google Scholar 

  32. 32.

    McEwen JR, Vamosi JC, Rogers SM: Rapid isolation and cross-amplification of microsatellite markers in Plectritis congesta (Valerianaceae) with 454 sequencing. Am J Bot. 2011, 98 (12): e369-e371. 10.3732/ajb.1100160.

    CAS  PubMed  Article  Google Scholar 

  33. 33.

    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup’ GPDP: The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.

    PubMed Central  PubMed  Article  Google Scholar 

  34. 34.

    Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  35. 35.

    Hoffman JI, Nichols HJ: A novel approach for mining polymorphic microsatellite markers in silico. PLoS One. 2011, 6 (8): e23283-10.1371/journal.pone.0023283.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  36. 36.

    Behura SK, Severson DW: Genome-wide comparative analysis of simple sequence coding repeats among 25 insect species. Gene. 2012, 504 (2): 226-232. 10.1016/j.gene.2012.05.020.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  37. 37.

    Li Y-C, Korol AB, Fahima T, Beiles A, Nevo E: Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol Ecol. 2002, 11 (12): 2453-2465. 10.1046/j.1365-294X.2002.01643.x.

    CAS  PubMed  Article  Google Scholar 

  38. 38.

    Pannebakker BA, Niehuis O, Hedley A, Gadau J, Shuker DM: The distribution of microsatellites in the Nasonia parasitoid wasp genome. Insect Mol Biol. 2010, 19: 91-98.

    CAS  PubMed  Article  Google Scholar 

  39. 39.

    Gordon A: FASTX-toolkit - FASTA/FASTQ preprocessing tools. 2009,,

    Google Scholar 

  40. 40.

    McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20 (9): 1297-1303. 10.1101/gr.107524.110.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  41. 41.

    DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKeena A, Fennell TJ, Kernystsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011, 43 (5): 491-498. 10.1038/ng.806.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  42. 42.

    Nielsen R, Paul JS, Albrechtsen A, Song YS: Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011, 12 (6): 443-451. 10.1038/nrg2986.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  43. 43.

    DeWoody JA, Nason JD, Hipkins VD: Mitigating scoring errors in microsatellite data from wild populations. Mol Ecol Notes. 2006, 6 (4): 951-957. 10.1111/j.1471-8286.2006.01449.x.

    CAS  Article  Google Scholar 

  44. 44.

    Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and for biologist programmers. Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by: Krawetz S, Misener S. 2000, Totowa, NJ: Humana Press, 365-386.

    Google Scholar 

  45. 45.

    Geneious v 5.6.2 created by Biomatters. Available from

  46. 46.

    Raymond M, Rousset F: GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. J Heredity. 1995, 86: 248-249.

    Google Scholar 

  47. 47.

    Rousset F: Genepop’007: a complete reimplementation of the Genepop software for Windows and Linux. Mol Ecol Resour. 2008, 8: 103-106. 10.1111/j.1471-8286.2007.01931.x.

    PubMed  Article  Google Scholar 

Download references


Funding for this study was provided by an EPA-STAR grant awarded to JCA. This work could not have been completed without the technical assistance of J Lozier, B Ort, S Ramirez and N Tsutsui. This work used the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley, supported by NIH S10 Instrumentation Grants S10RR029668 and S10RR027303 where de novo genome assembly was performed. We thank all of our collaborators who have aided in the collection of T. pallidus and C. juglandicola individuals, as well as two anonymous reviewers whose comments and suggestions greatly improved the manuscript.

Author information



Corresponding author

Correspondence to Jeremy C Andersen.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JCA and NJM conceived of the study, participated in the design, wrote the manuscript, and read and approved the final version. JCA collected the molecular data.

Jeremy C Andersen and Nicholas J Mills contributed equally to this work.

Electronic supplementary material


Additional file 1: STR primer sequences. Excel spreadsheet including the sequence data for forward and reverse primers used to target each STR marker. (XLS 30 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Andersen, J.C., Mills, N.J. iMSAT: a novel approach to the development of microsatellite loci using barcoded Illumina libraries. BMC Genomics 15, 858 (2014).

Download citation


  • Illumina Sequencing
  • Sequencing Library
  • Candidate Marker
  • Aphid Host
  • Exist Software Package