iMSAT: a novel approach to the development of microsatellite loci using barcoded Illumina libraries
BMC Genomics volume 15, Article number: 858 (2014)
Illumina sequencing with its high number of reads and low per base pair cost is an attractive technology for development of molecular resources for non-model organisms. While many software packages have been developed to identify short tandem repeats (STRs) from next-generation sequencing data, these methods do not inform the investigator as to whether or not candidate loci are polymorphic in their target populations.
We provide a python program iMSAT that uses the polymorphism data obtained from mapping individual Illumina sequence reads onto a reference genome to identify polymorphic STRs. Using this approach, we identified 9,119 candidate polymorphic STRs for use with the parasitoid wasp Trioxys pallidus and 2,378 candidate polymorphic STRs for use with the aphid Chromaphis juglandicola. For both organisms we selected 20 candidate tri-nucleotide STRs for validation. Using fluorescent-labeled oligonucleotide primers, we genotyped 91 female T. pallidus collected in nine localities and 46 female C. juglandicola collected in 4 localities and found 15 of the examined markers to be polymorphic for T. pallidus and 12 of the examined markers to be polymorphic for C. juglandicola.
We present a novel approach that uses standard Illumina barcoding primers and a single Illumina HiSeq run to target polymorphic STR fragments to develop and test STR markers. We validate this approach using the parasitoid wasp T. pallidus and its aphid host C. juglandicola. This approach, which would also be compatible with 454 Sequencing, allowed us to quickly identify markers with known variability. Accordingly, our method constitutes a significant improvement over existing STR identification software packages.
Next-Generation Sequencing (NGS) technologies have recently revolutionized the ease and the rate at which genetic resources can be developed [1–3]. This revolution has made it possible to now use the genetic tools in nearly all organisms that were previously only available for model taxa . For example, the development of short tandem repeat (STR or microsatellite) markers in non-model organisms is currently undergoing a complete paradigm shift in regards to the techniques and methods used to isolate potential markers, particularly for insects [5–8]. These new techniques have replaced the laborious steps of DNA cloning with the speed and ease of NGS technologies  allowing researchers to quickly develop candidate markers for their study organisms.
Perhaps as a result of the increased accessibility to NGS technologies for STR marker development, there has also been an increased level of activity in the development of associated software for identifying candidate markers. Many highly cited packages exist [10–22], but see  for a more thorough review. Current software packages work by searching through assembled sequence data for tandem-repeat regions, and then apply filters to optimize the list of candidate sites based on user specified criteria. Newer programs directly allow for the use of whole genome data  or raw sequence data from paired-end Illumina sequencing . The most recent software program, SSR_pipeline, represents a particularly important improvement in the identification of STR data by directly using quality scores from the sequence reads to aid in the identification of STR markers. Yet, two major obstacles remain for the identification of STR markers for genetic analyses based on NGS sequence results. First, most existing software packages provide an overwhelming number of candidate loci. Second, they do not inform the investigator as to which loci are polymorphic for the populations under study. For example, a recent study that integrated NGS technologies with existing software packages to develop markers for a species of aphid  found that only 0.76% of their 342 candidate markers were suitable for use, though whether this was due to failure to amplify target loci with standard PCR protocols or because amplified loci were not polymorphic is unknown to us.
To improve the rate at which polymorphic STR markers can be identified and developed for use in genetic analyses, we present the use of a novel technique that uses barcoded Illumina sequencing libraries to identify polymorphic STR markers. We test this technique using two insect species from phylogentically distinct orders: the braconid wasp Trioxys pallidus and its aphid host Chromaphis juglandicola. Both insects occur in walnut orchards in California where C. juglandicola is an important invasive pest that was brought under effective biological control by the deliberate introduction of T. pallidus from Iran in 1969 [25–27]. We then compare the patterns of STR motifs found for each species to other results published from their respective orders to examine the value of this approach for phylogenetically diverse organisms.
Next-generation sequencing results
Our Illumina sequencing run for T. pallidus resulted in over 99 million 100 base pair reads and our Illumina sequencing run for C. juglandicola resulted in over 170 million 100 base pair reads. Using the de novo genome assembly program Velvet , we constructed 65,535 contigs with an average length of 834.2 base pairs and an average coverage of 8.0X for T. pallidus. For C. juglandicola, we developed 474,388 contigs with an average length of 2,573 base pairs and an average coverage of 11.2X. Raw sequence reads were uploaded to BioSample (Accession Numbers SAMN03020618 - SAMN03020621).
Comparison of iMSAT to other methods for identifying STRs
Using MSATCOMMANDER  and Phobos , we identified 18,525 and 21,860 STRs for T. pallidus (Table 1) and 187,270 and 100,290 STRs for C. juglandicola (Table 2), respectively. Using our novel python program iMSAT (https://sourceforge.net/projects/imsat/), we found 9,119 candidate polymorphic STRs for T. pallidus (Table 1) and 2,378 candidate polymorphic STRs for C. juglandicola (Table 2). For T. pallidus di-nucleotide STRs were the most abundant type identified by all three methods in being between 55% and 65%, tri-nucleotide STRs represented between 31% and 38% and tetra- and penta-nucleotide STRs constituted between 4% and 6% combined. For C. juglandicola, di-nucleotide STRs were again the most abundant type identified by all three methods (82-93%). However, for this species, tri-nucleotide STRs were rare (4.6-17%) while tetra- and penta-nucleotide STRs were extremely rare (0-3%).
Amplification of tri-nucleotide STRs in T. pallidus
Of the selected 20 STRs from our output of candidate polymorphic tri-nucleotide STRs, we consistently amplified 17 of them with standard PCR protocols. Two of these markers, TpMSAT3 and TpMSAT6 were excluded from the analysis because they displayed repeat patterns not consistent with tri-nucleotide STRs. DNA sequences for the repeat region of each STR marker used in this study were uploaded to GenBank (Accession #’s KC477413 - KC477427) and their characteristics were summarized in Table 3.
Characteristics of STR markers in T. pallidus
Allelic diversity ranged from three alleles per locus for TpMSAT05 to nine for TpMSAT11 and TpMSAT14 (Table 3). Measures of averaged heterozygosity ranged from 0.21 to 0.54 for Ho and 0.33 to 0.54 for He (Table 4). One locus, TpMSAT05, exhibited a marginally significant deviation from Hardy-Weinberg Equilibrium (HWE) (χ2 = 16.44, DF = 8, P = 0.04), though this deviation was not significant after Bonferroni correction for multiple comparisons (corrected α = 0.013). Another locus, TpMSAT13, exhibited a highly significant deviation from HWE (χ2 = 40.23, DF = 8, P = 0.002), which was still significant after Bonferroni correction (corrected α = 0.006). Linkage disequilibrium was not observed between any of the STR markers.
Amplification of tri-nucleotide STRs in C. juglandicola
We selected 20 STRs from our output of candidate polymorphic tri-nucleotide STRs and we were able to consistently amplify 16 of them with standard PCR protocols. Of the 16 markers all but three were found to be polymorphic in our sample populations. One of these markers, CjMSAT12, was excluded from the analysis because it displayed fragment length polymorphisms outside of its expected range. DNA sequences for the repeat region of each STR marker used in this study were uploaded to GenBank (Accession #’s KJ939575 - KJ939587), and their characteristics were summarized in Table 4.
Characteristics of STR markers in C. juglandicola
Allelic diversity for polymorphic loci ranged from two alleles per locus for CjMSAT01, CjMSAT03, CjMSAT08, CjMSAT09, CjMSAT16, and CjMSAT19 to seven for CjMSAT13. Measures of averaged heterozygosity ranged from 0.08 to 0.15 for Ho and 0.08 to 0.17 for He (Table 4). No locus displayed deviations from HWE, and there was no evidence of linkage disequilibrium observed between any of the STR markers.
The genomic revolution sparked by the advent of NGS is well underway, and its low per base pair cost and high number of sequence reads yields many benefits and tools , including the rapid development of polymorphic markers for population genetic studies. Our pipeline involving iMSAT identifies polymorphic STRs from two simultaneously obtained sequencing reads. The output of iMSAT facilitates the design of primers for population-level studies, reducing the time and expense associated with the production of STRs.
Potential benefits and limitations
iMSAT represents a significant improvement over existing techniques. NGS technologies are today’s standard for developing STR markers e.g. [5, 30]. They have eliminated the laborious steps associated with plasmid cloning [reviewed by ]. However, candidate markers still require testing to identify polymorphic regions. Given the large numbers of candidate markers identified by existing software packages (Tables 1 and 2), the selection of candidate polymorphic loci and their validation with PCR is both expensive and time consuming. Two recent studies using NGS technologies and existing STR software programs employed PCR screening to examine 48  and 342  candidate markers, for which only 11 (23%) and 26 (0.76%), respectively, were used in the subsequent studies. Whether these relatively low rates of success in the development of effective markers were due to problems with PCR amplification or to fixation of the markers once amplified is unclear. Regardless, our approach identifies 17 of the 20 candidate polymorphic STRs, among which 75% of the original 20 are polymorphic for T. pallidus. Thus, iMSAT is a useful tool for population genetic research. Similarly, 16 out of 20 candidates amplified consistently and 12 of the 16 markers (60% of the original 20) are polymorphic for C. juglandicola.
Our program adds virtually no costs to the overall production of STR markers, as an additional Illumina Sequencing Library, for example, can be produced by a third party for as little as $200 USD (quote from the Functional Genomics Laboratory at the University of California Berkeley, June 2014). Freely available, iMSAT generates a list of polymorphic STR markers in a fast and cost-effective manner. Although iMSAT recovers far fewer potential STR markers than other programs, its ability to identify candidate polymorphic greatly outweighs the reduction in total numbers of potential markers. Most studies based on STRs have used relatively few markers (10–50) and, statistically, there is no need to develop upwards of 12,000 STR markers that is possible using NGS technologies. The ‘novelty’ of our approach is to use the polymorphism data provided by the raw sequence reads themselves to identify candidate STR markers, and our program takes advantage of the output from existing software tools [28, 33, 34].
A similar approach to screening NGS sequence results for polymorphic regions before STR development has previously been presented by Hoffman and Nichols . These authors also pooled DNA extracts to create a single sequencing library for 454 sequencing, re-mapped the individual sequence reads to their de novo assembly, and targeted STR repeats that appeared polymorphic. While similar in that both approaches perform in silico polymorphism detection, ours has the advantages that by using barcoded libraries we were able to assign sequence reads to both of our populations of T. pallidus and C. juglandicola with only a single run each. This advantage is particularly valuable, as it allows the identification of markers that not only are likely to be polymorphic, but whose polymorphism can also be characterized as either within and/or between populations. This greatly increases the utility of the data in generating useful STR markers, and may in part explain the greater rate of success we observed in isolating polymorphic markers.
Comparison of results with other species of insects
The availability of published genomes from several insect species allows for comparative genomic analyses, including examinations into the diversity and distribution of STR motifs. Behura and Severson  compared coding sequences from 25 species of insects representing five different orders. In contrast to our findings for T. pallidus and C. juglandicola, they found that tri-nucleotide repeats were the most common repeat type across insects. Their results may be inherently biased towards recovery of tri-nucleotide repeats, however, because they focused on coding regions of DNA where single or double base pair insertions/deletions are unlikely . Another recent study  examined both coding and non-coding regions, compared published whole genome sequence data from 12 species of insects representing six orders. Although most species had predominantly di- or tri-nucleotide repeats, no one type was dominant; even congeners differed in which type of repeats were most abundant. The most dramatic example occured between Drosophila simulans and D. melanogaster. While D. simulans had relatively equal proportions of di-, and tri-nucleotides as the most abundant repeat types, penta-nucleotide repeats were most abundant in D. melanogaster and twice more than any other type of repeat. Interestingly, they found that STRs were more common among the Hymenoptera and represented a higher percentage of the genome than in any of the other orders of insects examined. The Hymenoptera also differed from other orders in that di-nucleotide repeats were the most abundant type of repeat – between 2 and 5 times more frequent than tri-nucleotide repeats. For the aphid Acyrthosiphon pisum, di-nucleotide repeats were about half as abundant as in their examined hymenopteras. Contrary to our results for C. juglandicola, they also found that tri-nucleotide repeats were the most abundant type of repeat unit.
We announce a novel approach for using NGS technologies in conjunction with several popular software packages to identify polymorphic STRs. This approach allows the rapid and cost-effective development of 15 polymorphic STRs for T. pallidus and 12 for C. juglandicola.
To identify and test STR markers, we used an NGS approach. Sequencing libraries for T. pallidus were created by pooling twenty individuals of T. pallidus reared from filbert aphids collected in Bethel, Oregon, United States into a sample labeled “Hazelnut”, and twenty individuals of T. pallidus reared from walnut aphids collected in Tehran, Iran into a sample labeled “Walnut”. DNA was then extracted from each pooled sample using a Qiagen DNeasy Blood & Tissue Kit (Qiagen) with the following modification. To reduce the amount of residual salt in the extract, critical for NGS applications, we performed the AW1 and AW2 washes twice each, followed by an additional spin step to remove any residual AW2 buffer. This was followed by standard elution with the AE buffer. Sequencing libraries for C. juglandicola were created by pooling 20 individuals of C. juglandicola collected in Upper Lake, California, United States, into a sample labeled “US”, and 20 individuals of C. juglandicola collected in Parnac, France into a sample labeled “France”. DNA was then extracted from each pooled sample using the Qiagen Gentra-PureGene DNA Extraction Kit (Qiagen). Concentrations of nucleic acids for all extracts were then quantified with a ND-1000 NanoDrop® (NanoDrop Technologies, Inc.) and concentrations of double stranded DNA were measured using the Qubit® dsDNA HS Assay kit (Life Technologies Corp.). Sequencing libraries for each T. pallidus extract were created using the Nextera™ DNA Sample Prep Kit (Illumina, Inc.) as per instructions, and each library was constructed using a different Illumina barcoding primer. Sequencing libraries for each C. juglandicola extract were created using the PrepX™ ILM DNA Library Kit (Wafergen Biosystems, Inc.) at The Functional Genomics Laboratory of the University of California Berkeley and each library was constructed using a different Illumina barcoding primer. Sequencing libraries were examined for fragment length distribution and concentrations using a 2100 Expert Bioanalyzer (Agiliant Technologies), and a KAPA Biosystems Library Quantification Kit (KAPA Biosystems). Each species’ libraries were then pooled together, and sequenced independently each using a single run of an Illumina HiSeq2000 (Illumina, Inc.) sequencer at the Vincent J. Coates Genomics Sequencing Laboratory, University of California Berkeley.
Summary statistics representing the sequence results from the Illumina HiSeq2000 run were calculated using the FASTX-Toolkit  and this program was then used to filter low quality reads. Individual Illumina sequencing reads were then assembled into contigs using the de novo assembly program Velvet 1.1.06  with a kmer length of 65 for T. pallidus and 67 for C. juglandicola. We then used MSATCOMMANDER  and Phobos  to identify di-, tri-, tetra-, and penta-nucleotide repeat patterns with their default settings. We then compared these results to those identified with iMSAT, our novel python program (https://sourceforge.net/projects/imsat/). iMSAT uses a “.vcf” report file of polymorphic sites generated from mapping NGS sequencing reads to a genome assembly using BWA  and SAMtools . Both BWA and SAMtools are widely used for the identification and analysis of single nucleotide polymorphisms [40–42].
iMSAT uses an interactive command-line interface (see Figure 1 for a graphical representation). The first user prompt asks for the locations of the alignment and “.vcf” files as well as the formatting of the alignment file. iMSAT can process alignment files with both traditional “.FASTA preprocessing” formatted sequence data (i.e. one line beginning with a “>” followed by the sequence name, and a second line with the sequence data) or a tab-delimited format (i.e. one line with both sequence name and sequence data separated by a tab). Our program subsequently filters the “indel” data from the “.vcf” report, and searches for all polymorphic sites that represented di-, tri-, tetra-, and penta-nucleotide STRs that were greater than five repeat units in length. The user is prompted as to whether or not they would like a separate list of polymorphic STR markers that are “fixed” in one of their target populations. The program then produces a “.FASTA preprocessing” formatted file identifying the location of the polymorphic STR in the sequence title and 300 bp of both the leading and trailing sequence strands to allow for the production of primers. Sequence information for all primers used, including fluorescent label are available in Additional file 1.
For T. pallidus, we tested the program as if whole genome assembly was being used. To do so, we combined the contig sequences generated by Velvet  into one continuous DNA sequence strand with the union of two contig sequences being differentiated by the addition of 100 “N” base pairs. The addition of these “N” base pairs ensured that when we could exclude any potential STR markers that would be artificially created when we joined the separate consensus sequences. For C. juglandicola we tested the program using the raw output from Velvet  where all 474,388 contigs were represented in traditional FASTA preprocessing formatting. For both species we then used the “vcf” report generated using BWA  and SAMtools  to target polymorphic STRs.
To validate this approach, we filtered the data to only include those repeat regions that were; a) tri-nucleotide repeats, b) were composed of high-quality reads based on the “vcf” file, and c) had no “N” base calls within 300 base pairs of the repeat region to allow for primer construction. Though the majority of candidate STRs were di-nucleotide repeats, we selected tri-nucleotide repeats because of the known problems associated with scoring di-nucleotide STRs caused by “stutters” . For each species, we then selected the 20 tri-nucleotide candidate markers with the greatest number of repeat units. Primer pairs for all markers were generated using Primer3  as implemented in Geneious 5.6.2 . To ease multiplexing, primers were designed to be at least 20 base pairs in length and to have an optimal annealing temperature of 57°C.
To test the candidate markers, DNA was extracted from 91 female T. pallidus reared from three species of aphid (C. juglandicola and Panaphis juglandis on walnut, and M. coryli on filbert) from nine different localities (Table 4), and from 46 female C. juglandicola from four different localities (Table 4) using the Qiagen DNeasy Blood & Tissue Kit (Qiagen). Non-labeled oligonucelotide primers were used to test and optimize the conditions of each of the 20 candidate regions for each species through standard PCR protocols and the amplified fragments were sequenced at the DNA Sequencing Facility of the University of California Berkeley. For candidate markers that were consistently amplified, fluorescent-labeled primers compatible with the GeneScan™ 600 LIZ size standard (Life Technologies) were used. PCR conditions were then re-optimized for the fluorescent-labeled primers. For both species markers were amplified using one of two PCR protocols signified by their primary annealing temperature (Ta 57 or Ta 50). For Ta 57 an initial denaturation for 5 min at 95°C was followed by 35 cycles of 95°C for 1 min, 57°C for 1.5 min, 72°C for 1 min, followed by a 10 min extension period at 72°C. For Ta 50, a touchdown protocol was used with the following profile: an initial denaturation for 5 min at 95°C, followed by 14 cycles of 95°C for 1 min, 57°C for 1.5 min, and 72°C for 1 min where the annealing temperature decreased 0.5°C every cycle, followed by 30 cycles with an annealing temperature of 50°C, and a 10 min extension period at 72°C.
Fragment lengths were measured in comparison to the GeneScan™ LIZ® 600 Size Standard v. 2.0 (Life technologies) using an Applied Biosystems 3730XL (Life Technologies) at the DNA Sequencing Facility at the University of California Berkeley, and scored using the Microsatellite Plug-in for Geneious 5.6.2 . The number of alleles per locus (k), averaged observed (Ho) and expected (He) heterozygosity, deviations from Hardy-Weinberg equilibrium (HWE), and presence of linkage disequilibrium (LD) between loci were examined using GenePop 4.2 [46, 47].
Availability of supporting data
The data sets supporting the results of this article are available in the NCBI data repository. Raw sequence reads from the Illumina HiSeq runs for Trioxys pallidus and Chromphis juglandicola have the following Accession Numbers: SAMN03020618 - SAMN03020621 and can be found at http://www.ncbi.nlm.nih.gov/biosample/.
Sequences of SSR loci for Trioxys pallidus have the following Accession Numbers: KC477413 - KC477427 and can be found at http://www.ncbi.nlm.nih.gov/nuccore/.
Sequences of SSR loci for Chrompahis juglandicola have the following Accession Numbers: KJ939575 - KJ939587 and can be found at http://www.ncbi.nlm.nih.gov/nucore/.
The python script and supporting information are available in the SourceForge source code repository at http://sourceforge.net/projects/imsat/.
Mardis ER: The impact of next-generation sequencing technology on genetics. Trends Genet. 2008, 24 (3): 133-141. 10.1016/j.tig.2007.12.007.
Metzker ML: Applications of next-generation sequencing; sequencing technologies - the next generation. Nat Rev Genet. 2010, 11 (1): 31-46. 10.1038/nrg2626.
Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.
Ekblom R, Galindo J: Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity. 2011, 107 (1): 1-15. 10.1038/hdy.2010.152.
Gardner MG, Fitch AJ, Bertozzi T, Lowe AJ: Rise of the machines - recommendations for ecologists when using next generation sequencing for microsatellite development. Mol Ecol Resour. 2011, 11 (6): 1093-1101. 10.1111/j.1755-0998.2011.03037.x.
Keller D, Jung E, Holderegger R: Development of microsatellite markers for the wetland grasshopper Stethophyma grossum. Conserv Genet Resour. 2012, 4 (2): 507-509. 10.1007/s12686-011-9586-1.
Bai X, Zhang W, Orantes L, Jun T-H, Mittapalli O, Mian MAR, Michel AP: Combining Next-Generation Sequencing strategies for rapid molecular resource development from an invasive aphid species, Aphis glycines. PLoS One. 2010, 5 (6): e11370-10.1371/journal.pone.0011370.
Cerna K, Straka J: Identification of 37 microsatellite loci for Anthophora plumipes (Hymenoptera: Apidae) using next generation sequencing and their utility in related species. Eur J Entomol. 2012, 109 (2): 155-160. 10.14411/eje.2012.020.
López-Uribe MM, Santiago CK, Bogdanowicz SM, Danforth BN: Discovery and characterization of microsatellites for the solitary bee Colletes inaequalis using Sanger and 454 pyrosequencing. Apidologie. 2013, 44 (2): 163-172. 10.1007/s13592-012-0168-3.
Castoe TA, Poole AW, Koning APJ, Jones KL, Tomback DF, Oyler-McCance SJ, Fike J, Lance SL, Streicher JW, Smith EN, Pollock DD: Rapid microsatellite identification from Illumina paired-end genomic sequencing in two birds and a snake. PLoS ONE. 2012, 7: e30953-10.1371/journal.pone.0030953.
Castoe TA, Poole AW, Gu W, de Koning APJ, Daza M, Smith EN, Pollock DD: Rapid identification of thousands of microsatellite loci for the copperhead snake (Agkistrodon contortrix) from modest amounts of 454 shotgun genome sequence. Mol Ecol Resour. 2010, 10: 341-347. 10.1111/j.1755-0998.2009.02750.x.
Faircloth BC: MSATCOMMANDER: detection of microsatellite repeat arrays and automated, locus-specific primer design. Mol Ecol Resour. 2008, 8 (1): 92-94. 10.1111/j.1471-8286.2007.01884.x.
Dereeper A, Argout X, Billot C, Rami J-F, Ruiz M: SAT, a flexible and optimized Web application for SSR marker development. BMC Bioinformatics. 2007, 8: 465-10.1186/1471-2105-8-465.
Edgar RC, Myers EW: PILER: identification and classification of genomic repeats. Bioinformatics. 2005, 21: I152-I158. 10.1093/bioinformatics/bti1003.
Jewell E, Robinson A, Savage D, Erwin T, Love CG, Lim GAC, Li X, Batley J, Spangenberg GC, Edwards D: SSRPrimer and SSR Taxonomy Tree: Biome SSR discovery. Nucleic Acids Res. 2006, 34: W656-W659. 10.1093/nar/gkl083.
Kofler R, Schloetterer C, Lelley T: SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics. 2007, 23 (13): 1683-1685. 10.1093/bioinformatics/btm157.
Kraemer L, Beszteri B, Gaebler-Schwarz S, Held C, Leese F, Mayer C, Poehlmann K, Frickenhaus S: STAMP: Extensions to the STADEN sequence analysis package for high throughput interactive microsatellite marker design. BMC Bioinformatics. 2009, 10: 465-
Li Q, Wan J-M: SSRHunter: Development of a local searching software for SSR sites. Yichuan. 2005, 27 (5): 808-810.
Thurston MI, Field D: Msatfinder: detection and characterisation of microsatellites. 2005, Distributed by the authors at http://www.bioinformatics.org/groups/?group_id=469. CEH Oxford, Mansfield Road, Oxford OX1 3SR
Mayer C: Phobos 3.3.11. http://www.rub.de/spezzoo/cm/cm_phobos.htm>. In.; 2006–2010
Du L, Li Y, Zhang X, Yue B: MSDB: A user-friendly program for reporting distribution and building databases of microsatellites from genome sequences. J Hered. 2013, 104 (1): 154-157. 10.1093/jhered/ess082.
Miller MP, Knaus BJ, Mullins TD, Haig SM: SSR_pipeline: A bioinformatic infrastructure for identifying microsatellites from paired-end Illumina high-throughput DNA sequencing data. J Hered. 2013, 104 (6): 881-885. 10.1093/jhered/est056.
Lim KG, Kwoh CK, Hsu LY, Wirawan A: Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance. Brief Bioinform. 2013, 14 (1): 67-81. 10.1093/bib/bbs023.
Jun TH, Mian MAR, Freewalt K, Mittapalli O, Michel AP: Development of genic-SSRs markers from soybean aphid sequences generated by high-throughput sequencing of cDNA library. J Appl Entomol. 2012, 136 (8): 614-625. 10.1111/j.1439-0418.2011.01697.x.
van den Bosch R, Frazer BD, Davis CS, Messenger PS, Hom R: Trioxys pallidus: an effective new walnut aphid parasite from Iran. Calif Agric. 1970, 24 (11): 8-10.
van den Bosch R, Hom R, Matteson P, Frazer BD, Messenger PS, Davis CS: Biological control of the walnut aphid in California: impact of the parasite. Trioxys pallidus Hilgardia. 1979, 47 (1): 1-13. 10.3733/hilg.v47n01p001.
van den Bosch R, Schilinger EI, Hagen KS: Initial field observations in California on Trioxys pallidus (Haliday) a recently introduced parasite of the walnut aphid. J Econ Entomol. 1962, 55 (6): 857-862.
Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.
Silva PI, Martins AM, Gouvea EG, Pessoa-Filho M, Ferreira ME: Develoment and validation of microsatellite markers for Brachiaria ruziziensis obtained by partial genome assembly of Illumina single-end reads. BMC Genomics. 2013, 14 (17): 9-
Zalapa JE, Cuevas H, Zhu H, Steffan S, Senalik D, Zeldin E, McCown B, Harbut R, Simon P: Using Next-Generation Sequencing approaches to isolate simple sequence repeat (SSR) loci in the plant sciences. Am J Bot. 2012, 99 (2): 193-208. 10.3732/ajb.1100394.
Andrés JA, Bogdanowicz SM: Isolating microsatellite loci: looking back, looking ahead. Molecular Methods for Evolutionary Genetics. Edited by: Orgogozo V, Rockman MV. 2011, New York: Springer, 211-232.
McEwen JR, Vamosi JC, Rogers SM: Rapid isolation and cross-amplification of microsatellite markers in Plectritis congesta (Valerianaceae) with 454 sequencing. Am J Bot. 2011, 98 (12): e369-e371. 10.3732/ajb.1100160.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup’ GPDP: The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.
Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
Hoffman JI, Nichols HJ: A novel approach for mining polymorphic microsatellite markers in silico. PLoS One. 2011, 6 (8): e23283-10.1371/journal.pone.0023283.
Behura SK, Severson DW: Genome-wide comparative analysis of simple sequence coding repeats among 25 insect species. Gene. 2012, 504 (2): 226-232. 10.1016/j.gene.2012.05.020.
Li Y-C, Korol AB, Fahima T, Beiles A, Nevo E: Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol Ecol. 2002, 11 (12): 2453-2465. 10.1046/j.1365-294X.2002.01643.x.
Pannebakker BA, Niehuis O, Hedley A, Gadau J, Shuker DM: The distribution of microsatellites in the Nasonia parasitoid wasp genome. Insect Mol Biol. 2010, 19: 91-98.
Gordon A: FASTX-toolkit - FASTA/FASTQ preprocessing tools. 2009, http://hannonlab.cshl.edu/fastx_toolkit/,
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20 (9): 1297-1303. 10.1101/gr.107524.110.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKeena A, Fennell TJ, Kernystsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011, 43 (5): 491-498. 10.1038/ng.806.
Nielsen R, Paul JS, Albrechtsen A, Song YS: Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011, 12 (6): 443-451. 10.1038/nrg2986.
DeWoody JA, Nason JD, Hipkins VD: Mitigating scoring errors in microsatellite data from wild populations. Mol Ecol Notes. 2006, 6 (4): 951-957. 10.1111/j.1471-8286.2006.01449.x.
Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and for biologist programmers. Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by: Krawetz S, Misener S. 2000, Totowa, NJ: Humana Press, 365-386.
Geneious v 5.6.2 created by Biomatters. Available from http://www.geneious.com
Raymond M, Rousset F: GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. J Heredity. 1995, 86: 248-249.
Rousset F: Genepop’007: a complete reimplementation of the Genepop software for Windows and Linux. Mol Ecol Resour. 2008, 8: 103-106. 10.1111/j.1471-8286.2007.01931.x.
Funding for this study was provided by an EPA-STAR grant awarded to JCA. This work could not have been completed without the technical assistance of J Lozier, B Ort, S Ramirez and N Tsutsui. This work used the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley, supported by NIH S10 Instrumentation Grants S10RR029668 and S10RR027303 where de novo genome assembly was performed. We thank all of our collaborators who have aided in the collection of T. pallidus and C. juglandicola individuals, as well as two anonymous reviewers whose comments and suggestions greatly improved the manuscript.
The authors declare that they have no competing interests.
JCA and NJM conceived of the study, participated in the design, wrote the manuscript, and read and approved the final version. JCA collected the molecular data.
Jeremy C Andersen and Nicholas J Mills contributed equally to this work.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Andersen, J.C., Mills, N.J. iMSAT: a novel approach to the development of microsatellite loci using barcoded Illumina libraries. BMC Genomics 15, 858 (2014). https://doi.org/10.1186/1471-2164-15-858