Distinct chromatin features characterize different classes of repeat sequences in Drosophila melanogaster
© Krassovsky and Henikoff; licensee BioMed Central Ltd. 2014
Received: 13 October 2013
Accepted: 30 January 2014
Published: 6 February 2014
Repeat sequences are abundant in eukaryotic genomes, but many are excluded from genome assemblies. In Drosophila melanogaster, classical studies of repeat content suggested variability between individuals, but they lacked the precision of modern high-throughput sequencing technologies. Genome-wide profiling of chromatin features such as histone tail modifications and DNA-binding proteins relies on alignment to the reference genome and hence excludes highly repetitive sequences.
By analyzing repeat libraries, sequence complexity and k-mer counts, we determined the abundances of different D. melanogaster repeat classes in flies in two public datasets, DGRP and modENCODE. We found that larval DNA was depleted of all repeat classes relative to adult and embryonic DNA, as expected from the known depletion of repeat-rich pericentromeric regions during polytenization of larval tissues. By applying a method that is independent of alignment to the genome assembly, we found that satellite repeats associate with distinct H3 tail modifications, such as H3K9me2 and H3K9me3 for short repeats and H3K9me1 for 359 bp repeats. Short AT-rich repeats, however, are depleted of nucleosomes and hence of all histone modifications and associated chromatin proteins.
The total repeat content and association of repeat sequences with chromatin modifications can be determined despite repeats being excluded from genome assemblies, revealing unexpected distinctions in chromatin features based on sequence composition.
Keywords: DNA satellites, Next-generation sequencing, ChIP-seq, Histone modification
A large fraction of almost all eukaryotic genomes consists of tandemly repeated sequences, often called satellite DNA. Because satellite DNA repeat units are short and vary little if at all in sequence, they are mostly excluded from genome assemblies. This is unfortunate, because some satellite sequences are known to have important functions. For example, centromeres (chromosome loci that form microtubule attachment sites during mitosis) are known to be positioned on repeat sequences in many organisms. Another example is telomeres, the sequences that cap chromosome ends. Changes in satellite sequences can also play roles in evolution and disease. Satellite sequences might have other functions as well: for example, they have been shown to be important for meiotic recombination.
Whole-genome sequencing has become a widely used tool for the discovery of genetic variation, nucleosome position, chromatin modifications and DNA-binding proteins. Analysis of such experiments relies on the alignment of individual sequence segments to the reference genome. Because it is impossible to uniquely align satellite sequences, they are usually excluded from the genome assembly. Thus, alternative methods for analysis of repeats in sequencing data are required.
Several groups have used methods independent of the alignment to the reference genome to analyze repeat content. Parker et al. used direct counting of telomere repeat sequences to estimate changes of telomere repeats in tumor cells. Hayden and Willard used k-mer analysis and repeat library alignment to describe canine centromere sequences. In this study we adapt these approaches with some modifications to study the repeat content of Drosophila melanogaster. Drosophila is a particularly attractive model because of previous extensive characterization of its satellite repeat content by methods other than sequencing. This provides a unique opportunity to verify recovery of satellites in sequencing data.
Drosophila repeat families were initially discovered by detection of satellite bands that form during CsCl equilibrium gradient centrifugation [6, 7]. When centrifuged at high force, CsCl creates a gradient of Cs+ ions. While moving through this gradient, long DNA fragments separate into distinct bands based on their buoyant density, which depends primarily on GC content. Tandem arrays of short repeat units typically have biased base composition [e.g. (AAGAG)n is 60% A + T and 40% G + C], so that they effectively separate from long DNA fragments of average base composition comprising single-copy DNA. The bands can then be extracted, cloned and sequenced. Three of four such bands were shown to consist of short (5 to 10 bp) repeats, while the fourth one consisted of longer (359 bp) repeat sequences. Both classes of these tandem repeats are highly abundant in the genome and map primarily to centromeric and pericentric regions of chromosomes. Another class of repeats is derived from transposable elements, found in all eukaryotic genomes. These are DNA sequences that have inserted copies of themselves into new positions in the genome, and are interspersed with single-copy or satellite sequences. Transposons have been shown to comprise ~15% of the Drosophila melanogaster genome.
Most of the repeated sequences are packaged into heterochromatin, a condensed and mostly transcriptionally silent chromatin identified cytologically as being more refractile and more densely staining. Heterochromatin can be divided into constitutive heterochromatin, which is permanently condensed and is found in pericentric and telomeric regions, and facultative heterochromatin, gene-containing chromatin where condensation is associated with repression of gene expression. It is thought that this condensation and gene repression are achieved partly by posttranslational histone modifications, which are known to be enriched at different functional elements. For example, H3K4me3 is found at promoters of active genes in a variety of organisms. In flies it has been shown that constitutive heterochromatin is associated with H3K9me2, while repressed genes in facultative heterochromatin are enriched in H3K27me3.
Associations of specific DNA-binding proteins with histone modifications are currently studied by chromatin immunoprecipitation followed by sequencing (ChIP-seq). Analysis of such experiments has thus far been limited to single-copy sequences and interspersed repeats. Studies of tandemly repeated sequences in heterochromatin by ChIP-seq are impeded by the inability to uniquely align repeat-containing reads to the reference genome.
Recently two large-scale initiatives generated comprehensive D. melanogaster sequencing datasets. One is the Drosophila Genetic Reference Panel (DGRP), which included sequencing of 200 inbred fly lines generated from wild-caught flies. Data generated by DGRP were used to study phenotype-genotype associations and evolution of the subset of repeat sequences that could be mapped uniquely. The other large-scale initiative is modENCODE, which included ChIP-seq experiments for a number of DNA-binding proteins and histone tail modifications from different developmental stages of Drosophila. In this study we used these publicly available resources to analyze the repeat content of the D. melanogaster genome and to identify histone tail modifications and DNA-binding proteins associated with satellites.
Results and discussion
Strategy for quantifying repeats
Repeat libraries were constructed for short repeats (FlyBase), 359 bp repeats and transposons (FlyBase) by extraction from existing genome assemblies including unassembled contigs. A complexity score similar to the DUST score used by the BLAST program to exclude low-complexity sequences was calculated for each sequenced fragment. Short repeat units have low sequence complexity. This means that counting the sequences flagged as low complexity allows us to estimate the percentage of short repeats independently of alignment programs.
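As an illustration of how a DUST-style score separates short tandem repeats from more complex sequence, the following is a minimal sketch. It assumes a simplified, unscaled triplet-based score and is not the exact formula used by BLAST or Prinseq. The function name `dust_like_score` is hypothetical. In this sketch a higher score means more repeated triplets, that is, lower sequence complexity.

```python
from collections import Counter


def dust_like_score(seq: str) -> float:
    """Simplified DUST-style score (assumption: unscaled, single window).

    Counts every overlapping trinucleotide and sums c * (c - 1) / 2 over
    the triplet counts, divided by (number of triplets - 1). Sequences
    built from a few repeated triplets score high; sequences whose
    triplets are all distinct score 0.
    """
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    n = len(triplets)
    if n < 2:
        return 0.0
    counts = Counter(triplets)
    return sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)


# A tandem array of the AAGAG repeat unit versus a short sequence
# in which no trinucleotide occurs twice.
satellite_like = "AAGAG" * 4
non_repetitive = "ACGTTGCA"
print(dust_like_score(satellite_like) > dust_like_score(non_repetitive))
```

Reads whose score exceeds a chosen cutoff would be counted as low complexity; the cutoff itself depends on how the score is scaled by the tool in use.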
Another alternative to alignment to reference libraries is k-mer analysis. A k-mer is a sequence of length k found in the sequencing dataset. For example, the 5-mer AAGAG is one of the 5-mers found in the sequence AAGAGAAGAG. By counting all k-mers we can find sequences that occur very frequently in the genome. Satellites result in k-mers that have a much higher count than the rest of the genome. K-mer and low complexity analyses provide an estimation of the completeness of repeat libraries and find abundant sequences not included in the libraries.
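The 5-mer example above can be reproduced with a few lines of code. This is a minimal sketch of direct k-mer counting on a single string, not the Jellyfish implementation used in this study; the function name `count_kmers` is hypothetical.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a single sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("AAGAGAAGAG", 5)
# The sequence contains six overlapping 5-mers, five of them distinct;
# AAGAG is the only one that occurs twice.
print(counts.most_common(1))  # [('AAGAG', 2)]
```

In a real dataset the counts would be accumulated over all reads, and k-mers derived from satellites would stand out by their very high counts.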
To find the overall fold enrichment of satellites in the ChIP-seq experiments we calculated fold enrichment of each k-mer present at least twice in both ChIP and input samples. We then grouped k-mers by their enrichment value and classified each k-mer as being in one of the repeat families, in the euchromatin, or not previously mapped, by aligning to each of the repeat libraries and the whole annotated genome. Classification and grouping of k-mers in this manner allows visualization of enrichment or depletion of a particular repeat family.
[Table: Abundance of different repeat families in DGRP flies. Columns: % of total genome (mean between lines); standard deviation between lines; p-value (significance of the difference between lines). Rows include: total low complexity; low complexity, not short repeats; 359 bp repeats.]
All repeat classes are depleted in larval relative to adult and embryonic development stages
An unusual feature of the Drosophila genome is that repeat content changes in some cells of the organism. The best-known example is the polytene chromosomes of larval salivary glands [16, 17]. During larval stages of development, rapid growth of the organism requires high levels of gene expression. To efficiently accommodate this need, cells undergo multiple rounds of replication without mitoses or cell divisions. This process results in banded polytene chromosomes, which are composed of multiple precisely aligned copies of sister chromatids and homologs. Chromosomes in most larval tissues are polytene, with salivary gland chromosomes being the most extremely polytenized. Polytene chromosomes are depleted of heterochromatin, especially satellite sequences. Another cell type that is depleted of satellite DNA is nurse cells, which produce yolk that is stored in the egg and consumed during embryonic development. Unlike salivary glands and other larval tissues, nurse cell nuclei lack polytene structure.
[Table: Abundance of different repeat families in the fly genome by developmental stage (modENCODE dataset). Values: % of total genome (median), and p-values that indicate significance of the difference between the developmental stages. Surviving labels: Embryo (12–14 hr); 359 bp repeats.]
The percentage of short repeats has been previously estimated at 5-10% using Cot curves and at 18-22% [7, 8] using CsCl gradients for embryos of the Oregon R wild-type lab strain. In the datasets examined, the short repeat content is 3% on average for DGRP flies, 12% for embryos and adult heads and 3% for larvae of modENCODE flies. The lower repeat content in DGRP samples might be explained by the use of whole flies in these experiments: nurse and follicle cells, which make up most of the mass of healthy adult female flies in uncrowded cultures, will lower the satellite repeat content.
Multiple experimental replicates available in both modENCODE and DGRP datasets present an opportunity to examine the reliability of modern sequencing methods for recovery of repeated sequences. On the one hand, the abundance of short repeats varies only slightly between distinct DGRP fly lines and replicate datasets derived from the same fly line. On the other hand, there is considerable variation between modENCODE replicate datasets. This variability is unlikely to be due to alignment bias because we obtained similar estimates using an alternative method of finding short repeat sequences based on sequence complexity. High variation might be due to random loss or amplification of repeats by PCR during Illumina library preparation and flow cell cluster generation. PCR is known to have biases in amplification due to composition. Alternatively, variation might be due to real sequence heterogeneity among individuals of the same laboratory strain, as has been previously suggested for satellite sequences. This possibility is consistent with our observation that short repeat recovery is less variable for embryos than adult flies. It is possible that fewer adult flies are needed for the recovery of material necessary for constructing Illumina sequencing libraries compared to embryos, where there are fewer cells per individual, making inter-individual heterogeneity more evident in adult than embryo samples. Another possibility is that the differences arose from the DNA preparation method used. Unlike DGRP samples where DNA was extracted from flies directly, modENCODE samples were prepared for ChIP by extraction of cross-linked chromatin. Sonication of chromatin as opposed to sonication of naked DNA might produce additional variability between the experiments.
The most frequent k-mers in the fly genome are known short repeats and transposons
In both DGRP and modENCODE samples we found repeat sequences that were not previously identified as abundant repeats. We might attribute this discrepancy to differences in the method of DNA extraction or to evolutionary changes that occurred in the Oregon R strain, which has been maintained in various laboratories for several decades. Abundances of specific repeat sequences among the DGRP fly lines are very similar, indicating strong homogeneity of the satellites among individual flies in the wild outbred population. We did not find four of the previously reported repeats ("AACAA", "AATAAC", "AATAC" and "AATAG") among the top repeats in modENCODE samples, although they are present at very low abundance in both modENCODE and DGRP samples.
Histone H3 modifications are differentially associated with short repeat sequences
Posttranslational histone tail modifications are known to be involved in transposon silencing. Transposons are classified into groups based on their structure and mechanism of transposition. Retrotransposons, which mobilize via an RNA intermediate, are further divided into LTR (long terminal repeats) and non-LTR classes. Previous studies investigated whether some transposon families are preferentially associated with specific histone modifications. For example, a screen of 100 transposon sequences by microarray analysis found that retrotransposons have higher enrichment in H3K9me2 than other elements. In contrast, roo retrotransposons, which are abundant in euchromatin, were found to have lower H3K9me2 association. Four families of LTR retrotransposons (roo, tirant, 412 and F) were also screened for preferential association with H3K9me2 and H3K27me3 in different strains of D. melanogaster and were found to have large variations in enrichment between the strains. However, our systematic investigation based on classification of Illumina sequencing reads both by k-mer analysis and direct counting of reads mapped to different transposon groups detected no preferential association of LTR, non-LTR or IR transposon classes with histone modifications (Additional file 1: Table S1).
All three HP1 proteins localize to transposons
AT-rich repeats are depleted of nucleosomes
We noticed that even for the H3K9me3 and H3K9me2 heterochromatic marks that are enriched in short repeats, a few specific repeat sequences are depleted of these marks. This prompted us to look for a common property of short repeats that are depleted of heterochromatic marks and HP1 proteins. We first classified each repeat by the length of the repeat unit but detected no consistent trends. However, when we classified repeats by AT content, we observed that the short repeats that are depleted of HP1 family proteins and histone modifications are also very AT rich (Figures 7 and 8, right panels; Additional file 1: Table S2).
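Classifying repeat units by AT content amounts to a simple per-unit calculation. The sketch below is illustrative only: the function name `at_fraction` is hypothetical, and the repeat units are examples taken from sequences named elsewhere in this article rather than the full list that was analyzed.

```python
def at_fraction(repeat_unit: str) -> float:
    """Fraction of A and T bases in a repeat unit (0.0 to 1.0)."""
    unit = repeat_unit.upper()
    if not unit:
        raise ValueError("repeat unit must not be empty")
    return sum(base in "AT" for base in unit) / len(unit)


# Example repeat units mentioned in the text, ordered from most to
# least AT rich.
units = ["AATAT", "AATATAT", "AATAC", "AAGAG"]
for unit in sorted(units, key=at_fraction, reverse=True):
    print(unit, at_fraction(unit))
```

Pure AT units such as AATAT and AATATAT give a fraction of 1.0, whereas AAGAG gives 0.6, which matches the 60% A + T composition quoted for (AAGAG)n.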
Highly AT-rich DNA has a narrow minor groove and reduced flexibility, which disfavors the tight wrapping of the double helix around the nucleosome core and results in preferential exclusion of nucleosomes [24, 33]. As the (AATAT)n, (AATATAT)n and other long arrays of pure AT sequences are predicted to be especially stiff, they would be expected to prevent nucleosome formation. Alternatively, nucleosomes might be actively excluded by competing DNA-binding proteins. For example, D1 protein is a highly abundant nuclear protein that is preferentially bound to the narrow minor groove of AATAT Drosophila satellite arrays [35, 36]. With ~1 D1 protein per 10 nucleosomes, and ~0.7% of the genome consisting of AATAT-containing satellites, there is enough chromatin-bound D1 to occupy ~1/2 of all the AATAT sites [(1 D1/10 nucleosomes)/(30 AATAT sites in a 150 bp span) = 0.0033, or ~0.33% of the genome, compared with the ~0.7% made up of AATAT-containing satellites]. These alternative possibilities are not mutually exclusive, as expansion of an AATAT array would both exclude nucleosomes and promote D1 binding, consistent with the possibility that D1 protein has evolved to package stiff AT-rich satellites.
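The bracketed D1 estimate can be checked with a few lines of arithmetic. The sketch below uses only the approximate figures quoted in the text (1 D1 per 10 nucleosomes, a 150 bp nucleosome span, 5 bp AATAT sites, ~0.7% of the genome in AATAT-containing satellites); the variable names are hypothetical, and treating one D1 molecule as covering one 5 bp site is an assumption made for illustration.

```python
NUCLEOSOME_SPAN_BP = 150        # span used in the text's estimate
SITE_LENGTH_BP = 5              # length of one AATAT unit
D1_PER_NUCLEOSOME = 1 / 10      # ~1 D1 protein per 10 nucleosomes
AATAT_GENOME_FRACTION = 0.007   # ~0.7% of the genome

# Number of 5 bp sites that fit in one nucleosome-sized span.
sites_per_span = NUCLEOSOME_SPAN_BP // SITE_LENGTH_BP

# D1 molecules available per 5 bp site, averaged over the genome.
# As a fraction this is about 0.0033, i.e. about 0.33%.
d1_per_site = D1_PER_NUCLEOSOME / sites_per_span

# Share of AATAT sites that this amount of D1 could occupy.
occupancy = d1_per_site / AATAT_GENOME_FRACTION

print(sites_per_span, round(d1_per_site, 4), round(occupancy, 2))
```

The resulting occupancy is a little under one half, in line with the "~1/2 of all the AATAT sites" stated above.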
We have shown that enrichment of repeated sequences can be quantified in ChIP-seq experiments despite these sequences being largely excluded from genome assemblies. The strategy of calculating k-mer enrichment relative to the input allows direct comparison of repeat sequences to single-copy regions of the genome. The strategy presented here can be applied to study other chromatin features known to be located in heterochromatin, for example centromeres.
We also have presented the first analysis of the chromatin landscape of repeat sequences in a genome-wide context. Different heterochromatic regions of D. melanogaster have distinct chromatin features. Satellite sequences associate with specific histone modifications such as H3K9me2 and H3K9me3. All three HP1 homologues are enriched at transposons and do not show preferences for particular types of transposons. AT-rich short repeats are depleted of nucleosomes and hence all histone modifications. We conclude that ChIP-seq datasets can be mined to provide unexpected insights into chromatin landscapes of repetitive sequences.
modENCODE datasets (http://data.modencode.org) used in this study
DGRP datasets used in the study
SRR018517, SRR018518, SRR018519
SRR018574, SRR018575, SRR029943, SRR034277, SRR034278
SRR018579, SRR034281, SRR034282, SRR034283
SRR018287, SRR018288, SRR018289, SRR018290, SRR018291
SRR018582, SRR018583, SRR018584
SRR018591, SRR018592, SRR018593
SRR018292, SRR018293, SRR018294, SRR060098
SRR018295, SRR018296, SRR018297
The short repeats library was downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/dm3/bigZips/chromTrf.tar.gz. It was converted to FASTA format and purged of duplicate entries. The 359 bp library was the one produced by Kuhn et al. and obtained directly from Dr. Gustavo Kuhn.
The transposon library was downloaded from FlyBase r5.48 (ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.48_FB2012_06/fasta/dmel-alltransposon-r5.48.fasta.gz).
Determining repeat abundances
Sequences were mapped to a short repeat and 359 bp repeat library using BWA and to transposons using Novoalign (http://www.novocraft.com). The number of sequences mapped to the library was divided by the total number of sequences to find the percentage abundance.
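The abundance calculation described above is a simple ratio. A minimal sketch follows; the function name `percent_abundance` and the read counts in the example are hypothetical and are not values from this study.

```python
def percent_abundance(mapped_reads: int, total_reads: int) -> float:
    """Percentage of all sequenced reads that mapped to a repeat library."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must be between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads


# Hypothetical example: 30,000 of 1,000,000 reads map to the library.
print(percent_abundance(30_000, 1_000_000))
```

In practice the mapped-read count would be taken from the aligner output for each library, and the total would be the number of reads in the original sequencing file.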
K-mers were obtained using Jellyfish with the command "jellyfish count -m 31 -o output -c 3 -s 10000000 -t 12 -L 2". K-mers were split into quintiles using a custom script and aligned to repeat libraries using BWA (short repeats and 359 bp repeats) and Novoalign (transposons).
Finding low complexity sequences
The percentage of low complexity sequences was found by running Prinseq with the command "perl prinseq-lite.pl -fastq FileName.fastq -verbose -graph_data -out_good null -lc_method dust -lc_threshold 7". This command separates sequences with a complexity score above 7 and records that number in the log file.
K-mer analysis of the ChIP-seq datasets
A k-mer count table was constructed for both Input and ChIP samples using Jellyfish, and the two tables were then merged using a custom R script. For each k-mer, enrichment was calculated by dividing the number of counts in the ChIP dataset by the number of counts in the corresponding Input dataset, and was normalized by multiplying by the ratio of the total number of sequences in the Input sample to that in the ChIP sample. K-mers were then split into 16 groups based on enrichment. K-mer sequences from each group were aligned to repeat libraries and the genome assembly using BWA and Novoalign. The number of k-mers in each group mapped to a particular library was noted and then plotted using an R script. For experiments with two replicates, the median number of k-mers in each bin is shown.
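The enrichment calculation can be sketched as follows. This is an illustration in Python rather than the custom R script used in this study; the function name `kmer_enrichment` and the toy counts are hypothetical. The requirement that a k-mer be present at least twice in both samples follows the description in the Results, and the grouping into 16 bins is left out because the bin boundaries are not specified here.

```python
from typing import Dict


def kmer_enrichment(
    chip_counts: Dict[str, int],
    input_counts: Dict[str, int],
    chip_total: int,
    input_total: int,
    min_count: int = 2,
) -> Dict[str, float]:
    """Normalized ChIP/Input enrichment for each shared k-mer.

    enrichment = (ChIP count / Input count) * (Input total / ChIP total)
    Only k-mers seen at least `min_count` times in both samples are kept.
    """
    scale = input_total / chip_total
    enrichment = {}
    for kmer, chip_n in chip_counts.items():
        input_n = input_counts.get(kmer, 0)
        if chip_n >= min_count and input_n >= min_count:
            enrichment[kmer] = (chip_n / input_n) * scale
    return enrichment


# Toy counts (hypothetical): the ChIP library has twice as many reads
# as the Input library, so raw ratios are halved by the normalization.
chip = {"AAGAG": 40, "ACGTA": 10, "TTTTT": 1}
inp = {"AAGAG": 10, "ACGTA": 10, "TTTTT": 5}
print(kmer_enrichment(chip, inp, chip_total=2000, input_total=1000))
```

A value above 1 indicates enrichment in the ChIP sample relative to Input, and a value below 1 indicates depletion.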
DGRP: Drosophila Genetic Reference Panel
modENCODE: model organism ENCyclopedia Of DNA Elements
HP1: Heterochromatin-associated protein 1
We thank Harmit Malik and Charles Laird for helpful suggestions, Jorja Henikoff for scripting advice and Paul Talbert for comments on the manuscript.
- Yunis JJ, Yasmineh WG: Heterochromatin, satellite DNA, and cell function. Structural DNA of eucaryotes may support and protect genes and aid in speciation. Science. 1971, 174 (4015): 1200-1209.
- Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95 (6): 315-327.
- Parker M, et al: Assessing telomeric DNA content in pediatric cancers using whole-genome sequencing data. Genome Biol. 2012, 13 (12): R113.
- Yamamoto M, Miklos GL: Genetic studies on heterochromatin in Drosophila melanogaster and their implications for the functions of satellite DNA. Chromosoma. 1978, 66 (1): 71-98.
- Hayden KE, Willard HF: Composition and organization of active centromere sequences in complex genomes. BMC Genomics. 2012, 13: 324.
- Brutlag D, et al: Highly repeated DNA in Drosophila melanogaster. J Mol Biol. 1977, 112 (1): 31-47.
- Peacock WJ, et al: The organization of highly repeated DNA sequences in Drosophila melanogaster chromosomes. Cold Spring Harb Symp Quant Biol. 1974, 38: 405-416.
- Lohe AR, Brutlag DL: Multiplicity of satellite DNA sequences in Drosophila melanogaster. Proc Natl Acad Sci U S A. 1986, 83: 696-700.
- Kaminker JS, et al: The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol. 2002, 3 (12): RESEARCH0084.
- Brutlag DL: Molecular arrangement and evolution of heterochromatic DNA. Annu Rev Genet. 1980, 14: 121-144.
- Yasuhara JC, Wakimoto BT: Molecular landscape of modified histones in Drosophila heterochromatic genes and euchromatin-heterochromatin transition zones. PLoS Genet. 2008, 4 (1): e16.
- Ruthenburg AJ, Allis CD, Wysocka J: Methylation of lysine 4 on histone H3: intricacy of writing and reading a single epigenetic mark. Mol Cell. 2007, 25 (1): 15-30.
- Ebert A, et al: Histone modification and the control of heterochromatic gene silencing in Drosophila. Chromosome Res. 2006, 14 (4): 377-392.
- Mackay TF, et al: The Drosophila melanogaster Genetic Reference Panel. Nature. 2012, 482 (7384): 173-178.
- Kuhn GC, et al: The 1.688 repetitive DNA of Drosophila: concerted evolution at different genomic scales and association with genes. Mol Biol Evol. 2012, 29 (1): 7-11.
- Gall JG, Cohen EH, Polan ML: Repetitive DNA sequences in Drosophila. Chromosoma. 1971, 33 (3): 319-344.
- Dickson E, Boyd JB, Laird CD: Sequence diversity of polytene chromosome DNA from Drosophila hydei. J Mol Biol. 1971, 61 (3): 615-627.
- Hammond MP, Laird CD: Chromosome structure and DNA replication in nurse and follicle cells of Drosophila melanogaster. Chromosoma. 1985, 91: 267-278.
- Laird CD, McCarthy BJ: Molecular characterization of the Drosophila genome. Genetics. 1969, 63 (4): 865-882.
- Bate M, Martinez Arias A: The Development of Drosophila melanogaster. 1993, Plainview, N.Y: Cold Spring Harbor Laboratory Press.
- Aird D, et al: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12 (2): R18.
- Altschul SF, et al: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
- Dechering KJ, et al: Distinct frequency-distributions of homopolymeric DNA tracts in different genomes. Nucleic Acids Res. 1998, 26 (17): 4056-4062.
- Struhl K, Segal E: Determinants of nucleosome positioning. Nat Struct Mol Biol. 2013, 20 (3): 267-273.
- Pinheiro I, et al: Prdm3 and Prdm16 are H3K9me1 methyltransferases required for mammalian heterochromatin integrity. Cell. 2012, 150 (5): 948-960.
- Veiseth SV, et al: The SUVR4 histone lysine methyltransferase binds ubiquitin and converts H3K9me1 to H3K9me3 on transposon chromatin in Arabidopsis. PLoS Genet. 2011, 7 (3): e1001325.
- Bannister AJ, et al: Selective recognition of methylated lysine 9 on histone H3 by the HP1 chromo domain. Nature. 2001, 410: 120-124.
- Lachner M, et al: Methylation of histone H3 lysine 9 creates a binding site for HP1 proteins. Nature. 2001, 410: 116-120.
- Canzio D, et al: Chromodomain-mediated oligomerization of HP1 suggests a nucleosome-bridging mechanism for heterochromatin assembly. Mol Cell. 2011, 41 (1): 67-81.
- Smothers JF, Henikoff S: The hinge and chromo shadow domain impart distinct targeting of HP1-like proteins. Mol Cell Biol. 2001, 21: 2555-2569.
- Blattes R, et al: Displacement of D1, HP1 and topoisomerase II from satellite heterochromatin by a specific polyamide. EMBO J. 2006, 25 (11): 2397-2408.
- Perrini B, et al: HP1 controls telomere capping, telomere elongation, and telomere silencing by two different mechanisms in Drosophila. Mol Cell. 2004, 15 (3): 467-476.
- Iyer V, Struhl K: Poly(dA:dT), a ubiquitous promoter element that stimulates transcription via its intrinsic DNA structure. EMBO J. 1995, 14 (11): 2570-2579.
- Vlahovicek K, Kajan L, Pongor S: DNA analysis servers: plot.it, bend.it, model.it and IS. Nucleic Acids Res. 2003, 31 (13): 3686-3687.
- Levinger L, Varshavsky A: Protein D1 preferentially binds A + T-rich DNA in vitro and is a component of Drosophila melanogaster nucleosomes containing A + T-rich satellite DNA. Proc Natl Acad Sci U S A. 1982, 79: 7152-7156.
- Levinger LF: D1 protein of Drosophila melanogaster. Purification and AT-DNA binding properties. J Biol Chem. 1985, 260 (26): 14311-14318.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760.
- Marcais G, Kingsford C: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011, 27 (6): 764-770.
- Schmieder R, Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011, 27 (6): 863-864.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.