- Research article
- Open Access
Billions of basepairs of recently expanded, repetitive sequences are eliminated from the somatic genome during copepod development
© Sun et al.; licensee BioMed Central Ltd. 2014
- Received: 29 August 2013
- Accepted: 3 March 2014
- Published: 11 March 2014
Chromatin diminution is the programmed deletion of DNA from presomatic cell or nuclear lineages during development, producing single organisms that contain two different nuclear genomes. Phylogenetically diverse taxa undergo chromatin diminution — some ciliates, nematodes, copepods, and vertebrates. In cyclopoid copepods, chromatin diminution occurs in taxa with massively expanded germline genomes; depending on species, germline genome sizes range from 15 – 75 Gb, 12–74 Gb of which are lost from pre-somatic cell lineages at germline – soma differentiation. This is more than an order of magnitude more sequence than is lost from other taxa. To date, the sequences excised from copepods have not been analyzed using large-scale genomic datasets, and the processes underlying germline genomic gigantism in this clade, as well as the functional significance of chromatin diminution, have remained unknown.
Here, we used high-throughput genomic sequencing and qPCR to characterize the germline and somatic genomes of Mesocyclops edax, a freshwater cyclopoid copepod with a germline genome of ~15 Gb and a somatic genome of ~3 Gb. We show that most of the excised DNA consists of repetitive sequences that are either 1) verifiable transposable elements (TEs), or 2) non-simple repeats of likely TE origin. Repeat elements in both genomes are skewed towards younger (i.e. less divergent) elements. Excised DNA is a non-random sample of the germline repeat element landscape; younger elements, and high frequency DNA transposons and LINEs, are disproportionately eliminated from the somatic genome.
Our results suggest that germline genome expansion in M. edax reflects explosive repeat element proliferation, and that billions of base pairs of such repeats are deleted from the somatic genome every generation. Thus, we hypothesize that chromatin diminution is a mechanism that controls repeat element load, and that this load can evolve to be divergent between tissue types within single organisms.
- Chromatin diminution
- Genome size
- Transposable elements
- Germline-soma differentiation
Chromatin diminution is the programmed deletion of DNA from presomatic cell or nuclear lineages during development, producing single organisms that contain two dramatically different nuclear genomes. Phylogenetically diverse taxa undergo some form of chromatin diminution or chromosome elimination, including representatives from the ciliates, nematodes, copepods, lampreys, and hagfish [1–7]. Cyclopoid copepods excise more than an order of magnitude more sequence than other taxa; depending on species, ~12 to 74 Gb of DNA are lost from the presomatic cell lineage at an early embryonic cleavage division each generation [8, 9]. This excision produces somatic genomes that are only ~1% – 20% of the size of the germline genomes. Post-diminuted somatic genomes are comparable in size to genomes of related copepods that lack chromatin diminution, whereas the pre-diminuted germline genomes are 5- to 75-fold larger [8, 9].
Different functions for chromatin diminution have been proposed for different taxa, reflecting specific properties of the excised DNA (when known). For example, reduction in gene misexpression in the soma through excision of germline-specific genes has been proposed for the lamprey and the nematode Ascaris[5, 10, 11]. In contrast, transposon elimination from the soma, which enables high-level somatic gene expression based on somatic polyploidy without high levels of TE transcription, has been proposed for ciliates [7, 12, 13]. In copepods, maintenance of high rRNA gene copy number during embryogenesis has been proposed [14, 15], as has removal of short, non-functional sequences [16, 17]. Because comparatively little is known about the sequence of excised DNA in copepods, however, most copepod-specific hypotheses for the functional significance of chromatin diminution focus on the effects of higher DNA content in the germline, irrespective of sequence. Such hypotheses include modulation of egg size, body size, developmental rate, and cell division rate, all of which are correlated with genome size in copepods [18–21]. To date, the DNA excised from copepods has not been analyzed using large-scale genomic data [16, 17], hindering in-depth study of the processes underlying germline genomic gigantism in this clade, as well as the functional significance of chromatin diminution.
Here, we used high-throughput genomic shotgun data and qPCR to characterize the germline and somatic genomes of Mesocyclops edax, a freshwater cyclopoid copepod species with a diploid germline genome of ~15 Gb and a diploid somatic genome of ~3 Gb . Chromosomal fragmentation and excision of DNA occurs at the 5th cleavage division; 12 Gb of sequence is lost from each of the 15 presumptive somatic cells during anaphase. Eliminated DNA appears to be heterochromatic and located primarily in the distal half of chromosomes; chromosome number is constant before and after diminution (n = 14) . Because genome expansion (in the absence of polyploidy) typically reflects the accumulation of transposable elements in eukaryotes, we targeted our study towards identifying and classifying repetitive DNA in the germline and somatic genomes. We show that the majority of both M. edax genomes consists of repetitive sequences that are either 1) verifiable transposable elements (TEs), or 2) non-simple repeat elements of likely TE origin. These identified repeats explain more than 90% of the difference in size between germline and somatic genomes. Excised repeats are a non-random sample of the total germline genomic repeat landscape; younger repeats are disproportionately excised from the somatic genome, whereas older repeats are disproportionately retained. Similarly, high-frequency DNA transposon and LINE superfamilies are disproportionately excised from the somatic genome, whereas high-frequency LTR retrotransposons are disproportionately retained. Sequence divergences of repeat elements in both genomes show a skew towards younger elements, indicating recent/ongoing repeat proliferation. Taken together, our results suggest that germline genome expansion in M. edax reflects explosive proliferation of repeat elements (including known TEs), and that billions of base pairs of such repeats are deleted from the somatic genome every generation. Thus, we hypothesize that chromatin diminution is a mechanism that controls repeat element load, and that this load can evolve to be highly divergent between tissue types within single organisms.
Repeat content of germline and somatic genomes
Summary statistics from shotgun 454 sequencing of Mesocyclops edax genomes
Average read length
Verification of repeat copy number differences between germline and soma with qPCR
Characterization of unknown repeat sequences
Because unknown repeats make up a substantial portion of both genomes, we performed 1) Open reading frame (ORF) identification, and 2) homology searches to further characterize such sequences. Of the 28,285 unknown repeats present in the M. edax repeat library, 4,900 (17.3%) and 2,844 (10.0%) were found to contain ORFs > 100 nucleotides in length using relaxed and strict criteria, respectively. Although most such ORFs were between 100 and 200 base pairs in length, a few were > 600 bp long (Additional file 1). Thus, many of the unknown repeats have the potential to encode proteins. However, only 351 of the 28,285 unknown repeat sequences (1.2%) retrieved hits from the protein database of Daphnia pulex, the only crustacean with a sequenced genome to date. Such hits comprised 137 Daphnia genes in total and included both hypothetical and known genes (Additional file 2). We note that, because Daphnia and Mesocyclops last shared a common ancestor hundreds of millions of years ago, this analysis would only detect genes exhibiting long-term sequence conservation, likely reflecting high levels of functional constraint. Thus, although many unknown repeats may encode proteins, most do not appear to be multi-copy endogenous genes exhibiting functional conservation across long evolutionary timescales. Taken together, the presence of ORFs and the lack of recognizable endogenous genes, as well as the non-simple/non-tandem (within the length of a single read) sequence composition, are consistent with the unknown repeats being of TE origin, although more extensive sequencing efforts will be required to classify these repeats conclusively.
Repeat element sequence divergence in somatic and germline genomes
Characterization of sequences excised during chromatin diminution
Results of relative frequency comparisons of repeat superfamilies in the somatic and germline genomes of M. edax
H0 : f g = f s (P-value)
Our analyses of the first large-scale genomic sequence datasets from copepod germline and soma show that the vast majority of the excised sequences are non-simple, non-tandem (within the length of a single read) repeats of known TE or likely TE origin. We show that 1) young, rapidly proliferating repeat elements underlie the gigantism of adult germline genomes of M. edax, and 2) the younger elements are disproportionately excised from the somatic genome. Thus, M. edax represents an unusual host/genetic parasite system; some of the deleterious effects of a recent/ongoing explosion in transposition activity (e.g. replication burden) are largely restricted to the germline and substantially reduced in the soma through physical removal of the transposing sequences. Previous smaller scale restriction digestion, cloning, and PCR-based studies of chromatin diminution in copepods identified several short simple repeats, several TEs, rRNAs, and dispersed motifs among the excised DNA sequences [15–17]. However, given the tiny fraction of the genome analyzed in these studies, a comprehensive picture of the targets of chromatin diminution has remained unknown. Our results confirm that all of these types of sequences are eliminated from the somatic genome. However, our conclusions about the relative contributions of these different sequence classes to the excised DNA differ, likely reflecting the different size scales of our datasets. For example, based on analyses of ~200 sequences from pre- and postdiminuted M. edax genomes, with read lengths of 500 – 750 bps, McKinnon and Drouin  concluded that TEs were not disproportionately deleted during chromatin diminution. The TE sequences in that study were from the LTR/BEL, LTR/Copia, LTR/Gypsy, and DNA/Ginger superfamilies. Our results corroborate their finding that none of these superfamilies is disproportionately excised from the soma; however, other TEs (e.g. DNA-hAT, DNA-MuDR, LINE-L1), as well as unknown repeats, not represented in McKinnon and Drouin’s dataset but represented in our shotgun data are disproportionately excised (Figure 5).
Germline genomic gigantism and chromatin diminution have co-evolved independently several times in the copepod genera Mesocyclops, Metacyclops, Megacyclops, and Cyclops[8, 26], suggesting that this mechanism of controlling repeat element load has been re-deployed numerous times within the cyclopoid copepods. Based on our current results, we propose the following hypothetical evolutionary scenarios for the origins of this trait: 1) Some copepod lineages begin to experience elevated germline repeat element proliferation, which may reflect genomic invasion of novel elements (note high levels of unknown repeats; Figure 2), decreased efficiency of cellular machinery targeting existing repeats, and/or decreased efficacy of selection against repeats because of strong genetic drift (e.g. a demographic history that includes a population bottleneck). 2) Lineages begin to eliminate these young repeats from the somatic genome by the introduction and non-homologous repair of double-strand breaks at germline-soma differentiation . Because these sequences are eliminated synchronously, an active, enzymatic mechanism is more likely than one mediated by ectopic recombination. Older, more divergent repeat elements (including those that predate the germline genome expansion, as well as those that diverge beyond recognition by excision machinery) remain in the somatic genome (Figure 6). Alternatively, the introduction and non-homologous repair of double-strand breaks at germline-soma differentiation may initially have targeted other sequences not identified in our dataset (e.g. single- or low-copy protein-coding genes), and TEs targeting such regions tagged for excision subsequently proliferated in the germline genome. 3) In either case, the germline and somatic genomes are now free to diverge from one another in repeat element content. In copepods with chromatin diminution, the germline genome grows to 5 – 75 times larger than the somatic genome.
Repeat elements and the repeat suppression/elimination machinery of their hosts co-evolve (e.g. TEs and small RNA-mediated silencing) . The outcome of this coevolutionary dynamic — the amount of repeat sequence present in a genome — is dictated in part by the relative strengths of 1) genetic drift, which can cause fixation of slightly deleterious repeats, and 2) selection for/against repeat insertions and efficacy of the cell’s repeat silencing machinery [28–30]. Our results show that, in M. edax, genomes with both high and low repeat content exist within single individuals. Germline genomic gigantism in copepods has been hypothesized to sequester the nucleotides necessary to provision a rapidly differentiating embryo, particularly in nitrogen- and phosphorous-poor environments . This function is a candidate target for selection to increase (or to maintain non-adaptively increased) germline repeat load in M. edax, despite the high costs that accompany high TE load (e.g. susceptibility to gain-of-function mutations, deleterious TE insertions, and ectopic recombination; disruption of cellular processes; and potential energetic costs) [31–34]. Alternatively, the high levels of germline repeats may serve no function in M. edax, but simply reflect selection’s inability to regulate these “selfish” sequences . In either case, elimination of the majority of repeats through chromatin diminution minimizes TE-associated costs to the soma and maintains ancestral somatic developmental rate, nucleus size, and cell size .
Molecular mechanisms of repeat removal in copepods remain unexplored. However, our results in M. edax, coupled with information about chromatin diminution in ciliates, suggest a candidate mechanism. In ciliates, the DNA excised from the micronucleus is derived from TEs, and its excision is directed by small-RNA-mediated modification of heterochromatin, followed by excision of this tagged heterochromatin by domesticated transposases [7, 13, 36, 37]. This represents the evolution of a novel function in ciliates for the RNAi-mediated transposon silencing machinery widely shared by eukaryotes . In M. edax, the germline genome has extremely high levels of TE and putatively TE-derived sequences (Figure 2), and their sequence divergence distributions show extensive recent and ongoing proliferation (Figure 4). High-frequency DNA transposon and LINE superfamilies, as well as the younger copies of all types of elements, are disproportionately excised from the somatic genome (Figures 5, and 6). Taken together, these results suggest that the RNAi transposon silencing machinery in the ancestral lineages leading to extant taxa with chromatin diminution may have changed function, from transposition silencing (the ancestral condition) to tagging specific TE sequences for excision (the derived condition). Alternatively, repeat removal may occur by a completely different mechanism in M. edax and the other cyclopoid copepods that possess chromatin diminution. Further analyses of the molecular processes underlying repeat removal in copepods will demonstrate whether phylogenetically distant taxa have evolved to eliminate repeats by the same, or different, mechanisms.
Our study used genomic shotgun sequence data from the cyclopoid copepod Mesocyclops edax to identify the repetitive sequences present in the species’ massive germline genome, but eliminated from the somatic genome by chromatin diminution. We show that excised repeats are a non-random sample of the total germline genome; younger repeats, as well as high-frequency DNA transposon and LINE superfamilies, are disproportionately excised from the somatic genome. Sequence divergence of repeat elements indicates recent/ongoing repeat proliferation. Taken together, our results suggest that germline genome expansion in M. edax reflects explosive proliferation of repeat elements (including known TEs), and that billions of base pairs of such repeats are deleted from the somatic genome every generation. Thus, we hypothesize that chromatin diminution is a mechanism that controls repeat element load, and that this load can evolve to be highly divergent between tissue types within single organisms.
Specimen information, tissue dissection, and genomic DNA extraction
Genomic DNA from somatic and germline cell lineages was obtained from adult female Mesocyclops edax Forbes, 1890 collected from Lake Shenandoah, Rockingham Co., Virginia, USA, (38°37’N; 78°83’W). Vouchers of this population are deposited in the National Museum of Natural History (USNM1121766). Somatic cells were obtained from first antennae severed at the 1st or 2nd antennal segment (Figure 1A). Cells containing the undiminuted (i.e. germline) genome were obtained from sacs of 8-cell embryos, which the adult carries external to its body (Figure 1B). Most embryos were first observed at the 4-cell stage and then followed until they had completed more than one half of their development at the 8-cell stage. Samples were preserved in 95% ethanol. Genomic DNA was extracted using the PROMEGA Total RNA isolation kit, omitting the DNAse step, as, in our experience, this kit yields high quality DNA from minute samples. From ~50 females, ~300 and 700 somatic cells and ~2000 and 1000 germ cells were harvested for shotgun sequencing and qPCR analyses, respectively.
Shotgun library creation and sequencing
Because obtaining genomic DNA from microscopic copepods is very labor intensive, the Repli-G Mini Kit (Qiagen) was used to amplify each genome with minimal bias for coding and satellite DNA . Libraries were prepared using the Nextera DNA Sample Prep Kit (Roche Titanium compatible). Libraries were sequenced on the Roche 454-FLX platform with XLR 70 Titanium reagents. Three-quarters and ¼ of a plate were allocated to germline and soma, respectively, because of genome size differences. Coverage was 0.76% and 1.6% for the germline and somatic genomes, respectively. DNA amplification, library preparation, and sequencing were performed by the University of Idaho Institute for Bioinformatics and Evolutionary Studies (IBEST) Genomics Resources Core facility.
Initial data processing
Shotgun reads from germline and soma were checked for sequencing artifacts generated by the presence of multiple beads and a single template in emPCR drops, which can skew estimates of repeat element abundance [39, 40]. The online 454 Replicate Filter (http://microbiomes.umms.med.umich.edu/replicates/) was used to filter out exact replicate reads (cutoff 0.99, length requirement 1.0 and initial base pair match 3). In total, 2.1% and 2.5% of shotgun reads were removed from the germline and somatic datasets, respectively, as potential artifacts.
Mining and classification of repeat elements
The pipeline included the following steps: 1) RepeatScout  was used to identify de novo repeats from shotgun reads, with default parameters. Shotgun reads from germline and soma were combined to increase the sequencing coverage. Identified repeats that were ≤ 50 nt or > 50% low-complexity were removed to construct a filtered RepeatScout library. 2) Shotgun reads were assembled into contigs using Newbler (http://contig.wordpress.com/table-of-contents/) with default parameters. To identify contigs that represent TEs, contig sequences were used as queries to BLASTx against the amino acid sequences of TE-encoded proteins (http://www.repeatmasker.org/RepeatProteinMask.html#database), with an e-value threshold cutoff of 1e-10. Contigs representing TEs were refined manually to avoid assembly artifacts . 3) Repeats identified in step 1 were classified using BLASTn against the TE contigs identified in step 2, with an e-value cutoff of 1e-5. Remaining unclassified repeats were used as queries to tBLASTx against the most recent release of RepBase (RepBase16.12), with an e-value threshold of 1e-5. 4) All classified repeats and TE contigs, along with the unclassified repeats (referred to as “unknown repeats” hereafter), were combined to produce an M. edax-specific repeat library. Using this library, we masked the shotgun reads from both germline and soma with RepeatMasker (http://www.repeatmasker.org/). Simple repeats were identified using the Tandem Repeats Finder  module in RepeatMasker.
Experimental verification of repeat content difference between germline and soma
For each of four types of repeats, we picked one representative family that our bioinformatic analyses identified as differing in copy number between germline and soma — DNA/hAT (DNA transposon), LINE/L1 (non-LTR retrotransposon), LTR/Gypsy (LTR retrotransposon), and an ORF-containing unknown repeat. For each family, qPCR primers were designed based on the RepeatScout-generated consensus sequence using Primer3Plus (http://primer3plus.com/cgi-bin/dev/primer3plus.cgi). We amplified a ~200 bp diagnostic fragment from each repeat family using standard PCR from somatic genomic DNA collected and extracted in 2012 using custom primers (Additional file 4). These fragments were cloned into the PCR®4-TOPO® vector using a TOPO® TA Cloning® Kit (Invitrogen Life Science Technologies). Plasmids that contained an insert were linearized by digestion with SpeI (New England Biolabs), gel-purified, and quantified with a NanoDrop spectrophotometer (Thermo Scientific). Plasmid number per unit volume was estimated based on the molecular weight of plasmid plus insert. A 10-fold dilution series (500,000 to 50 plasmids/μl) of the cloned fragments was used to generate standard curves. Germline and somatic sample DNA was diluted to 50 genomes/ul based on the estimated genome size (germline, 15 Gb; soma, 3 Gb). Quantitative PCR assays were performed in duplicate using 1 ul of sample DNA or cloned DNA (for standard curves) with a LightCycler® 480 Real-Time PCR System following manufacturer’s protocols (Roche Applied Science).
Characterization of unknown repeat sequences
The longest potential ORFs were identified using two sets of parameters: relaxed and strict. For the relaxed analysis, an ORF was defined as a sequence >100 nucleotides in length that begins with a start codon (ATG) and does not have an in-frame stop codon (TAA, TAG or TGA). For the strict analysis, an ORF was defined as a sequence >100 nucleotides in length that begins with a start codon and ends with an in-frame stop codon. Custom Perl scripts were used to implement these searches. To test whether any unknown repeats are derived from known protein-coding genes, unknown repeats were used as queries to BLASTx against the protein database for Daphnia pulex (the closest relative of M. edax with a fully sequenced and well-annotated genome) , with an e-value threshold of 1e-5. BLASTx against the manually curated UniProt database was also performed, although this search returned slightly fewer hits; thus, only Daphnia results are presented.
Repeat element sequence divergence in somatic and germline genomes
To summarize global historical patterns of repeat proliferation and deletion in M. edax, we estimated divergences of repeats from their ancestral sequences (i.e. ancestor–descendant pairwise divergences) in the somatic and germline genomes. For each repeat element family, we first used RepeatScout  to estimate a consensus sequence from all genomic repeat copies; this consensus represents the repeat family’s master gene (i.e. ancestral) sequence. Next, we used this set of consensus sequences to mask the germline and somatic datasets with RepeatMasker (http://www.repeatmasker.org/), generating ancestor–descendant pairwise alignment files. We note that some of the sequences identified as confamilial may have been generated by multiple active master genes that differed from one another in sequence. In such a case, a single consensus sequence would not accurately represent the ancestral state of all individual element copies; some of the differences between “ancestor” and descendant sequences would correspond to substitutions that occurred along the active master element lineage. This would produce upwardly biased estimates of sequence divergence. To minimize this problem, we parsed the RepeatMasker-generated pairwise alignments to identify substitutions that likely occurred along active master element lineages, and we excluded these sites from our estimates of sequence divergence. This was done using the following steps. First, we collected all of the pairwise alignments between a given consensus sequence and its descendants in both germline and somatic genomes. To increase accuracy, we limited our analyses to repeat elements ≥ 80% identical to their respective consensus sequences, with a minimum overlap of 100 bp. Second, based on each pairwise alignment, we recorded information about substitutions (i.e. ancestral and derived base pair, position). Third, we compared such substitution information across all pairwise alignments of a given consensus sequence to identify substitutions shared among multiple repeat copies. The probability of two identical substitutions occurring independently in different repeat copies is low; thus, groups of two or more substitutions shared by two or more repeat copies likely reflect substitutions that occurred along the ancestral master element lineage. We did not consider substitutions occurring at CpG sites because of a high probability of convergence at such rapidly evolving sites. Fourth, we updated the pairwise divergence estimates from RepeatMasker, excluding the sites corresponding to shared substitutions. From these refined pairwise alignments, we estimated sequence divergence, correcting with the Jukes-Cantor model of nucleotide substitution. This process was automated using in-house Perl scripts, which are available upon request. We plotted the fraction of total shotgun reads as a function of sequence divergence from consensus; assuming equal rates of nucleotide substitution, such sequence divergence distributions are a proxy for age distributions. We tested for a difference between the sequence divergence cumulative probability distributions of germline and soma using a Kolmogorov-Smirnov test.
To verify that ~1% shotgun coverage gives sufficiently accurate estimates of sequence divergence distributions to support the main conclusions of this study (i.e. high proportions of elements minimally diverged from consensus), we randomly extracted 1% of the human genome five times in read lengths comparable to our 454 data using in-house C code (available upon request), calculated repeat element sequence divergence distributions as for the M. edax data, and compared them to published results for the fully sequenced genome.
Frequency estimates of repeat elements
The maximum likelihood (ML) estimates are the observed relative frequencies, or sample proportions, and . The null hypothesis H0: f s = f g leads to a pooled ML estimate .
Characterization of sequences excised during chromatin diminution
Given two pairs of parameter estimates and , we define the likelihood ratio, . Joint confidence regions for f s and f g can be constructed from likelihood ratios by finding a suitable κ2 and requiring that . The cut-off κ2 was determined using simulations of binomial random variables to represent repeat frequencies with success probabilities based on the observed relative frequencies and . For each simulation, the likelihood ratio was computed and the cutoff κ2 was chosen so that 5% of the simulated values exceeded the cut-off. For relative frequencies greater than 1e-4, the cut-off computed by simulation agreed with cut-offs computed using the usual χ2 -distribution with 2 degrees of freedom based on a normal approximation. Hypothesis tests with a rejection criterion X2 ≥ κ2 were also performed where the test statistic based on the likelihood ratio where are ML estimates assuming H0 and are ML estimates assuming H1. Again, the threshold κ2 was chosen from simulations using the observed relative frequencies to generate binomial random variables in order to obtain the desired probability of a Type 1 error.
We computed 95% joint-confidence regions for f s and f g . Because these regions are two-dimensional, we created one-dimensional summaries and depicted them in the form of confidence intervals. When these intervals do not overlap, the joint confidence region clearly excludes f g = f s . However, even when the intervals do overlap, the actual confidence region may be such that f g = f s would still be excluded. Thus, we performed hypothesis tests using the likelihood ratio for each superfamily with a null hypothesis H0: f g = f s , which is equivalent to , and an alternative hypothesis H1: f g ≠ f s , which is equivalent to with a cutoff κ2 chosen for a Type 1 error with probability α = 0.05. In addition, P-values for the test statistic χ2 were computed based on the percentile ranking of χ2 relative to simulated data.
For each superfamily that rejected the null hypothesis, we further constructed a one-sided confidence interval for the ratio f g /f s . For a superfamily with , we considered the family of null hypotheses and alternative hypotheses for λ < 1 and identified the smallest value λ such that the null hypothesis is rejected with α = 0.05. For a superfamily with , the analogous tests were performed with opposite inequalities and λ > 1. The resulting critical relative frequencies correspond to the end-point of a one-sided 95% confidence interval for the ratio . To demonstrate the sensitivity of this end-point, the process was repeated with α = 0.01, corresponding to producing one-sided 99% confidence intervals for the ratio.
For each repeat subfamily, (including classified and unknown repeats), we first computed the sample relative frequency within each genome (# subfamily reads / total reads). We then computed the ratio of relative frequencies between germline and soma as a measure of relative abundance. This measure yields information about excision during chromatin diminution; >> 1 indicates repeats that are disproportionately excised from the soma, whereas << 1 indicates repeats that are disproportionately retained in the soma. Repeat subfamilies were binned as 1) failing to reject the null hypothesis = 1, 2) favoring the alternative hypothesis >> 1 (i.e. disproportionately excised from the somatic genome), or 3) favoring the alternative hypothesis << 1 (i.e. disproportionately retained in the somatic genome). Hypothesis tests were likelihood ratio tests performed using a Poisson approximation to the binomial distribution (χ2 with 1 df) with a p-value cutoff of p = 0.05. For each bin, we plotted the cumulative probability of element age (percent divergence from consensus) and tested for significant differences between the mean divergence percentages for 1) = 1 and >> 1, and 2) = 1 and << 1 using 2-sample z-statistics.
This work was supported by [NSF-DEB 1021489 to RLM], [NSF-DMS 0734284 to DBW], and by Colorado State University. Some analyses were performed on the CSU ISTeC HPC System [NSF CNS-0923886]. This material is also based in part upon work supported by the National Science Foundation under Cooperative Agreement No. DBI-0939454. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. M. Settles and the Institute for Bioinformatics and Evolutionary Studies at the University of Idaho provided excellent assistance with 454 data collection. G. Drouin shared unpublished work, critical discussion, and comments on an earlier version of this manuscript. A. Bannigan provided assistance with photography. M. Frahry created the program to subsample the human genome at 1% coverage.
- Coyne RS, Chalker DL, Yao MC: Genome downsizing during ciliate development: nuclear division of labor through chromosome restructuring. Annu Rev Genet. 1996, 30 (1): 557-578. 10.1146/annurev.genet.30.1.557.PubMedView ArticleGoogle Scholar
- Tobler H: The differentiation of germ and somatic cell lines in nematodes. Res Prob Cell Diff. 1986, 13: 1-69.Google Scholar
- Beermann S: The diminution of heterochromatic chromosomal segments in Cyclops (Crustacea, Copepoda). Chromosoma. 1977, 60 (4): 297-10.1007/BF00292858.PubMedView ArticleGoogle Scholar
- Smith JJ, Antonacci F, Eichler EE, Amemiya CT: Programmed loss of millions of base pairs from a vertebrate genome. Proc Natl Acad Sci. 2009, 106 (27): 11212-11217. 10.1073/pnas.0902358106.PubMed CentralPubMedView ArticleGoogle Scholar
- Smith JJ, Baker C, Eichler EE, Amemiya CT: Genetic consequences of programmed genome rearrangement. Curr Biol. 2012, 22 (16): 1524-1529. 10.1016/j.cub.2012.06.028.PubMed CentralPubMedView ArticleGoogle Scholar
- Kubota S, Ishibashi T, Kohno S: A germline restricted, highly repetitive DNA sequence in Paramyxineatami: an interspecifically conserved, but somatically eliminated, element. Mol Gen Genet. 1997, 256 (3): 252-256. 10.1007/s004380050567.PubMedView ArticleGoogle Scholar
- Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury J-M, Denby Wilkes C, Garnier O, Labadie K, Lauderdale BE, Le Mouël A, Marmignon A, Nowacki M, Poulain J, Prajer M, Wincker P, Meyer E, Duharcourt S, Duret L, Betermier M, Sperling L: The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLoS Genet. 2012, 8 (10): e1002984-10.1371/journal.pgen.1002984.PubMed CentralPubMedView ArticleGoogle Scholar
- Rasch EM, Wyngaard GA: Genome sizes of cyclopoid copepods (Crustacea): evidence of evolutionary constraint. Biol J Linn Soc. 2006, 87 (4): 625-635. 10.1111/j.1095-8312.2006.00610.x.View ArticleGoogle Scholar
- Wyngaard GA, Rasch EM, Connelly BA: Unusual augmentation of germline genome size in Cyclops kolensis (Crustacea, Copepoda): further evidence in support of a revised model of chromatin diminution. Chrom Res. 2011, 19 (7): 911-923. 10.1007/s10577-011-9234-3.PubMedView ArticleGoogle Scholar
- Wang J, Mitreva M, Berriman M, Thorne A, Magrini V, Koutsovoulos G, Kumar S, Blaxter Mark L, Davis Richard E: Silencing of germline-expressed genes by DNA elimination in somatic cells. Dev Cell. 2012, 23 (5): 1072-1080. 10.1016/j.devcel.2012.09.020.PubMed CentralPubMedView ArticleGoogle Scholar
- Etter A, Bernard V, Kenzelmann M, Tobler H, Muller F: Ribosomal heterogeneity from chromatin diminution in Ascaris lumbricoides. Science. 1994, 265 (5174): 954-10.1126/science.8052853.PubMedView ArticleGoogle Scholar
- Hunter DJ, Williams K, Cartinhour S, Herrick G: Precise excision of telomere-bearing transposons during Oxytricha fallax macronuclear development. Genes Dev. 1989, 3 (12b): 2101-2112. 10.1101/gad.3.12b.2101.PubMedView ArticleGoogle Scholar
- Chalker DL, Yao MC: DNA elimination in ciliates: transposon domestication and genome surveillance. Annu Rev Genet. 2011, 45: 227-246. 10.1146/annurev-genet-110410-132432.PubMedView ArticleGoogle Scholar
- Standiford DM: The development of a large nucleolus during oogenesis in Acanthocyclops vernalis (Crustacea, copepoda) and its possible relationship to chromatin diminution. Biol Cell. 1988, 63 (1): 35-40. 10.1111/j.1768-322X.1988.tb00739.x.PubMedView ArticleGoogle Scholar
- Zagoskin MV, Marshak TL, Mukha DV, Grishanin AK: Chromatin diminution process regulates rRNA gene copy number in freshwater copepods. Act Nat. 2010, 2 (4): 52-Google Scholar
- Drouin G: Chromatin diminution in the copepod Mesocyclops edax: diminution of tandemly repeated DNA families from somatic cells. Genome. 2006, 49: 657-665. 10.1139/G06-022.PubMedView ArticleGoogle Scholar
- McKinnon C, Drouin G: Chromatin diminution in the copepod Mesocyclops edax: elimination of both highly repetitive and non-highly repetitive DNA. Genome. 2013, doi:10.1139/gen-2012-0097Google Scholar
- Gregory TR, Hebert PD: The modulation of DNA content: proximate causes and ultimate consequences. Genome Res. 1999, 9 (4): 317-324.PubMedGoogle Scholar
- Gregory TR, Hebert PDN, Kolasa J: Evolutionary implications of the relationship between genome size and body size in flatworms and copepods. Heredity. 2000, 84 (2): 201-208. 10.1046/j.1365-2540.2000.00661.x.PubMedView ArticleGoogle Scholar
- Wyngaard GA, Gregory TR: Temporal control of DNA replication and the adaptive value of chromatin diminution in copepods. J Exp Zool. 2001, 291 (4): 310-316. 10.1002/jez.1131.PubMedView ArticleGoogle Scholar
- Wyngaard GA, Rasch EM, Manning NM, Gasser K, Domangue R: The relationship between genome size, development rate, and body size in copepods. Hydrobiologia. 2005, 532 (1): 123-137.View ArticleGoogle Scholar
- Rasch EM, Wyngaard GA, Connelly BA: Heterochromatin endoreduplication prior to gametogenesis and chromatin diminution during early embryogenesis in Mesocyclops edax (Copepoda: Crustacea). J Morphol. 2008, 269 (4): 387-397. 10.1002/jmor.10576.PubMedView ArticleGoogle Scholar
- Sun C, Shepard DB, Chong RA, Arriaza JL, Hall K, Castoe TA, Feschotte C, Pollock DD, Mueller RL: LTR retrotransposons contribute to genomic gigantism in plethodontid salamanders. Genome Biol Evol. 2012, 4 (2): 168-183. 10.1093/gbe/evr139.PubMed CentralPubMedView ArticleGoogle Scholar
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.View ArticleGoogle Scholar
- Novick PA, Basta H, Floumanhaft M, McClure MA, Boissinot S: The evolutionary dynamics of autonomous non-LTR retrotransposons in the lizard Anolis carolinensis shows more similarity to fish than mammals. Mol Biol Evol. 2009, 26 (8): 1811-10.1093/molbev/msp090.PubMedView ArticleGoogle Scholar
- Terpilowska B: Quantitative investigations of DNA in early embryogenesis of Acanthocyclops gigas (Claus). Zool Polon. 1971, 21 (2): 163-175.Google Scholar
- Blumenstiel JP: Evolutionary dynamics of transposable elements in a small RNA world. Trends Genet. 2011, 27 (1): 23-31. 10.1016/j.tig.2010.10.003.PubMedView ArticleGoogle Scholar
- Lynch M: The Origins of Genome Architecture. 2007, Sunderland, MA: Sinauer Associates, IncGoogle Scholar
- Lynch M, Bobay L-M, Catania F, Gout J-F, Rho M: The repatterning of eukaryotic genomes by random genetic drift. Annu Rev Genom Hum Genet. 2011, 12 (1): 347-366. 10.1146/annurev-genom-082410-101412.View ArticleGoogle Scholar
- Agren JA, Wright SI: Co-evolution between transposable elements and their hosts: a major factor in genome size evolution?. Chrom Res. 2011, 19 (6): 777-786. 10.1007/s10577-011-9229-0.PubMedView ArticleGoogle Scholar
- Petrov DA, Aminetzach YT, Davis J, Bensasson D, Hirsch AE: Size matters: non-LTR retrotransposable elements and ectopic recombination in Drosophila. Mol Biol Evol. 2003, 20 (6): 880-892. 10.1093/molbev/msg102.PubMedView ArticleGoogle Scholar
- Nuzhdin SV: Sure facts, speculations, and open questions about the evolution of transposable element copy number. Genetica. 1999, 107: 129-137. 10.1023/A:1003957323876.PubMedView ArticleGoogle Scholar
- De Gobbi M, Viprakasit V, Hughes JR, Fisher C, Buckle VJ, Ayyub H, Gibbons RJ, Vernimmen D, Yoshinaga Y, de Jong P, Cheng JF, Rubin EM, Wood WG, Bowden D, Higgs DR: A regulatory SNP causes a human genetic disease by creating a new transcriptional promoter. Science. 2006, 26 (312): 1215-1217.View ArticleGoogle Scholar
- Cavalier-Smith T: Economy, speed and size matter: evolutionary forces driving nuclear genome miniaturization and expansion. Ann Bot. 2005, 95: 147-175. 10.1093/aob/mci010.PubMed CentralPubMedView ArticleGoogle Scholar
- Orgel LE: Selfish DNA: the ultimate parasite. Nature. 1980, 284 (5757): 604-10.1038/284604a0.PubMedView ArticleGoogle Scholar
- Kataoka K, Mochizuki K: Programmed DNA elimination in Tetrahymena: a small RNA-mediated genome surveillance mechanism. RNA Infrastructure and Networks. Edited by: Collins L. 2011, New York: Springer, 156-173. vol. 722View ArticleGoogle Scholar
- Baudry C, Malinsky S, Restituito M, Kapusta A, Rosa S, Meyer E, Bétermier M: PiggyMac, a domesticated piggyBac transposase involved in programmed genome rearrangements in the ciliate Paramecium tetraurelia. Genes Dev. 2009, 23 (21): 2478-2483. 10.1101/gad.547309.PubMed CentralPubMedView ArticleGoogle Scholar
- Hosono S, Faruqi AF, Dean FB, Du Y, Sun Z, Wu X, Du J, Kingsmore SF, Egholm M, Lasken RS: Unbiased whole-genome amplification directly from clinical samples. Genome Res. 2003, 13 (5): 954-964. 10.1101/gr.816903.PubMed CentralPubMedView ArticleGoogle Scholar
- Gomez-Alvarez V, Teal TK, Schmidt TM: Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009, 3: 1314-1317. 10.1038/ismej.2009.72.PubMedView ArticleGoogle Scholar
- Niu B, Fu L, Sun S, Li W: Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinforma. 2010, 11: 187-10.1186/1471-2105-11-187.View ArticleGoogle Scholar
- Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics. 2005, 21 (suppl 1): i351-i358. 10.1093/bioinformatics/bti1018.PubMedView ArticleGoogle Scholar
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999, 27 (2): 573-580. 10.1093/nar/27.2.573.PubMed CentralPubMedView ArticleGoogle Scholar
- Colbourne JK, Pfrender ME, Gilbert D, Thomas WK, Tucker A, Oakley TH, Tokishita S, Aerts A, Arnold GJ, Basu MK, Bauer DJ, Caceres CE, Carmel L, Casola C, Choi J, Detter JC, Dong Q, Dusheyko S, Eads BD, Frohlich T, Geiler-Samerotte KA, Gerlach D, Hatcher P, Jogdeo S, Krijgsveld J, Kriventseva EV, Kultz D, Laforsch C, Lindquist E, Lopez J, et al: The ecoresponsive genome of Daphnia pulex. Science. 2011, 331 (6017): 555-561. 10.1126/science.1197761.PubMed CentralPubMedView ArticleGoogle Scholar
- Casella G, Berger R: Statistical Inference. 2002, Duxbury: Pacific Grove, CAGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.